This is a discussion on Experiences with extensibility within the Pgsql General forums, part of the PostgreSQL category; --> On 08.01.2008, at 23:20, Joshua D. Drake wrote: > That isn't really an extensibility argument. I was thinking about ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| On 08.01.2008, at 23:20, Joshua D. Drake wrote: > That isn't really an extensibility argument. I was thinking about that too - for me, it still is just an outstanding issue with PostgreSQL. It is incredibly scalable on one machine but it totally sucks when you want more, but not much more. Like, I have a situation where I need multi-master just for availability. Two small servers are good enough for that. But unfortunately with PostgreSQL the whole setup is a major pain in the ... > At least not in my mind. Further I don't know of anyone that can > "easily" do it. You either suffer the possibility of catastrophic > data loss (dolphins) or you suffer guaranteed bank account drainage > (Oracle), or you suffer the willingness of Monopolies (MSSQL). FrontBase. It has an incredibly easy to configure replication and multi master clustering support, is very reliable and can also handle really big databases. The only problem is that the query planner is not as good as PostgreSQL's so you might end up with much worse performance. Depends a bit on the complexity of the database and how "special" your queries are. But if you need something easy to setup, multi-master with just two machines, easy fail-over (done in the JDBC driver) without your application even noticing it - try it. It's free, but not open source. And it's a good product. I use it for some stuff and PostgreSQL for other projects. Just depends on the requirements. cug -- http://www.event-s.net ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Sim Zacks wrote: > We use postgresql because it is open source, we have in-house experience > to deal with it so we don't have any extra support costs and we don't > need the features that are offered in commercial products that > PostGreSQL does not have. We also don't need the speed that commercial > products offer that is missing in PostgreSQL. I use PostgreSQL because it has a ton of features the closed source products don't offer and is generally faster than the closed source solutions. Granted there are scenarios where others are FASTER (SELECT COUNT(*)) but I find that if you are doing those items, you normally have a weird design anyway. Sincerely, Joshua D. Drake ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| On Wed, 09 Jan 2008 00:06:45 -0800 "Joshua D. Drake" <jd@commandprompt.com> wrote: > Granted there are scenarios where others are FASTER (SELECT > COUNT(*)) but I find that if you are doing those items, you > normally have a weird design anyway. > Sincerely, Sincerely, would you make an example of such a bad design? Or did you just mean that count(*) is bad design in postgresql since there are usually better alternatives in postgresql? I'm not joking. I'd like to learn. -- Ivan Sergio Borgonovo http://www.webthatworks.it ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| > -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto > owner@postgresql.org] On Behalf Of Ivan Sergio Borgonovo > Sent: Wednesday, January 09, 2008 1:30 AM > To: pgsql-general@postgresql.org > Subject: [GENERAL] count(*) and bad design was: Experiences with > extensibility > > On Wed, 09 Jan 2008 00:06:45 -0800 > "Joshua D. Drake" <jd@commandprompt.com> wrote: > > > Granted there are scenarios where others are FASTER (SELECT > > COUNT(*)) but I find that if you are doing those items, you > > normally have a weird design anyway. > > > Sincerely, > > Sincerely, would you make an example of such a bad design? A program that estimates cardinality by doing SELECT COUNT(*) is a bad design. Assuming you have the wherewithal to vacuum your tables (or have autovacuum enabled) a query against the system tables will be a much better estimate of cardinality. Now (some may argue) what if we want an _EXACT_ value for COUNT(*)? We had better ask ourselves (in that circumstance) "Am I willing to lock the entire table and scan it?" because that is what will be necessary to get a truly exact value. Otherwise, you can get totals that are wildly off-base if someone is doing a bulk import or deleting a large number of records. So: SELECT reltuples FROM pg_class WHERE relname = <table_name>; Is more often what is really wanted. > Or did you just mean that count(*) is bad design in postgresql since > there are usually better alternatives in postgresql? If you are using COUNT(*) as an existence test, then substitute: WHERE EXISTS(<criteria>) Use the indexes (if possible) by WHERE clause restriction: SELECT count(1) FROM <table_name> WHERE <condition_list> Will use indexes if appropriate. > I'm not joking. I'd like to learn. I think this should be a FAQ because it is a (F)requently (A)sked (Q)uestion. IMO-YMMV. > -- > Ivan Sergio Borgonovo > http://www.webthatworks.it > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Wed, 9 Jan 2008 01:39:34 -0800 "Dann Corbit" <DCorbit@connx.com> wrote: > > On Wed, 09 Jan 2008 00:06:45 -0800 > > "Joshua D. Drake" <jd@commandprompt.com> wrote: > > > Granted there are scenarios where others are FASTER (SELECT > > > COUNT(*)) but I find that if you are doing those items, you > > > normally have a weird design anyway. > > > Sincerely, > > Sincerely, would you make an example of such a bad design? > A program that estimates cardinality by doing SELECT COUNT(*) is a > bad design. Assuming you have the wherewithal to vacuum your > tables (or have autovacuum enabled) a query against the system > tables will be a much better estimate of cardinality. > > Now (some may argue) what if we want an _EXACT_ value for > COUNT(*)? We had better ask ourselves (in that circumstance) "Am I > willing to lock the entire table and scan it?" because that is what > will be necessary to get a truly exact value. Otherwise, you can > get totals that are wildly off-base if someone is doing a bulk > import or deleting a large number of records. Please forgive my naiveness in this field but what does it mean an "exact count" and what other DB means with "an exact count" and how other DB deal with it? How "count" is defined in the SQL standard? Is there a real situation then where you really need the "exact" count? Am I right saying that: select count(*) from ... -- here count may already be different and that: select for update count(*) could be very expensive? Or what would it mean to do a select for update count(*) ...? > I think this should be a FAQ because it is a (F)requently (A)sked > (Q)uestion. After reading your email I think the real problem is not how to do otherwise but understand what count(*) really mean and when and if it is really useful and when it can be avoided. I'd write in the FAQ something in the line of: - What count(*) really does? - When it can be avoided? - When it can't be avoided? In my everyday use of count(*), after your email I can hardly spot a place where I need an exact count. But to better understand and being convinced that using count(*) is bad design I think last question could help a lot. How does count(*) with a where clause perform generally on postgresql compared to other DB? I'd expect it perform as good or better than other DB since now the bottleneck should be how efficiently it can filter records... but still a count(*) with a where clause will incur in the same problem of what "exact" means. thanks -- Ivan Sergio Borgonovo http://www.webthatworks.it ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Ivan, > Please forgive my naiveness in this field but what does it mean an > "exact count" and what other DB means with "an exact count" and how > other DB deal with it? PostgreSQL will give you an exact count of the contents of the database as it is in the moment you begin your count. (i.e. the transaction starts) BUT as the table is not locked, in parallel somebody can bulkload MANY items into the database, so at the moment (start of your transaction) + 1msec your count may be invalid allready. > I'd expect it perform as good or better than other DB since now the > bottleneck should be how efficiently it can filter records... but > still a count(*) with a where clause will incur in the same problem > of what "exact" means. I know of 3 concepts to answer count() faster then PostreSQL: 1) just lie, present an estimate 2) do not have MVCC 3) store "record deleted info" in index, so you can answer count() with only scanning the index Harald -- GHUM Harald Massa persuadere et programmare Harald Armin Massa Spielberger Straße 49 70435 Stuttgart 0173/9409607 fx 01212-5-13695179 - EuroPython 2008 will take place in Vilnius, Lithuania - Stay tuned! ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| I believe I was misunderstood. The fact that a product is closed source does not make it a better product. Some companies that are using Oracle would be better off using PostgreSQL. Other companies that need the features that Oracle offers would not be better off using Postgresql. However, there are a lot of closed source products that are better then their open source counter-parts. Oracle vs. Postgresql is one of them. Obviously if you don't need the feature-set provided by Oracle then you would be foolish for paying for it. But if you do need the extra features, then it is worth it. Sim Clodoaldo wrote: > 2008/1/9, Sim Zacks <sim@compulab.co.il>: >> The reason companies go with the closed source, expensive solutions is because >> they are better products. > > Not necessarily. FOSS products don't have a selling team to persuade > and bribe people. Expensive solutions, and that is in part what make > them expensive, can spend lots of time persuading and can offer good > money to those who decide which is the "best" product. Those who > decide are not the coders or db admins and in general don't really > care much. > > Regards, Clodoaldo Pinto Neto > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > |
| |||
| On Tue, Jan 08, 2008 at 11:37:38PM -0700, Guido Neitzer wrote: > Like, I have a situation where I need multi-master just for > availability. Two small servers are good enough for that. But > unfortunately with PostgreSQL the whole setup is a major pain in the ... Really? I don't think a RAID array with heartbeat and failover is that big a pain. It requires some careful implementation, but it can be made to work well, I think. A ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Wed, 09 Jan 2008 16:33:54 +0200 Sim Zacks <sim@compulab.co.il> wrote: > Using count(*) is not bad design, though generally it makes sense > to use it with a where. I got the impression from others comments that postgresql under perform other DB even when a where clause on indexed column is involved. I may have misinterpreted this but still could someone clarify? Harald Armin Massa wrote: > 1) just lie, present an estimate > 2) do not have MVCC > 3) store "record deleted info" in index, so you can answer count() > with only scanning the index I got the impression that even counting with clauses on on indexed columns means you'll have to check if columns are still there. That seems to imply that the extra cost make pg under perform compared to other DB even in that scenario. I beg pardon to Harald if I misinterpreted his words. > Saying using count(*) is bad design means that the only design that > you can visualize is the specific one that you are using. I'd be interested in some example by Joshua otherwise I can't understand what he meant. If you're interested in all the record in a table, there is no way to have an "engraved in stone" answer and since there are no "where clauses" you can cache that info and update it once in a while. If you have a where clause I'm not expecting an "engraved in stone" answer but I'd expect to have a quick way to get an estimate and I still haven't understood if: a) I could improve my design to avoid count b) postgres perform as good as other db on count where there is a where clause c) is there a way to have a quick estimate avoiding count when there is a where clause > There are tons of real world examples where you need count. That is > why so many people use it as a benchmark. > Obviously if you have an application where millions of rows are > added and subtracted every minute, then the value of count is kind > of vague. > However, if you are querying a data warehouse that gets populated > once a month, then count has a huge value. You could use statistics > in that case, but it is much harder to join against other tables > when you are using statistics. It is also less intuitive then using > the SQL standard for finding the number of rows. Again: paging records. You can't do statistics. Surely you could optimise and lie... but that comes to a cost compared to the simplicity of count. Still everybody knows that a frequent complain about postgresql is it has a slow count. I can understand grey answer provided they are coherent. a) the above claim is false b) this claim is true just on cases where you could opt for a better design c) this claim is false for count without where clause d) this claim is true e) ... details on b) would be much appreciated. Other cases require just a yes/no answer. thanks -- Ivan Sergio Borgonovo http://www.webthatworks.it ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org/ |
| ||||
| On Jan 9, 2008 8:33 AM, Sim Zacks <sim@compulab.co.il> wrote: > Using count(*) is not bad design, though generally it makes sense to use it with > a where. > > Saying using count(*) is bad design means that the only design that you can > visualize is the specific one that you are using. > > There are tons of real world examples where you need count. That is why so many > people use it as a benchmark. > > Obviously if you have an application where millions of rows are added and > subtracted every minute, then the value of count is kind of vague. > > However, if you are querying a data warehouse that gets populated once a month, > then count has a huge value. You could use statistics in that case, but it is > much harder to join against other tables when you are using statistics. It is > also less intuitive then using the SQL standard for finding the number of rows. I think part of the problem is people think of count() as something other than an aggregate function. If I ran "select avg(i) from table" on a 20M row table, I'd expect it to take a few seconds, after all, I'm running a function across 20 Million rows. Some databases have the ability to short-circuit count(*) without a where clause, some with a where clause. But the basic model of count(*) is that it's an aggregate function, and what you're asking the db to do is to count every single row in the db that matches your where clause. Now, everything's a tradeoff. If PostgreSQL had visibility information in the indexes, it would have to lock both the table and index for every write, thus slowing down all the other queries that are trying to access the table. It would be a tradeoff that sacrificed write speed for read speed. In a db that was used mostly for writing, it would likely be a fair trade. In a db that did a lot of writing, it might slow the whole thing to a crawl. So, the slow count(*) performance of postgresql, especially count(*) without a where clause, is a result of the TANSTAAFL principle (there ain't no such thing as a free lunch). Now, if there's a where clause that's selective enough, then a count(*) query may use the index and be a bit faster, but remember, in pgsql, it's still got to actually hit the table to see if each tuple really is visible to this transaction, so the index needs to be fairly selective to be a win. A possible workaround is to have something like a separate table with nothing but the IDs and whatever would be in your where clause for the tables you're accessing with a foreign key to it, and use THAT for a count(*). Since the rows are skinnier, the count(*) will be faster. Another alternative is to have a trigger fire that keeps a track of the size of the table in a summary table when rows are added and deleted. Each of these methods "costs" you something, in time and effort or performance, and that's why they're not automatic. For instance, I have an 80M row stats db that grows by about 1M rows a week. I do NOT need to count the whole thing, hardly ever, and can live with the fact that I don't know exactly how many rows it has at any given time. When I do a select count(*) with a where clause it's usually restricted to < 1 weeks data and can use an index and come back pretty quickly. select count(*) from bigtable where inserttime > now() - interval '1 week'; count --------- 1254269 (1 row) Time: 21422.368 ms Second run: select count(*) from bigtable where inserttime > now() - interval '1 week'; count --------- 1254320 (1 row) Time: 2001.204 ms With the data loaded into shared_buffers / linux kernel cache, that's not too bad. Now, I run the same query against our production oracle machine, which is a MUCH more powerful server... SQL> select count(*) from bigtable where inserttime > SYSDATE-7; COUNT(*) ---------- 1255972 Elapsed: 00:00:18.62 second run: SQL> select count(*) from bigtable where inserttime > SYSDATE-7; COUNT(*) ---------- 1255973 Elapsed: 00:00:00.98 Now, obviously, Oracle's got some optimizations for what it's got in the buffer there, but the first run isn't really any faster. In fact, for a much more powerful machine, the performance was, relatively speaking, pretty bad compared to my little 1 CPU 1 sw RAID-10 reporting server. So, while PostgreSQL's count(*) performance isn't blindingly fast, it's not the dog some people make it out to be either. ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |