This is a discussion on count * performance issue within the Pgsql Performance forums, part of the PostgreSQL category; --> On Thu, Mar 06, 2008 at 07:28:50AM -0800, Craig James wrote: > Count() on Oracle and MySQL is almost ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| On Thu, Mar 06, 2008 at 07:28:50AM -0800, Craig James wrote: > Count() on Oracle and MySQL is almost instantaneous, even for very large > tables. So why can't Postgres do what they do? In MySQL's case: Handle transactions. (Try COUNT(*) on an InnoDB table.) /* Steinar */ -- Homepage: http://www.sesse.net/ -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://mail.postgresql.org/mj/mj_www...ql-performance |
| |||
| On Thu, 6 Mar 2008, Steinar H. Gunderson wrote: > On Thu, Mar 06, 2008 at 07:28:50AM -0800, Craig James wrote: >> Count() on Oracle and MySQL is almost instantaneous, even for very large >> tables. So why can't Postgres do what they do? > > In MySQL's case: Handle transactions. (Try COUNT(*) on an InnoDB table.) Exactly. There is a good discussion of this at http://www.mysqlperformanceblog.com/...t-vs-countcol/ and I found the comments from Ken Jacobs were the most informative. In short, if you want any reasonable database integrity you have to use InnoDB with MySQL, and once you make that choice it has the same problem. You only get this accelerated significantly when using MyISAM, which can tell you an exact count of all the rows it hasn't corrupted yet. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://mail.postgresql.org/mj/mj_www...ql-performance |
| |||
| In response to Craig James <craig_james@emolecules.com>: > In the 3 years I've been using Postgres, the problem of count() performance has come up more times than I can recall, and each time the answer is, "It's a sequential scan -- redesign your application." > > My question is: What do the other databases do that Postgres can't do, and why not? > > Count() on Oracle and MySQL is almost instantaneous, even for very large tables. So why can't Postgres do what they do? I don't know about Oracle, but MySQL has this problem as well. Use innodb tables and see how slow it is. The only reason myisam tables don't have this problem is because they don't implement any of the features that make the problem difficult to solve. > On the one hand, I understand that Postgres has its architecture, and I understand the issue of row visibility, and so forth. On the other hand, my database is just sitting there, nothing going on, no connections except me, and... it takes FIFTY FIVE SECONDS to count 20 million rows, a query that either Oracle or MySQL would answer in a fraction of a second. It's hard for me to believe there isn't a better way. There's been discussion about putting visibility information in indexes. I don't know how far along that effort is, but I expect that will improve count() performance significantly. > This is a real problem. Countless people (including me) have spent significant effort rewriting applications because of this performance flaw in Postgres. Over and over, the response is, "You don't really need to do that ... change your application." Well, sure, it's always possible to change the application, but that misses the point. To most of us users, count() seems like it should be a trivial operation. On other relational database systems, it is a trivial operation. > > This is really a significant flaw on an otherwise excellent relational database system. Not really. It really is a design flaw in your application ... it doesn't make relational sense to use the number of rows in a table for anything. Just because other people do it frequently doesn't make it right. That being said, it's still a useful feature, and I don't hear anyone denying that. As I said, google around a bit WRT to PG storing visibility information in indexes, as I think that's the way this will be improved. > My rant for today... Feel better now? -- Bill Moran -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://mail.postgresql.org/mj/mj_www...ql-performance |
| |||
| Craig James <craig_james@emolecules.com> writes: > Count() on Oracle and MySQL is almost instantaneous, even for very large tables. So why can't Postgres do what they do? AFAIK the above claim is false for Oracle. They have the same transactional issues we do. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://mail.postgresql.org/mj/mj_www...ql-performance |
| |||
| Craig James wrote: > > My question is: What do the other databases do that Postgres can't do, > and why not? > > Count() on Oracle and MySQL is almost instantaneous, even for very > large tables. So why can't Postgres do what they do? > I think Mysql can only do that for the myisam engine - innodb and falcon are similar to Postgres. I don't believe Oracle optimizes bare count(*) on a table either - tho it may be able to use a suitable index (if present) to get the answer quicker. regards Mark -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| Tom Lane wrote: > Craig James <craig_james@emolecules.com> writes: >> Count() on Oracle and MySQL is almost instantaneous, even for very large tables. So why can't Postgres do what they do? > > AFAIK the above claim is false for Oracle. They have the same > transactional issues we do. My experience doesn't match this claim. When I ported my application from Oracle to Postgres, this was the single biggest performance problem. count() in Oracle was always very fast. We're not talking about a 20% or 50% difference, we're talking about a small fraction of a second (Oracle) versus a minute (Postgres) -- something like two or three orders of magnitude. It may be that Oracle has a way to detect when there's no transaction and use a faster method. If so, this was a clever optimization -- in my experience, that represents the vast majority of the times you want to use count(). It's not very useful to count the rows of a table that many apps are actively modifying since the result may change the moment your transaction completes. Most of the time when you use count(), it's because you're the only one modifying the table, so the count will be meaningful. Craig -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| Craig James wrote: > Tom Lane wrote: >> Craig James <craig_james@emolecules.com> writes: >>> Count() on Oracle and MySQL is almost instantaneous, even for very >>> large tables. So why can't Postgres do what they do? >> >> AFAIK the above claim is false for Oracle. They have the same >> transactional issues we do. > > My experience doesn't match this claim. When I ported my application > from Oracle to Postgres, this was the single biggest performance > problem. count() in Oracle was always very fast. We're not talking > about a 20% or 50% difference, we're talking about a small fraction of > a second (Oracle) versus a minute (Postgres) -- something like two or > three orders of magnitude. > > It may be that Oracle has a way to detect when there's no transaction > and use a faster method. If so, this was a clever optimization -- in > my experience, that represents the vast majority of the times you want > to use count(). It's not very useful to count the rows of a table > that many apps are actively modifying since the result may change the > moment your transaction completes. Most of the time when you use > count(), it's because you're the only one modifying the table, so the > count will be meaningful. > > Craig > > Oracle will use a btree index on a not null set of columns to do a fast full index scan, which can be an order of magnitude or faster compared to a table scan. Also, Oracle can use a bitmap index (in cases where a bitmap index isn't otherwise silly) for a bitmap fast index scan/bitmap conversion for similar dramatic results. For "large" tables, Oracle is not going to be as fast as MyISAM tables in MySQL, even with these optimizations, since MyISAM doesn't have to scan even index pages to get a count(*) answer against the full table. Paul -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| Mark Mielke wrote: > Josh Berkus wrote: >>>> Count() on Oracle and MySQL is almost instantaneous, even for very >>>> large tables. So why can't Postgres do what they do? >>>> >>> AFAIK the above claim is false for Oracle. They have the same >>> transactional issues we do. >>> >> >> Nope. Oracle's MVCC is implemented through rollback segments, rather than >> non-overwriting the way ours is. So Oracle can just do a count(*) on the >> index, then check the rollback segment for any concurrent >> update/delete/insert activity and adjust the count. This sucks if there's >> a *lot* of concurrent activity, but in the usual case it's pretty fast > > I read the "almost instantaneous" against "the above claim is false" and > "Nope.", and I am not sure from the above whether you are saying that > Oracle keeps an up-to-date count for the index (which might make it > instantaneous?), or whether you are saying it still has to scan the > index - which can take time if the index is large (therefore not > instantaneous). > > Cheers, > mark > > -- > Mark Mielke <mark@mielke.cc> > Oracle scans the index pages, if the b-tree index is on non-nullable columns, or if the bitmap index is on low-ish cardinality data. Otherwise, it table scans. MyISAM in MySQL would be an example where a counter is kept. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| On 6-3-2008 16:28 Craig James wrote: > On the one hand, I understand that Postgres has its architecture, and I > understand the issue of row visibility, and so forth. On the other > hand, my database is just sitting there, nothing going on, no > connections except me, and... it takes FIFTY FIVE SECONDS to count 20 > million rows, a query that either Oracle or MySQL would answer in a > fraction of a second. It's hard for me to believe there isn't a better > way. Can you explain to me how you'd fit this in a fraction of a second? mysql> select count(*) from messages; +----------+ | count(*) | +----------+ | 21908505 | +----------+ 1 row in set (8 min 35.09 sec) This is a table containing the messages on forumtopics and is therefore relatively large. The hardware is quite beefy for a forum however (4 3Ghz cores, 16GB, 14+1 disk raid5). This table has about 20GB of data. If I use a table that contains about the same amount of records as the above and was before this query probably much less present in the innodb-buffer (but also less frequently touched by other queries), we see this: mysql> select count(*) from messagesraw; +----------+ | count(*) | +----------+ | 21962804 | +----------+ 1 row in set (5 min 16.41 sec) This table is about 12GB. In both cases MySQL claimed to be 'Using index' with the PRIMARY index, which for those tables is more or less identical. Apparently the time is still table-size related, not necessarily tuple-count related. As this shows: mysql> select count(*) from articlestats; +----------+ | count(*) | +----------+ | 34467246 | +----------+ 1 row in set (54.14 sec) that table is only 2.4GB, but contains 57% more records, although this was on another database on a system with somewhat different specs (8 2.6Ghz cores, 16GB, 7+7+1 raid50), used a non-primary index and I have no idea how well that index was in the system's cache prior to this query. Repeating it makes it do that query in 6.65 seconds, repeating the 12GB-query doesn't make it any faster. Anyway, long story short: MySQL's table-count stuff also seems table-size related. As soon as the index it uses fits in the cache or it doesn't have to use the primary index, it might be a different story, but when the table(index) is too large to fit, it is quite slow. Actually, it doesn't appear to be much faster than Postgresql's (8.2) table-based counts. If I use a much slower machine (2 2Ghz opterons, 8GB ddr memory, 5+1 old 15k rpm scsi disks in raid5) with a 1GB, 13M record table wich is similar to the above articlestats, it is able to return a count(*) in 3 seconds after priming the cache. If you saw instantaneous results with MySQL, you have either seen the query-cache at work or where using myisam. Or perhaps with a fast system, you had small tuples with a nice index in a nicely primed cache. Best regards, Arjen -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| ||||
| "Tom Lane" <tgl@sss.pgh.pa.us> writes: > Well, scanning an index to get a count might be significantly faster > than scanning the main table, but it's hardly "instantaneous". It's > still going to take time proportional to the table size. Hm, Mark's comment about bitmap indexes makes that not entirely true. A bitmap index can do RLE compression which makes the relationship between the size of the table and the time taken to scan the index more complex. In the degenerate case where there are no concurrent updates (assuming you can determine that quickly) it might actually be constant time. > Unless they keep a central counter of the number of index entries; > which would have all the same serialization penalties we've talked > about before... Bitmap indexes do in fact have concurrency issues -- arguably they're just a baroque version of this central counter in this case. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support! -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| Thread Tools | |
| Display Modes | |
|
|