This is a discussion on Performance of count(*) within the Pgsql Performance forums, part of the PostgreSQL category.
| ||||
As you can see, PostgreSQL needs to do a sequential scan to count, because of its MVCC nature: indices don't carry transaction information. It's a known drawback inherent to the way PostgreSQL works, and that same design gives very good results in other areas. There has been talk of adding some kind of approximated count which wouldn't need a full table scan, but I don't think there's anything available right now.

On Thursday, 22 March 2007 11:53, Andreas Tille wrote:
> Hi,
>
> I'm just trying to find out why a simple count(*) might take that long.
> At first I tried explain, which rather quickly knows how many rows
> to check, but the final count is two orders of magnitude slower.
>
> My colleague, who uses MS SQL Server, can't believe that.
>
> $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
>    ->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> (2 rows)
>
> real    0m0.066s
> user    0m0.024s
> sys     0m0.008s
>
> $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
>   count
> ---------
>  4708941
> (1 row)
>
> real    0m4.474s
> user    0m0.036s
> sys     0m0.004s
>
>
> Any explanation?
>
> Kind regards
>
>          Andreas.

--
Albert Cervera Areny
Dept. Informàtica Sedifa, S.L.
| |||
explain is just "guessing" how many rows are in the table. Sometimes the guess is right, sometimes it is only a rough estimate.

sailabdb=# explain SELECT count(*) from sl_tuote;
                              QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=10187.10..10187.11 rows=1 width=0)
   ->  Seq Scan on sl_tuote  (cost=0.00..9806.08 rows=152408 width=0)
(2 rows)

sailabdb=# SELECT count(*) from sl_tuote;
 count
-------
 62073
(1 row)

So in that case explain estimates that the sl_tuote table has 152408 rows, but there are only 62073 rows.

After analyze the estimates are better:

sailabdb=# vacuum analyze sl_tuote;
VACUUM
sailabdb=# explain SELECT count(*) from sl_tuote;
                             QUERY PLAN
---------------------------------------------------------------------
 Aggregate  (cost=9057.91..9057.92 rows=1 width=0)
   ->  Seq Scan on sl_tuote  (cost=0.00..8902.73 rows=62073 width=0)
(2 rows)

You can never trust that estimate; you must always count it!

Ismo
| |||
* Andreas Tille <tillea@rki.de> [070322 12:07]:
> Hi,
>
> I just try to find out why a simple count(*) might last that long.
> At first I tried explain, which rather quickly knows how many rows
> to check, but the final count is two orders of magnitude slower.

Which version of PG?

The basic problem is that explain answers quickly because it only consults its statistics. The select proper, OTOH, has to go through the whole table to determine which rows are valid for your transaction.

That's the reason why PG does a SeqScan for select count(*) from table (check the newest releases; I seem to remember that there have been some aggregate optimizations there).

Btw, depending upon your data, doing a select count(*) from table where user=X will use an index, but will still need to fetch the rows proper to validate them.

Andreas
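The visibility check described above can be illustrated with a deliberately simplified model. This is a toy sketch, not PostgreSQL's actual implementation: the `RowVersion` class, the `is_visible` rule, and the transaction ids are all invented for illustration, and real snapshots track more state than a single id. It only aims to show why the count depends on who is asking, and why every stored row version has to be inspected.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class RowVersion:
    xmin: int             # id of the transaction that created this version
    xmax: Optional[int]   # id of the transaction that deleted it, or None


def is_visible(row: RowVersion, snapshot_xid: int, committed: Set[int]) -> bool:
    """Toy rule (an assumption of this sketch): a transaction's effects
    count only if it committed and its id is lower than the snapshot's id."""
    def counts(xid: int) -> bool:
        return xid in committed and xid < snapshot_xid

    if not counts(row.xmin):
        return False      # the creating transaction is not visible to us
    if row.xmax is not None and counts(row.xmax):
        return False      # the deletion is already visible to us
    return True


def count_visible(heap: List[RowVersion], snapshot_xid: int, committed: Set[int]) -> int:
    # No shortcut: every stored version is checked against the snapshot.
    return sum(1 for row in heap if is_visible(row, snapshot_xid, committed))


# Hypothetical table contents: four row versions, one of them deleted by txn 5.
heap = [RowVersion(1, None), RowVersion(2, 5), RowVersion(7, None), RowVersion(8, None)]
committed = {1, 2, 5, 7, 8}

print(count_visible(heap, snapshot_xid=4, committed=committed))
print(count_visible(heap, snapshot_xid=10, committed=committed))
```

With these made-up values the older snapshot counts 2 rows and the newer one counts 3, so no single stored number could answer both callers.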
| |||
* Mario Weilguni <mweilguni@sime.com> [070322 15:59]:
> On Thursday, 22 March 2007 15:33, Jonah H. Harris wrote:
> > On 3/22/07, Merlin Moncure <mmoncure@gmail.com> wrote:
> > > As others suggest select count(*) from table is very special case
> > > which non-mvcc databases can optimize for.
> >
> > Well, other MVCC database still do it faster than we do. However, I
> > think we'll be able to use the dead space map for speeding this up a
> > bit wouldn't we?
>
> Which MVCC DB do you mean? Just curious...

Well, MySQL claims InnoDB to be MVCC.

Andreas
| |||
In response to ismo.tuononen@solenovo.fi:
>
> approximated count?????
>
> why? who would need it? where you can use it?
>
> calculating costs and deciding how to execute query needs
> approximated count, but it's totally worthless information for any user
> IMO.

I don't think so.

We have some AJAX stuff where users enter search criteria on a web form, and the # of results updates in "real time" as they change their criteria. Right now, this works fine with small tables using count(*) -- it's fast enough not to be an issue, but we're aware that we can't use it on large tables.

An estimate_count(*) or similar that would allow us to put an estimate of how many results will be returned (not guaranteed accurate) would be very nice to have in these cases.

We're dealing with complex sets of criteria. It's very useful for the users to know in "real time" how much their search criteria are affecting the result pool. Once they feel they've limited as much as they can without reducing the pool too much, they can hit submit and get the actual result.

As I said, we do this with small data sets, but it's not terribly useful there. Where it will be useful is searches of large data sets, where constantly submitting and then retrying is overly time-consuming.

Of course, this is count(*)ing the results of a complex query, possibly with a bunch of joins and many limitations in the WHERE clause, so I'm not sure what could be done overall to improve the response time.

--
Bill Moran
Collaborative Fusion Inc.
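One way to approximate what an estimate_count would return is to ask the planner, as the EXPLAIN output earlier in this thread suggests. The sketch below is only an assumption-laden illustration: `estimate_count` and `FakeCursor` are hypothetical names, the cursor is any DB-API style object whose EXPLAIN result has one text column per plan line, and the query string must come from trusted code because it is concatenated into the statement. The figure is only as good as the table statistics, as the sl_tuote example shows.

```python
import re
from typing import Any, Optional, Sequence

# Matches the "rows=<n>" figure that EXPLAIN prints for each plan node.
ROWS_RE = re.compile(r"\brows=(\d+)")


def estimate_count(cursor: Any, query: str,
                   params: Optional[Sequence[Any]] = None) -> int:
    """Return the planner's row estimate for `query` (not an exact count).

    Pass the query WITHOUT count(*): an Aggregate top node always
    reports rows=1, which is not the number we are after.
    """
    sql = "EXPLAIN " + query
    if params is None:
        cursor.execute(sql)
    else:
        cursor.execute(sql, params)
    plan = cursor.fetchall()
    top_node = plan[0][0]          # first line describes the top plan node
    match = ROWS_RE.search(top_node)
    if match is None:
        raise ValueError("no rows= estimate found in: " + top_node)
    return int(match.group(1))


class FakeCursor:
    """Stand-in for a real database cursor; replays canned plan lines."""

    def __init__(self, plan_lines):
        self._rows = [(line,) for line in plan_lines]
        self.executed = None

    def execute(self, sql, params=None):
        self.executed = sql

    def fetchall(self):
        return self._rows


# Plan line copied from the original question in this thread.
cursor = FakeCursor(
    ["Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)"]
)
print(estimate_count(cursor, "SELECT 1 FROM agiraw"))
```

For a web form like the one described, the estimate could be shown while the user is still adjusting criteria, with the exact count(*) run only on submit.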
| |||
Michael Stone wrote:
> On Thu, Mar 22, 2007 at 01:30:35PM +0200, ismo.tuononen@solenovo.fi wrote:
>> approximated count?????
>>
>> why? who would need it? where you can use it?
>
> Do a google query. Look at the top of the page, where it says "results N
> to M of about O". For user interfaces (which is where a lot of this
> count(*) stuff comes from) you quite likely don't care about the exact
> count...

Right on, Michael.

One of our biggest single problems is this very thing. It's not a Postgres problem specifically, but more embedded in the idea of a relational database: There are no "job status" or "rough estimate of results" or "give me part of the answer" features that are critical to many real applications.

In our case (for a variety of reasons, but this one is critical), we actually can't use Postgres indexing at all -- we wrote an entirely separate indexing system for our data, one that has the following properties:

1. It can give out "pages" of information (i.e. "rows 50-60") without rescanning the skipped pages the way "limit/offset" would.
2. It can give accurate estimates of the total rows that will be returned.
3. It can accurately estimate the time it will take.

For our primary business-critical data, Postgres is merely a storage system, not a search system, because we have to do the "heavy lifting" in our own code. (To be fair, there is no relational database that can handle our data.)

Many or most web-based search engines face these exact problems.

Craig
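For readers wondering how a page can be fetched without rescanning the skipped rows, one commonly used approach is keyset pagination: remember the last key of the previous page and continue from there. This is not the custom indexing system described above, whose details are not given here; it is a generic sketch. The `items` table and the `fetch_page` helper are hypothetical, and Python's built-in sqlite3 module is used only so the example runs without a server.

```python
import sqlite3


def make_demo_db(n_rows: int = 100) -> sqlite3.Connection:
    """Build a throwaway in-memory table with ids 1..n_rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")
    conn.executemany(
        "INSERT INTO items (id, label) VALUES (?, ?)",
        [(i, "item %d" % i) for i in range(1, n_rows + 1)],
    )
    return conn


def fetch_page(conn: sqlite3.Connection, after_id: int, page_size: int = 10):
    """Return the next page of rows whose id is greater than after_id.

    The WHERE clause lets the primary-key index jump straight to the
    start of the page, instead of producing and discarding the skipped
    rows the way OFFSET does.
    """
    cur = conn.execute(
        "SELECT id, label FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    )
    return cur.fetchall()


conn = make_demo_db()
page = fetch_page(conn, after_id=50)
print([row[0] for row in page])

# The client only has to carry one value between requests: the last id seen.
next_page = fetch_page(conn, after_id=page[-1][0])
print([row[0] for row in next_page])
```

The trade-off is that the client can only step forward from a known key; jumping to an arbitrary page number still needs some other mechanism.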
| |||
approximated count?????

Why? Who would need it? Where can you use it?

Calculating costs and deciding how to execute a query needs an approximated count, but it's totally worthless information for any user, IMO.

Ismo

On Thu, 22 Mar 2007, Albert Cervera Areny wrote:
> As you can see, PostgreSQL needs to do a sequential scan to count because its
> MVCC nature and indices don't have transaction information. It's a known
> drawback inherent to the way PostgreSQL works and which gives very good
> results in other areas. It's been talked about adding some kind of
> approximated count which wouldn't need a full table scan but I don't think
> there's anything there right now.
| |||
On Thu, Mar 22, 2007 at 01:30:35PM +0200, ismo.tuononen@solenovo.fi wrote:
>approximated count?????
>
>why? who would need it? where you can use it?

Do a Google query. Look at the top of the page, where it says "results N to M of about O". For user interfaces (which is where a lot of this count(*) stuff comes from) you quite likely don't care about the exact count, because the user doesn't really care about the exact count.

IIRC, that's basically what you get with the MySQL count anyway, since there are corner cases for results in a transaction. Avoiding those cases is why the Postgres count takes so long; sometimes that's what's desired and sometimes it is not.

Mike Stone
| |||
On Thu, 22 Mar 2007, Andreas Kostyrka wrote:

> Which version of PG?

Ahh, sorry, forgot that. The issue occurs in the Debian (Etch) packaged version 7.4.16. I plan to switch soon to 8.1.8.

> That's the reason why PG (check the newest releases, I seem to
> remember that there has been some aggregate optimizations there),

I'll verify this once I have moved to the new version.

> does
> a SeqScan for select count(*) from table. btw, depending upon your
> data, doing a select count(*) from table where user=X will use an
> Index, but will still need to fetch the rows proper to validate them.

I have an index on three (out of 7) columns of this table. Is there any chance to optimize the indexing for this?

Well, to be honest I'm not really interested in the performance of count(*). I was just discussing general performance issues on the phone, and when my colleague asked me about the size of the database he wondered why this takes so long when his MS-SQL server does the same job much faster. So in principle I was just asking a first question that is easy to ask. Perhaps I will come up with more difficult optimisation questions.

Kind regards

         Andreas.

--
http://fam-tille.de
| ||||
On Thu, Mar 22, 2007 at 06:27:32PM +0100, Tino Wildenhain wrote:
>Craig A. James wrote:
>>You guys can correct me if I'm wrong, but the key feature that's missing
>>from Postgres's flexible indexing is the ability to maintain state
>>across queries. Something like this:
>>
>>  select a, b, my_index_state() from foo where ...
>>    offset 100 limit 10 using my_index(prev_my_index_state);
>>
>
>Yes, you are wrong :-) The technique is called "CURSOR"
>if you maintain persistent connection per session
>(e.g. standalone application or clever pooling web application)

Did you read the email before correcting it? From the part you trimmed out:

>The problem is that relational databases were invented before the web
>and its stateless applications. In the "good old days", you could
>connect to a database and work for hours, and in that environment
>cursors and such work well -- the RDBMS maintains the internal state of
>the indexing system. But in a web environment, state information is
>very difficult to maintain. There are all sorts of systems that try
>(Enterprise Java Beans, for example), but they're very complex.

It sounds like they wrote their own middleware to handle the problem, which is basically what you suggested (a "clever pooling web application") after saying "wrong".

Mike Stone
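To make the "persistent connection per session" idea concrete, here is a minimal sketch of middleware that keeps an open cursor per session token, so each stateless web request can ask for the next page. It is not the middleware discussed above; `CursorRegistry` and its methods are hypothetical, and sqlite3 stands in for a real server so the example runs as-is. With PostgreSQL the same role would be played by a server-side cursor (DECLARE and FETCH) on a connection the application keeps open between requests.

```python
import sqlite3
import uuid
from typing import Any, Dict, List, Sequence, Tuple


class CursorRegistry:
    """Holds open cursors between requests, keyed by an opaque token."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._open: Dict[str, sqlite3.Cursor] = {}

    def start(self, sql: str, params: Sequence[Any] = ()) -> str:
        """Run the query once and hand back a token for later requests."""
        token = uuid.uuid4().hex
        self._open[token] = self._conn.execute(sql, params)
        return token

    def next_page(self, token: str, page_size: int = 10) -> List[Tuple]:
        """Continue where the previous request stopped; no rescanning."""
        cursor = self._open.get(token)
        if cursor is None:
            return []                      # unknown or already finished
        rows = cursor.fetchmany(page_size)
        if not rows:
            self._open.pop(token).close()  # exhausted: release the state
        return rows


# Demo data: a hypothetical table with 25 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items (id) VALUES (?)", [(i,) for i in range(1, 26)])

registry = CursorRegistry(conn)
token = registry.start("SELECT id FROM items ORDER BY id")
print([r[0] for r in registry.next_page(token)])   # first request
print([r[0] for r in registry.next_page(token)])   # second request
```

The cost is exactly the one raised in the quoted text: the server has to hold state for every session, and something must clean up cursors whose users never come back.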