This is a discussion on Re: Group by more efficient than distinct? within the Pgsql Performance forums, part of the PostgreSQL category; --> PFC writes: >- If you process up to some percentage of your RAM worth of data, hashing > is ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| PFC writes: >- If you process up to some percentage of your RAM worth of data, hashing > is going to be a lot faster Thanks for the excellent breakdown and explanation. I will try and get sizes of the tables in question and how much memory the machines have. > - If you need DISTINCT ON, well, you're stuck with the Sort > - So, for the time being, you can replace DISTINCT with GROUP BY... Have seen a few of those already on some code (new job..) so for those it is a matter of having a good disk subsystem? -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| On Sun, 20 Apr 2008 17:15:36 +0200, Francisco Reyes <lists@stringsutils.com> wrote: > PFC writes: > >> - If you process up to some percentage of your RAM worth of data, >> hashing is going to be a lot faster > > Thanks for the excellent breakdown and explanation. I will try and get > sizes of the tables in question and how much memory the machines have. Actually, the memory used by the hash depends on the number of distinct values, not the number of rows which are processed... Consider : SELECT a GROUP BY a SELECT a,count(*) GROUP BY a In both cases the hash only holds discinct values. So if you have 1 million rows to process but only 10 distinct values of "a", the hash will only contain those 10 values (and the counts), so it will be very small and fast, it will absorb a huge seq scan without problem. If however, you have (say) 100 million distinct values for a, using a hash would be a bad idea. As usual, divide the size of your RAM by the number of concurrent connections or something. Note that "a" could be a column, several columns, anything, the size of the hash will be proportional to the number of distinct values, ie. the number of rows returned by the query, not the number of rows processed (read) by the query. Same with hash joins etc, that's why when you join a very small table to a large one Postgres likes to use seq scan + hash join on the small table. >> - If you need DISTINCT ON, well, you're stuck with the Sort >> - So, for the time being, you can replace DISTINCT with GROUP BY... > > Have seen a few of those already on some code (new job..) so for those > it is a matter of having a good disk subsystem? Depends on your RAM, sorting in RAM is always faster than sorting on disk of course, unless you eat all the RAM and trash the other processes. Tradeoffs... -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| PFC wrote: > Actually, the memory used by the hash depends on the number of > distinct values, not the number of rows which are processed... > Consider : > > SELECT a GROUP BY a > SELECT a,count(*) GROUP BY a > > In both cases the hash only holds discinct values. So if you have > 1 million rows to process but only 10 distinct values of "a", the hash > will only contain those 10 values (and the counts), so it will be very > small and fast, it will absorb a huge seq scan without problem. If > however, you have (say) 100 million distinct values for a, using a > hash would be a bad idea. As usual, divide the size of your RAM by the > number of concurrent connections or something. > Note that "a" could be a column, several columns, anything, the > size of the hash will be proportional to the number of distinct > values, ie. the number of rows returned by the query, not the number > of rows processed (read) by the query. Same with hash joins etc, > that's why when you join a very small table to a large one Postgres > likes to use seq scan + hash join on the small table. This surprises me - hash values are lossy, so it must still need to confirm against the real list of values, which at a minimum should require references to the rows to check against? Is PostgreSQL doing something beyond my imagination? :-) Cheers, mark -- Mark Mielke <mark@mielke.cc> -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| Mark Mielke wrote: > PFC wrote: >> Actually, the memory used by the hash depends on the number of >> distinct values, not the number of rows which are processed... >> Consider : >> >> SELECT a GROUP BY a >> SELECT a,count(*) GROUP BY a >> >> In both cases the hash only holds discinct values. So if you have >> 1 million rows to process but only 10 distinct values of "a", the >> hash will only contain those 10 values (and the counts), so it will >> be very small and fast, it will absorb a huge seq scan without >> problem. If however, you have (say) 100 million distinct values for >> a, using a hash would be a bad idea. As usual, divide the size of >> your RAM by the number of concurrent connections or something. >> Note that "a" could be a column, several columns, anything, the >> size of the hash will be proportional to the number of distinct >> values, ie. the number of rows returned by the query, not the number >> of rows processed (read) by the query. Same with hash joins etc, >> that's why when you join a very small table to a large one Postgres >> likes to use seq scan + hash join on the small table. > > This surprises me - hash values are lossy, so it must still need to > confirm against the real list of values, which at a minimum should > require references to the rows to check against? > > Is PostgreSQL doing something beyond my imagination? :-) Hmmm... You did say distinct values, so I can see how that would work for distinct. What about seq scan + hash join, though? To complete the join, wouldn't it need to have a reference to each of the rows to join against? If there is 20 distinct values and 200 rows in the small table - wouldn't it need 200 references to be stored? Cheers, mark -- Mark Mielke <mark@mielke.cc> -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| On Mon, 21 Apr 2008, Mark Mielke wrote: > This surprises me - hash values are lossy, so it must still need to confirm > against the real list of values, which at a minimum should require references > to the rows to check against? > > Is PostgreSQL doing something beyond my imagination? :-) Not too far beyond your imagination, I hope. It's simply your assumption that the hash table is lossy. Sure, hash values are lossy, but a hash table isn't. Postgres stores in memory not only the hash values, but the rows they refer to as well, having checked them all on disc beforehand. That way, it doesn't need to look up anything on disc for that branch of the join again, and it has a rapid in-memory lookup for each row. Matthew -- X's book explains this very well, but, poor bloke, he did the Cambridge Maths Tripos... -- Computer Science Lecturer -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| Matthew Wakeling wrote: > On Mon, 21 Apr 2008, Mark Mielke wrote: >> This surprises me - hash values are lossy, so it must still need to >> confirm against the real list of values, which at a minimum should >> require references to the rows to check against? >> >> Is PostgreSQL doing something beyond my imagination? :-) > > Not too far beyond your imagination, I hope. > > It's simply your assumption that the hash table is lossy. Sure, hash > values are lossy, but a hash table isn't. Postgres stores in memory > not only the hash values, but the rows they refer to as well, having > checked them all on disc beforehand. That way, it doesn't need to look > up anything on disc for that branch of the join again, and it has a > rapid in-memory lookup for each row. I said hash *values* are lossy. I did not say hash table is lossy. The poster I responded to said that the memory required for a hash join was relative to the number of distinct values, not the number of rows. They gave an example of millions of rows, but only a few distinct values. Above, you agree with me that it it would include the rows (or at least references to the rows) as well. If it stores rows, or references to rows, then memory *is* relative to the number of rows, and millions of records would require millions of rows (or row references). Cheers, mark -- Mark Mielke <mark@mielke.cc> -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| On Tue, 22 Apr 2008, Mark Mielke wrote: > The poster I responded to said that the memory required for a hash join was > relative to the number of distinct values, not the number of rows. They gave > an example of millions of rows, but only a few distinct values. Above, you > agree with me that it it would include the rows (or at least references to > the rows) as well. If it stores rows, or references to rows, then memory *is* > relative to the number of rows, and millions of records would require > millions of rows (or row references). Yeah, I think we're talking at cross-purposes, due to hash tables being used in two completely different places in Postgres. Firstly, you have hash joins, where Postgres loads the references to the actual rows, and puts those in the hash table. For that situation, you want a small number of rows. Secondly, you have hash aggregates, where Postgres stores an entry for each "group" in the hash table, and does not store the actual rows. For that situation, you can have a bazillion individual rows, but only a small number of distinct groups. Matthew -- First law of computing: Anything can go wro sig: Segmentation fault. core dumped. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| ||||
| Matthew Wakeling wrote: > On Tue, 22 Apr 2008, Mark Mielke wrote: >> The poster I responded to said that the memory required for a hash >> join was relative to the number of distinct values, not the number of >> rows. They gave an example of millions of rows, but only a few >> distinct values. Above, you agree with me that it it would include >> the rows (or at least references to the rows) as well. If it stores >> rows, or references to rows, then memory *is* relative to the number >> of rows, and millions of records would require millions of rows (or >> row references). > > Yeah, I think we're talking at cross-purposes, due to hash tables > being used in two completely different places in Postgres. Firstly, > you have hash joins, where Postgres loads the references to the actual > rows, and puts those in the hash table. For that situation, you want a > small number of rows. Secondly, you have hash aggregates, where > Postgres stores an entry for each "group" in the hash table, and does > not store the actual rows. For that situation, you can have a > bazillion individual rows, but only a small number of distinct groups. That makes sense with my reality. :-) Thanks, mark -- Mark Mielke <mark@mielke.cc> -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |