This is a discussion on Speed difference between select ... union select ... and select frompartitioned_table within the Pgsql Performance forums, part of the PostgreSQL category; --> Hi List! I executed 2 equivalents queries. The first one uses a union structure. The second uses a partitioned ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi List! I executed 2 equivalents queries. The first one uses a union structure. The second uses a partitioned table. The tables are the same with 30 millions of rows each one and the returned rows are the same. But the union query perform faster than the partitioned query. My question is: why? [pabloa@igor testeo]$ cat query-union.sql select e, p, sum( c) as c from ( select e, p, count( *) as c from tt_00003 group by e, p union select e, p, count( *) as c from tt_00006 group by e, p union select e, p, count( *) as c from tt_00009 group by e, p union select e, p, count( *) as c from tt_00012 group by e, p union select e, p, count( *) as c from tt_00015 group by e, p ) as t group by e, p order by e, p desc; [pabloa@igor testeo]$ cat query-heritage.sql select e, p, count( *) as c from tt group by e, p order by e, p desc; The server is a Athlon 64x2 6000+ 2 Gb RAM PostreSQL 8.2.5 The structure tables are: CREATE TABLE tt_00003 ( -- Inherited: idtt bigint NOT NULL, -- Inherited: idttp bigint NOT NULL, -- Inherited: e integer NOT NULL, -- Inherited: dmodi timestamp without time zone NOT NULL DEFAULT now(), -- Inherited: p integer NOT NULL DEFAULT 0, -- Inherited: m text NOT NULL, CONSTRAINT tt_00003_pkey PRIMARY KEY (idtt), CONSTRAINT tt_00003_idtt_check CHECK (idtt >= 1::bigint AND idtt <= 30000000::bigint) ) INHERITS (tt) WITHOUT OIDS; ALTER TABLE tt_00003 ; CREATE INDEX tt_00003_e ON tt_00003 USING btree (e); ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| On Fri, 2007-10-26 at 16:37 -0400, Pablo Alcaraz wrote: > Hi List! > > I executed 2 equivalents queries. The first one uses a union structure. > The second uses a partitioned table. The tables are the same with 30 > millions of rows each one and the returned rows are the same. > > But the union query perform faster than the partitioned query. > I think you mean to use UNION ALL here. UNION forces a DISTINCT, which results in a sort operation. What surprises me is that the UNION is actually faster than the partitioning using inheritance. I suspect it has something to do with the GROUP BYs, but we won't know until you post EXPLAIN ANALYZE results. Regards, Jeff Davis ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| I forgot to post the times: query-union: 21:59 query-heritage: 1:31:24 Regards Pablo Pablo Alcaraz wrote: > Hi List! > > I executed 2 equivalents queries. The first one uses a union > structure. The second uses a partitioned table. The tables are the > same with 30 millions of rows each one and the returned rows are the > same. > > But the union query perform faster than the partitioned query. > > My question is: why? > > [pabloa@igor testeo]$ cat query-union.sql > select e, p, sum( c) as c > from ( > select e, p, count( *) as c > from tt_00003 > group by e, p > union > select e, p, count( *) as c > from tt_00006 > group by e, p > union > select e, p, count( *) as c > from tt_00009 > group by e, p > union > select e, p, count( *) as c > from tt_00012 > group by e, p > union > select e, p, count( *) as c > from tt_00015 > group by e, p > ) as t > group by e, p > order by e, p desc; > > > > [pabloa@igor testeo]$ cat query-heritage.sql > select e, p, count( *) as c > from tt > group by e, p > order by e, p desc; > > > The server is a Athlon 64x2 6000+ 2 Gb RAM PostreSQL 8.2.5 > > The structure tables are: > > CREATE TABLE tt_00003 > ( > -- Inherited: idtt bigint NOT NULL, > -- Inherited: idttp bigint NOT NULL, > -- Inherited: e integer NOT NULL, > -- Inherited: dmodi timestamp without time zone NOT NULL DEFAULT now(), > -- Inherited: p integer NOT NULL DEFAULT 0, > -- Inherited: m text NOT NULL, > CONSTRAINT tt_00003_pkey PRIMARY KEY (idtt), > CONSTRAINT tt_00003_idtt_check CHECK (idtt >= 1::bigint AND idtt <= > 30000000::bigint) > ) INHERITS (tt) > WITHOUT OIDS; > ALTER TABLE tt_00003 ; > > CREATE INDEX tt_00003_e > ON tt_00003 > USING btree > (e); > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate > ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| "Pablo Alcaraz" <pabloa@laotraesquina.com.ar> writes: > Hi List! > > I executed 2 equivalents queries. The first one uses a union structure. The > second uses a partitioned table. The tables are the same with 30 millions of > rows each one and the returned rows are the same. > > But the union query perform faster than the partitioned query. > > My question is: why? > > [pabloa@igor testeo]$ cat query-union.sql > select e, p, sum( c) as c > from ( > select e, p, count( *) as c > from tt_00003 > group by e, p > union > select e, p, count( *) as c > from tt_00006 > group by e, p > union .... You should send along the "explain analyze" results for both queries, otherwise we're just guessing. Also, you should consider using UNION ALL instead of plain UNION. Finally you should consider removing all the intermediate GROUP BYs and just group the entire result. In theory it should be faster but in practice I'm not sure it works out that way. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| These are the EXPLAIN ANALIZE: I ran both queries on a CLUSTER and ANALYZEd tables: UNION QUERY explain analyze select e, p, sum( c) as c from ( select e, p, count( *) as c from tt_00003 group by e, p union select e, p, count( *) as c from tt_00006 group by e, p union select e, p, count( *) as c from tt_00009 group by e, p union select e, p, count( *) as c from tt_00012 group by e, p union select e, p, count( *) as c from tt_00015 group by e, p ) as t group by e, p order by e, p desc; "Sort (cost=2549202.87..2549203.37 rows=200 width=16) (actual time=263593.182..263593.429 rows=207 loops=1)" " Sort Key: t.e, t.p" " -> HashAggregate (cost=2549192.73..2549195.23 rows=200 width=16) (actual time=263592.469..263592.763 rows=207 loops=1)" " -> Unique (cost=2549172.54..2549179.88 rows=734 width=8) (actual time=263590.481..263591.764 rows=356 loops=1)" " -> Sort (cost=2549172.54..2549174.38 rows=734 width=8) (actual time=263590.479..263590.891 rows=356 loops=1)" " Sort Key: e, p, c" " -> Append (cost=1307131.88..2549137.60 rows=734 width=8) (actual time=132862.176..263589.774 rows=356 loops=1)" " -> HashAggregate (cost=1307131.88..1307133.03 rows=92 width=8) (actual time=132862.173..132862.483 rows=200 loops=1)" " -> Seq Scan on tt_00003 (cost=0.00..1081550.36 rows=30077536 width=8) (actual time=10.135..83957.424 rows=30000000 loops=1)" " -> HashAggregate (cost=1241915.64..1241916.16 rows=42 width=8) (actual time=130726.219..130726.457 rows=156 loops=1)" " -> Seq Scan on tt_00006 (cost=0.00..1028793.22 rows=28416322 width=8) (actual time=11.389..87338.730 rows=28351293 loops=1)" " -> HashAggregate (cost=24.53..27.03 rows=200 width=8) (actual time=0.005..0.005 rows=0 loops=1)" " -> Seq Scan on tt_00009 (cost=0.00..18.30 rows=830 width=8) (actual time=0.002..0.002 rows=0 loops=1)" " -> HashAggregate (cost=24.53..27.03 rows=200 width=8) (actual time=0.004..0.004 rows=0 loops=1)" " -> Seq Scan on tt_00012 (cost=0.00..18.30 rows=830 width=8) (actual time=0.002..0.002 rows=0 loops=1)" " -> HashAggregate (cost=24.53..27.03 rows=200 width=8) (actual time=0.005..0.005 rows=0 loops=1)" " -> Seq Scan on tt_00015 (cost=0.00..18.30 rows=830 width=8) (actual time=0.001..0.001 rows=0 loops=1)" "Total runtime: 263594.381 ms" PARTITIONED QUERY explain analyze select e, p, count( *) as c from tt group by e, p order by e, p desc; "GroupAggregate (cost=13256958.67..13842471.95 rows=40000 width=8) (actual time=899391.384..1065585.531 rows=207 loops=1)" " -> Sort (cost=13256958.67..13403211.99 rows=58501328 width=8) (actual time=899391.364..989749.914 rows=58351293 loops=1)" " Sort Key: public.tt.e, public.tt.p" " -> Append (cost=0.00..2110508.28 rows=58501328 width=8) (actual time=14.031..485211.466 rows=58351293 loops=1)" " -> Seq Scan on tt (cost=0.00..18.30 rows=830 width=8) (actual time=0.002..0.002 rows=0 loops=1)" " -> Seq Scan on tt_00003 tt (cost=0.00..1081550.36 rows=30077536 width=8) (actual time=14.024..178657.738 rows=30000000 loops=1)" " -> Seq Scan on tt_00006 tt (cost=0.00..1028793.22 rows=28416322 width=8) (actual time=39.852..168307.030 rows=28351293 loops=1)" " -> Seq Scan on tt_00009 tt (cost=0.00..18.30 rows=830 width=8) (actual time=0.001..0.001 rows=0 loops=1)" " -> Seq Scan on tt_00012 tt (cost=0.00..18.30 rows=830 width=8) (actual time=0.001..0.001 rows=0 loops=1)" " -> Seq Scan on tt_00015 tt (cost=0.00..18.30 rows=830 width=8) (actual time=0.001..0.001 rows=0 loops=1)" " -> Seq Scan on tt_00018 tt (cost=0.00..18.30 rows=830 width=8) (actual time=0.001..0.001 rows=0 loops=1)" " -> Seq Scan on tt_00021 tt (cost=0.00..18.30 rows=830 width=8) (actual time=0.002..0.002 rows=0 loops=1)" " -> Seq Scan on tt_00024 tt (cost=0.00..18.30 rows=830 width=8) (actual time=0.002..0.002 rows=0 loops=1)" " -> Seq Scan on tt_00027 tt (cost=0.00..18.30 rows=830 width=8) (actual time=0.002..0.002 rows=0 loops=1)" " -> Seq Scan on tt_00030 tt (cost=0.00..18.30 rows=830 width=8) (actual time=0.002..0.002 rows=0 loops=1)" "Total runtime: 1066301.084 ms" Any idea? Regards Pablo Jeff Davis wrote: > On Fri, 2007-10-26 at 16:37 -0400, Pablo Alcaraz wrote: > >> Hi List! >> >> I executed 2 equivalents queries. The first one uses a union structure. >> The second uses a partitioned table. The tables are the same with 30 >> millions of rows each one and the returned rows are the same. >> >> But the union query perform faster than the partitioned query. >> >> > > I think you mean to use UNION ALL here. UNION forces a DISTINCT, which > results in a sort operation. What surprises me is that the UNION is > actually faster than the partitioning using inheritance. > > I suspect it has something to do with the GROUP BYs, but we won't know > until you post EXPLAIN ANALYZE results. > > Regards, > Jeff Davis > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > > |
| |||
| Pablo Alcaraz <pabloa@laotraesquina.com.ar> writes: > These are the EXPLAIN ANALIZE: If you raise work_mem enough to let the second query use a hash aggregate (probably a few MB would do it), I think it'll be about the same speed as the first one. The reason it's not picking that on its own is the overestimate of the number of resulting groups. This is because get_variable_numdistinct is not smart about append relations. We should try to fix that sometime... regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| On Fri, 2007-10-26 at 16:37 -0400, Pablo Alcaraz wrote: > I executed 2 equivalents queries. The first one uses a union structure. > The second uses a partitioned table. The tables are the same with 30 > millions of rows each one and the returned rows are the same. > > But the union query perform faster than the partitioned query. > > My question is: why? The two queries are equivalent but they have different execution plans. The UNION query has explicit GROUP BY operations within it. We do not currently perform a push-down operation onto the individual partitions. This results in more data copying as well as requiring a single very large sort, rather than lots of small ones. That is probably enough to allow it to perform the sort in memory rather than on-disk, thus allowing a considerable speed-up. This is on my list of requirements for further partitioning improvements in 8.4 or beyond. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| ||||
| Pablo Alcaraz wrote: > These are the EXPLAIN ANALIZE: > If you raise work_mem enough to let the second query use a hash aggregate (probably a few MB would do it), I think it'll be about the same speed as the first one. The reason it's not picking that on its own is the overestimate of the number of resulting groups. This is because get_variable_numdistinct is not smart about append relations. We should try to fix that sometime... I re run the partitioned-query. it completed in 15996 seconds. It builded a BIG temp file: [root@igor xxx]# ls -lh pgsql-data/data/16386/pgsql_tmp/ total 2.2G -rw------- 1 postgres postgres 1.0G Oct 27 15:35 pgsql_tmp7004.0 -rw------- 1 postgres postgres 1.0G Oct 27 15:35 pgsql_tmp7004.1 -rw------- 1 postgres postgres 175M Oct 27 15:35 pgsql_tmp7004.2 work_mem=1Mb. How much do I need to raise work_mem variable? 2.2G? Regards Pablo |