Unix Technical Forum

Inner join vs where-clause subquery

This is a discussion on Inner join vs where-clause subquery within the Pgsql Performance forums, part of the PostgreSQL category; --> I have the following query which performs extremely slow: select min(nlogid) as start_nlogid, max(nlogid) as end_nlogid, min(dtCreateDate) as start_transaction_timestamp, ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 09:58 AM
Jeremy Haile
 
Posts: n/a
Default Inner join vs where-clause subquery

I have the following query which performs extremely slow:
select min(nlogid) as start_nlogid,
max(nlogid) as end_nlogid,
min(dtCreateDate) as start_transaction_timestamp,
max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > ( select max(a.end_nlogid) from
activity_log_import_history a)
and dtCreateDate < '2006-12-18 9:10'


If I change the where clause to have the return value of the subquery it
runs very fast:
select min(nlogid) as start_nlogid,
max(nlogid) as end_nlogid,
min(dtCreateDate) as start_transaction_timestamp,
max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > 402123456
and dtCreateDate < '2006-12-18 9:10'


If I change the query to the following, it runs fast:
select min(nlogid) as start_nlogid,
max(nlogid) as end_nlogid,
min(dtCreateDate) as start_transaction_timestamp,
max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid from
activity_log_import_history) as a
on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < ${IMPORT_TIMESTAMP}


I am running PG 8.2. Why is that this the case? Shouldn't the query
planner be smart enough to know that the first query is the same as the
second and third? The inner query does not refer to any columns outside
of itself. I personally find the first query easiest to read and wish
it performed well.

Jeremy Haile

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 09:58 AM
Richard Huxton
 
Posts: n/a
Default Re: Inner join vs where-clause subquery

Jeremy Haile wrote:
> I have the following query which performs extremely slow:
> select min(nlogid) as start_nlogid,
> max(nlogid) as end_nlogid,
> min(dtCreateDate) as start_transaction_timestamp,
> max(dtCreateDate) as end_transaction_timestamp
> from activity_log_facts
> where nlogid > ( select max(a.end_nlogid) from
> activity_log_import_history a)
> and dtCreateDate < '2006-12-18 9:10'


Can you post the EXPLAIN ANALYSE for this one please? That'll show us
exactly what it's doing.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 09:58 AM
Jeremy Haile
 
Posts: n/a
Default Re: Inner join vs where-clause subquery

Here is the explain analyze output:

Result (cost=9.45..9.46 rows=1 width=0) (actual
time=156589.390..156589.391 rows=1 loops=1)
InitPlan
-> Result (cost=0.04..0.05 rows=1 width=0) (actual
time=0.034..0.034 rows=1 loops=1)
InitPlan
-> Limit (cost=0.00..0.04 rows=1 width=4) (actual
time=0.029..0.030 rows=1 loops=1)
-> Index Scan Backward using
activity_log_import_history_end_nlogid_idx on
activity_log_import_history a (cost=0.00..113.43
rows=2877 width=4) (actual time=0.027..0.027 rows=1
loops=1)
Filter: (end_nlogid IS NOT NULL)
-> Limit (cost=0.00..1.19 rows=1 width=12) (actual
time=0.052..0.052 rows=0 loops=1)
-> Index Scan using activity_log_facts_pkey on
activity_log_facts (cost=0.00..1831613.82 rows=1539298
width=12) (actual time=0.050..0.050 rows=0 loops=1)
Index Cond: (nlogid > $1)
Filter: ((nlogid IS NOT NULL) AND (dtcreatedate <
'2006-12-18 09:10:00'::timestamp without time zone))
-> Limit (cost=0.00..1.19 rows=1 width=12) (actual
time=0.006..0.006 rows=0 loops=1)
-> Index Scan Backward using activity_log_facts_pkey on
activity_log_facts (cost=0.00..1831613.82 rows=1539298
width=12) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (nlogid > $1)
Filter: ((nlogid IS NOT NULL) AND (dtcreatedate <
'2006-12-18 09:10:00'::timestamp without time zone))
-> Limit (cost=0.00..3.51 rows=1 width=12) (actual
time=100221.955..100221.955 rows=0 loops=1)
-> Index Scan using activity_log_facts_dtcreatedate_idx on
activity_log_facts (cost=0.00..5406927.50 rows=1539298
width=12) (actual time=100221.953..100221.953 rows=0 loops=1)
Index Cond: (dtcreatedate < '2006-12-18
09:10:00'::timestamp without time zone)
Filter: ((dtcreatedate IS NOT NULL) AND (nlogid > $1))
-> Limit (cost=0.00..3.51 rows=1 width=12) (actual
time=56367.367..56367.367 rows=0 loops=1)
-> Index Scan Backward using
activity_log_facts_dtcreatedate_idx on activity_log_facts
(cost=0.00..5406927.50 rows=1539298 width=12) (actual
time=56367.364..56367.364 rows=0 loops=1)
Index Cond: (dtcreatedate < '2006-12-18
09:10:00'::timestamp without time zone)
Filter: ((dtcreatedate IS NOT NULL) AND (nlogid > $1))
Total runtime: 156589.605 ms


On Tue, 19 Dec 2006 16:31:41 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > I have the following query which performs extremely slow:
> > select min(nlogid) as start_nlogid,
> > max(nlogid) as end_nlogid,
> > min(dtCreateDate) as start_transaction_timestamp,
> > max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > ( select max(a.end_nlogid) from
> > activity_log_import_history a)
> > and dtCreateDate < '2006-12-18 9:10'

>
> Can you post the EXPLAIN ANALYSE for this one please? That'll show us
> exactly what it's doing.
>
> --
> Richard Huxton
> Archonet Ltd


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 09:58 AM
Richard Huxton
 
Posts: n/a
Default Re: Inner join vs where-clause subquery

Jeremy Haile wrote:
> Here is the explain analyze output:


Well, the row estimates are about as far out as you can get:

> -> Index Scan using activity_log_facts_pkey on
> activity_log_facts (cost=0.00..1831613.82 rows=1539298
> width=12) (actual time=0.050..0.050 rows=0 loops=1)


> -> Index Scan Backward using activity_log_facts_pkey on
> activity_log_facts (cost=0.00..1831613.82 rows=1539298
> width=12) (actual time=0.004..0.004 rows=0 loops=1)


> -> Index Scan using activity_log_facts_dtcreatedate_idx on
> activity_log_facts (cost=0.00..5406927.50 rows=1539298
> width=12) (actual time=100221.953..100221.953 rows=0 loops=1)


> -> Index Scan Backward using
> activity_log_facts_dtcreatedate_idx on activity_log_facts
> (cost=0.00..5406927.50 rows=1539298 width=12) (actual
> time=56367.364..56367.364 rows=0 loops=1)


Hmm - it's using the indexes on dtCreateDate and nlogid which seems
broadly sensible, and then plans to limit the results for min()/max().
However, it's clearly wrong about how many rows will satisfy
nlogid > (select max(a.end_nlogid) from activity_log_import_history a)

>>> select min(nlogid) as start_nlogid,
>>> max(nlogid) as end_nlogid,
>>> min(dtCreateDate) as start_transaction_timestamp,
>>> max(dtCreateDate) as end_transaction_timestamp
>>> from activity_log_facts
>>> where nlogid > ( select max(a.end_nlogid) from
>>> activity_log_import_history a)
>>> and dtCreateDate < '2006-12-18 9:10'


If you run explain on the other forms of your query, I'd guess it's much
more accurate. There's a simple way to see if that is the issue. Run the
sub-query and substitute the actual value returned into the query above.
Then, try the same but with a prepared query. If it's down to nlogid
estimates then the first should be fast and the second slow.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 09:58 AM
Jeremy Haile
 
Posts: n/a
Default Re: Inner join vs where-clause subquery

Here's the query and explain analyze using the result of the sub-query
substituted:

QUERY
explain analyze select min(nlogid) as start_nlogid,
max(nlogid) as end_nlogid,
min(dtCreateDate) as start_transaction_timestamp,
max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
where nlogid > 478287801
and dtCreateDate < '2006-12-18 9:10'

EXPLAIN ANALYZE
Aggregate (cost=657.37..657.38 rows=1 width=12) (actual
time=0.018..0.019 rows=1 loops=1)
-> Index Scan using activity_log_facts_nlogid_idx on
activity_log_facts (cost=0.00..652.64 rows=472 width=12) (actual
time=0.014..0.014 rows=0 loops=1)
Index Cond: (nlogid > 478287801)
Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
time zone)
Total runtime: 0.076 ms


Sorry if the reason should be obvious, but I'm not the best at
interpreting the explains. Why is this explain so much simpler than the
other query plan (with the subquery)?



On Tue, 19 Dec 2006 18:23:06 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > Here is the explain analyze output:

>
> Well, the row estimates are about as far out as you can get:
>
> > -> Index Scan using activity_log_facts_pkey on
> > activity_log_facts (cost=0.00..1831613.82 rows=1539298
> > width=12) (actual time=0.050..0.050 rows=0 loops=1)

>
> > -> Index Scan Backward using activity_log_facts_pkey on
> > activity_log_facts (cost=0.00..1831613.82 rows=1539298
> > width=12) (actual time=0.004..0.004 rows=0 loops=1)

>
> > -> Index Scan using activity_log_facts_dtcreatedate_idx on
> > activity_log_facts (cost=0.00..5406927.50 rows=1539298
> > width=12) (actual time=100221.953..100221.953 rows=0 loops=1)

>
> > -> Index Scan Backward using
> > activity_log_facts_dtcreatedate_idx on activity_log_facts
> > (cost=0.00..5406927.50 rows=1539298 width=12) (actual
> > time=56367.364..56367.364 rows=0 loops=1)

>
> Hmm - it's using the indexes on dtCreateDate and nlogid which seems
> broadly sensible, and then plans to limit the results for min()/max().
> However, it's clearly wrong about how many rows will satisfy
> nlogid > (select max(a.end_nlogid) from activity_log_import_history a)
>
> >>> select min(nlogid) as start_nlogid,
> >>> max(nlogid) as end_nlogid,
> >>> min(dtCreateDate) as start_transaction_timestamp,
> >>> max(dtCreateDate) as end_transaction_timestamp
> >>> from activity_log_facts
> >>> where nlogid > ( select max(a.end_nlogid) from
> >>> activity_log_import_history a)
> >>> and dtCreateDate < '2006-12-18 9:10'

>
> If you run explain on the other forms of your query, I'd guess it's much
> more accurate. There's a simple way to see if that is the issue. Run the
> sub-query and substitute the actual value returned into the query above.
> Then, try the same but with a prepared query. If it's down to nlogid
> estimates then the first should be fast and the second slow.
>
> --
> Richard Huxton
> Archonet Ltd


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 09:58 AM
Richard Huxton
 
Posts: n/a
Default Re: Inner join vs where-clause subquery

Jeremy Haile wrote:
> Here's the query and explain analyze using the result of the sub-query
> substituted:
>
> QUERY
> explain analyze select min(nlogid) as start_nlogid,
> max(nlogid) as end_nlogid,
> min(dtCreateDate) as start_transaction_timestamp,
> max(dtCreateDate) as end_transaction_timestamp
> from activity_log_facts
> where nlogid > 478287801
> and dtCreateDate < '2006-12-18 9:10'
>
> EXPLAIN ANALYZE
> Aggregate (cost=657.37..657.38 rows=1 width=12) (actual
> time=0.018..0.019 rows=1 loops=1)
> -> Index Scan using activity_log_facts_nlogid_idx on
> activity_log_facts (cost=0.00..652.64 rows=472 width=12) (actual
> time=0.014..0.014 rows=0 loops=1)
> Index Cond: (nlogid > 478287801)
> Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
> time zone)
> Total runtime: 0.076 ms
>
>
> Sorry if the reason should be obvious, but I'm not the best at
> interpreting the explains. Why is this explain so much simpler than the
> other query plan (with the subquery)?


Because it's planning it with knowledge of what "nlogid"s it's filtering
by. It knows it isn't going to get many rows back with nlogid >
478287801. In your previous explain it thought a large number of rows
would match and was trying not to sequentially scan the
activity_log_facts table.

Ideally, the planner would evaluate the subquery in your original form
(it should know it's only getting one row back from max()). Then it
could plan the query as above. I'm not sure how tricky that is to do though.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 09:58 AM
Jeremy Haile
 
Posts: n/a
Default Re: Inner join vs where-clause subquery

Makes sense. It is NOT executing the subquery more than once is it?

On Tue, 19 Dec 2006 20:02:35 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > Here's the query and explain analyze using the result of the sub-query
> > substituted:
> >
> > QUERY
> > explain analyze select min(nlogid) as start_nlogid,
> > max(nlogid) as end_nlogid,
> > min(dtCreateDate) as start_transaction_timestamp,
> > max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > 478287801
> > and dtCreateDate < '2006-12-18 9:10'
> >
> > EXPLAIN ANALYZE
> > Aggregate (cost=657.37..657.38 rows=1 width=12) (actual
> > time=0.018..0.019 rows=1 loops=1)
> > -> Index Scan using activity_log_facts_nlogid_idx on
> > activity_log_facts (cost=0.00..652.64 rows=472 width=12) (actual
> > time=0.014..0.014 rows=0 loops=1)
> > Index Cond: (nlogid > 478287801)
> > Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
> > time zone)
> > Total runtime: 0.076 ms
> >
> >
> > Sorry if the reason should be obvious, but I'm not the best at
> > interpreting the explains. Why is this explain so much simpler than the
> > other query plan (with the subquery)?

>
> Because it's planning it with knowledge of what "nlogid"s it's filtering
> by. It knows it isn't going to get many rows back with nlogid >
> 478287801. In your previous explain it thought a large number of rows
> would match and was trying not to sequentially scan the
> activity_log_facts table.
>
> Ideally, the planner would evaluate the subquery in your original form
> (it should know it's only getting one row back from max()). Then it
> could plan the query as above. I'm not sure how tricky that is to do
> though.
>
> --
> Richard Huxton
> Archonet Ltd


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 09:59 AM
Jeremy Haile
 
Posts: n/a
Default Re: Inner join vs where-clause subquery

I'm still confused as to why the inner join version ran so much faster
than the where-clause version.

Here's the inner join query and explain ouput:
select min(nlogid) as start_nlogid,
max(nlogid) as end_nlogid,
min(dtCreateDate) as start_transaction_timestamp,
max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid from
activity_log_import_history) as a
on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < '2006-12-18 9:10'

Aggregate (cost=246226.95..246226.96 rows=1 width=12)
-> Nested Loop (cost=49233.27..231209.72 rows=1501722 width=12)
-> Result (cost=0.04..0.05 rows=1 width=0)
InitPlan
-> Limit (cost=0.00..0.04 rows=1 width=4)
-> Index Scan Backward using
activity_log_import_history_end_nlogid_idx on
activity_log_import_history (cost=0.00..114.97
rows=2913 width=4)
Filter: (end_nlogid IS NOT NULL)
-> Bitmap Heap Scan on activity_log_facts
(cost=49233.23..210449.44 rows=1660817 width=12)
Recheck Cond: (activity_log_facts.nlogid >
a.previous_nlogid)
Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp
without time zone)
-> Bitmap Index Scan on activity_log_facts_nlogid_idx
(cost=0.00..49233.23 rows=1660817 width=0)
Index Cond: (activity_log_facts.nlogid >
a.previous_nlogid)


Since the inner join is basically the same thing as doing the
where-clause subquery, why does it generate a far different plan?



On Tue, 19 Dec 2006 20:02:35 +0000, "Richard Huxton" <dev@archonet.com>
said:
> Jeremy Haile wrote:
> > Here's the query and explain analyze using the result of the sub-query
> > substituted:
> >
> > QUERY
> > explain analyze select min(nlogid) as start_nlogid,
> > max(nlogid) as end_nlogid,
> > min(dtCreateDate) as start_transaction_timestamp,
> > max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > 478287801
> > and dtCreateDate < '2006-12-18 9:10'
> >
> > EXPLAIN ANALYZE
> > Aggregate (cost=657.37..657.38 rows=1 width=12) (actual
> > time=0.018..0.019 rows=1 loops=1)
> > -> Index Scan using activity_log_facts_nlogid_idx on
> > activity_log_facts (cost=0.00..652.64 rows=472 width=12) (actual
> > time=0.014..0.014 rows=0 loops=1)
> > Index Cond: (nlogid > 478287801)
> > Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
> > time zone)
> > Total runtime: 0.076 ms
> >
> >
> > Sorry if the reason should be obvious, but I'm not the best at
> > interpreting the explains. Why is this explain so much simpler than the
> > other query plan (with the subquery)?

>
> Because it's planning it with knowledge of what "nlogid"s it's filtering
> by. It knows it isn't going to get many rows back with nlogid >
> 478287801. In your previous explain it thought a large number of rows
> would match and was trying not to sequentially scan the
> activity_log_facts table.
>
> Ideally, the planner would evaluate the subquery in your original form
> (it should know it's only getting one row back from max()). Then it
> could plan the query as above. I'm not sure how tricky that is to do
> though.
>
> --
> Richard Huxton
> Archonet Ltd


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 04:56 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com