Unix Technical Forum

Re:

This is a discussion on Re: within the Pgsql Performance forums, part of the PostgreSQL category; --> On Mar 5, 2007, at 8:54 PM, Tom Lane wrote: > Hm, the cost for the upper nestloop is ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 10:21 AM
Jeff Cole
 
Posts: n/a
Default Re:


On Mar 5, 2007, at 8:54 PM, Tom Lane wrote:

> Hm, the cost for the upper nestloop is way less than you would expect
> given that the HASH IN join is going to have to be repeated 100+
> times.
> I think this must be due to a very low "join_in_selectivity" estimate
> but I'm not sure why you are getting that, especially seeing that the
> rowcount estimates aren't far off. Can you show us the pg_stats
> rows for symptoms.id and symptom_reports.symptom_id?
>


Hi Tom, thanks for the response. Here are the pg_stats. I think I
understand what the stats say, but I don't know what to conclude from
them.


plm_stage=# select * from pg_stats where tablename = 'symptoms' and
attname = 'id';
schemaname | tablename | attname | null_frac | avg_width | n_distinct
| most_common_vals | most_common_freqs |
histogram_bounds | correlation
------------+-----------+---------+-----------+-----------
+------------+------------------+-------------------
+-------------------------------------+-------------
public | symptoms | id | 0 | 4 | -1
| | |
{1,11,24,34,46,57,71,85,95,106,117} | 0.451606


plm_stage=# select * from pg_stats where tablename =
'symptom_reports' and attname = 'symptom_id';
schemaname | tablename | attname | null_frac | avg_width |
n_distinct | most_common_vals
|
most_common_freqs |
histogram_bounds | correlation
------------+-----------------+------------+-----------+-----------
+------------+------------------------
+-----------------------------------------------------------------------
---------------+-------------------------------------+-------------
public | symptom_reports | symptom_id | 0 | 4
| 80 | {3,2,4,1,5,8,9,7,10,6} |
{0.094,0.0933333,0.0933333,0.092,0.0913333,0.09033 33,0.0866667,0.0843333
,0.084,0.08} | {12,18,24,30,38,44,51,57,91,91,114} | 0.0955925



And Ismo, I followed your suggestion to re-write the SQL more
cleanly, and you are right it was faster, so that is certainly a
solution. Although I am still curious why my original query slowed
down after the vacuum analyze. In any case, here is the explain
analyze from the new query. Compare that to the 3441.452 ms of the
old query after the analyze (and 134.045 ms before the analyze):

plm_stage=# explain analyze SELECT count(distinct s.id) AS count_all
FROM symptoms s ,symptom_reports sr,users u WHERE s.id=sr.symptom_id
and sr.user_id=u.id and u.disease_id in (1);
QUERY
PLAN
------------------------------------------------------------------------
-----------------------------------------------------------------
Aggregate (cost=770.05..770.06 rows=1 width=4) (actual
time=176.749..176.751 rows=1 loops=1)
-> Hash Join (cost=89.43..737.50 rows=13020 width=4) (actual
time=7.762..142.063 rows=13038 loops=1)
Hash Cond: ("outer".symptom_id = "inner".id)
-> Hash Join (cost=86.09..538.86 rows=13020 width=4)
(actual time=7.277..89.293 rows=13038 loops=1)
Hash Cond: ("outer".user_id = "inner".id)
-> Seq Scan on symptom_reports sr
(cost=0.00..257.38 rows=13038 width=8) (actual time=0.003..30.499
rows=13038 loops=1)
-> Hash (cost=82.41..82.41 rows=1471 width=4)
(actual time=7.261..7.261 rows=1471 loops=1)
-> Seq Scan on users u (cost=0.00..82.41
rows=1471 width=4) (actual time=0.006..4.133 rows=1471 loops=1)
Filter: (disease_id = 1)
-> Hash (cost=3.07..3.07 rows=107 width=4) (actual
time=0.469..0.469 rows=107 loops=1)
-> Seq Scan on symptoms s (cost=0.00..3.07 rows=107
width=4) (actual time=0.007..0.247 rows=107 loops=1)
Total runtime: 176.842 ms
(12 rows)



---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 10:21 AM
Tom Lane
 
Posts: n/a
Default Re:

Jeff Cole <cole.jeff@gmail.com> writes:
> Hi Tom, thanks for the response. Here are the pg_stats. I think I
> understand what the stats say, but I don't know what to conclude from
> them.


OK, the symptom_id row claims there are only 80 distinct values of
symptom_id in symptom_reports. This is a bit low (looks like the true
state of affairs is that all but 2 of the 108 entries of symptoms are
represented in symptom_reports), but it's not horridly off considering
that you're using the rather low default statistics_target. What
happens is that the planner expects that on average only 80 rows of the
inner join will need to be scanned to find a match for a given symptoms.id,
and this makes the nestloop look cheap. However, per your previous
EXPLAIN ANALYZE:

> -> Nested Loop IN Join (cost=149.05..586.26 rows=85 width=0) (actual time=54.517..3441.115 rows=106 loops=1)
> Join Filter: ("outer".id = "inner".symptom_id)
> -> Seq Scan on symptoms (cost=0.00..3.08 rows=108 width=4) (actual time=0.007..0.273 rows=108 loops=1)
> -> Hash IN Join (cost=149.05..603.90 rows=13074 width=4) (actual time=0.078..24.503 rows=3773 loops=108)



the *actual* average number of rows scanned is 3773. I'm not sure why
this should be --- is it possible that the distribution of keys in
symptom_reports is wildly uneven? This could happen if all of the
physically earlier rows in symptom_reports contain the same small set
of symptom_ids, but the stats don't seem to indicate such a skew.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 10:21 AM
Jeff Cole
 
Posts: n/a
Default Re:


On Mar 6, 2007, at 11:40 AM, Tom Lane wrote:

> the *actual* average number of rows scanned is 3773. I'm not sure why
> this should be --- is it possible that the distribution of keys in
> symptom_reports is wildly uneven? This could happen if all of the
> physically earlier rows in symptom_reports contain the same small set
> of symptom_ids, but the stats don't seem to indicate such a skew.


Hi Tom, you are correct, the distribution is uneven... In the 13k
symptom_reports rows, there are 105 distinct symptom_ids. But the
first 8k symptom_reports rows only have 10 distinct symptom_ids.
Could this cause the problem and would there be anything I could do
to address it?

Thanks for all your help, I appreciate it.

-Jeff

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 10:21 AM
Tom Lane
 
Posts: n/a
Default Re:

Jeff Cole <cole.jeff@gmail.com> writes:
> Hi Tom, you are correct, the distribution is uneven... In the 13k
> symptom_reports rows, there are 105 distinct symptom_ids. But the
> first 8k symptom_reports rows only have 10 distinct symptom_ids.
> Could this cause the problem and would there be anything I could do
> to address it?


Ah-hah, yeah, that explains it. Is it worth your time to deliberately
randomize the order of the rows in symptom_reports? It wasn't clear
whether this query is actually something you need to optimize. You
might have other queries that benefit from the rows being in nonrandom
order, so I'm not entirely sure that this is a good thing to do ...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 10:21 AM
Jeff Cole
 
Posts: n/a
Default Re:

Hi Tom, thanks for the great job getting to the core of this
problem... I would say I'm not sure I want randomize the rows (not
really even sure how to do it without truncating the table and re-
adding the records in a random order). I think for the moment I
will either a) re-write the query per Ismo's suggestion, or b) wait
until more data comes into that table, potentially kicking the query
planner into not using the Nested Loop anymore.

Anyway, thanks again, I appreciate it...

-Jeff


On Mar 7, 2007, at 11:37 AM, Tom Lane wrote:

> Jeff Cole <cole.jeff@gmail.com> writes:
>> Hi Tom, you are correct, the distribution is uneven... In the 13k
>> symptom_reports rows, there are 105 distinct symptom_ids. But the
>> first 8k symptom_reports rows only have 10 distinct symptom_ids.
>> Could this cause the problem and would there be anything I could do
>> to address it?

>
> Ah-hah, yeah, that explains it. Is it worth your time to deliberately
> randomize the order of the rows in symptom_reports? It wasn't clear
> whether this query is actually something you need to optimize. You
> might have other queries that benefit from the rows being in nonrandom
> order, so I'm not entirely sure that this is a good thing to do ...
>
> regards, tom lane



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:13 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com