Unix Technical Forum

Re: BETWEEN optimizer problems with single-value range

This is a discussion on Re: BETWEEN optimizer problems with single-value range within the Pgsql Performance forums, part of the PostgreSQL category; --> Kevin Grittner <Kevin.Grittner@wicourts.gov> schrieb: > Attached is a simplified example of a performance problem we have seen, Odd. Can ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 08:17 AM
Andreas Kretschmer
 
Posts: n/a
Default Re: BETWEEN optimizer problems with single-value range

Kevin Grittner <Kevin.Grittner@wicourts.gov> schrieb:

> Attached is a simplified example of a performance problem we have seen,


Odd. Can you tell us your PG-Version?


Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknow)
Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 08:17 AM
Kevin Grittner
 
Posts: n/a
Default Re: BETWEEN optimizer problems with single-value

>>> On Wed, Mar 15, 2006 at 12:17 pm, in message
<20060315181735.GA22240@KanotixBox>, Andreas Kretschmer
<akretschmer@spamfence.net> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> schrieb:
>
>> Attached is a simplified example of a performance problem we have

seen,
>
> Odd. Can you tell us your PG- Version?


I know we really should move to 8.1.3, but I haven't gotten to it yet.
We're on a build from the 8.1 stable branch as of February 10th, with a
patch to allow ANSI standard interpretation of string literals. (So
this is 8.1.2 with some 8.1.3 changes plus the string literal patch.)

If there are any changes in that time frame which might affect this
issue, I could deploy a standard release and make sure that I see the
same behavior. Let me know.

-Kevin



---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 08:17 AM
Tom Lane
 
Posts: n/a
Default Re: BETWEEN optimizer problems with single-value

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Odd. Can you tell us your PG- Version?


> this is 8.1.2 with some 8.1.3 changes plus the string literal patch.)


8.1 is certainly capable of devising the plan you want, for example
in the regression database:

regression=# explain select * from tenk1 where thousand = 10 and tenthous between 42 and 144;
QUERY PLAN
------------------------------------------------------------------------------------
Index Scan using tenk1_thous_tenthous on tenk1 (cost=0.00..6.01 rows=1 width=244)
Index Cond: ((thousand = 10) AND (tenthous >= 42) AND (tenthous <= 144))
(2 rows)

It looks to me like this is a matter of bad cost estimation, ie, it's
thinking the other index is cheaper to use. Why that is is not clear.
Can we see the pg_stats rows for ctofcNo and calDate?

Also, try to force it to generate the plan you want, so we can see what
it thinks the cost is for that. If you temporarily drop the wrong index
you should be able to get there:

begin;
drop index "Cal_CalDate";
explain analyze select ... ;
-- repeat as needed if it chooses some other wrong index
rollback;

I hope you have a play copy of the database to do this in ---
although it would be safe to do the above in a live DB, the DROP would
exclusive-lock the table until you finish the experiment and rollback,
which probably is not good for response time ...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 08:17 AM
Kevin Grittner
 
Posts: n/a
Default Re: BETWEEN optimizer problems with single-value

>>> On Wed, Mar 15, 2006 at 1:17 pm, in message
<28798.1142450270@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> 8.1 is certainly capable of devising the plan you want, for example
> in the regression database:
>
> regression=# explain select * from tenk1 where thousand = 10 and

tenthous
> between 42 and 144;
> QUERY PLAN
>

------------------------------------------------------------------------------------
> Index Scan using tenk1_thous_tenthous on tenk1 (cost=0.00..6.01

rows=1
> width=244)
> Index Cond: ((thousand = 10) AND (tenthous >= 42) AND (tenthous <=

144))
> (2 rows)


That matches one of the examples where it optimized well. I only saw
the bad plan when low and high ends of the BETWEEN range were equal.

> It looks to me like this is a matter of bad cost estimation, ie,

it's
> thinking the other index is cheaper to use. Why that is is not

clear.
> Can we see the pg_stats rows for ctofcNo and calDate?


schemaname | tablename | attname | null_frac | avg_width | n_distinct
| most_common_vals
|
most_common_freqs |
histogram_bounds
| correlation
------------+-----------+---------+-----------+-----------+------------+-----------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+-------------
public | Cal | calDate | 0 | 4 | 2114
|
{2003-06-02,2000-06-20,2001-04-16,2003-06-17,2003-12-01,2004-10-12,2001-04-23,2001-10-15,2002-03-06,2002-05-03}
|
{0.00333333,0.00233333,0.00233333,0.00233333,0.002 33333,0.00233333,0.002,0.002,0.002,0.002}
|
{1986-03-14,1999-06-11,2000-07-14,2001-05-18,2002-03-21,2002-12-04,2003-08-12,2004-05-13,2005-02-01,2005-09-28,2080-12-31}
| 0.0545768
public | Cal | ctofcNo | 0 | 8 | 669
| {0793,1252,1571,0964,0894,1310,"DA ",0944,1668,0400}
|
{0.024,0.019,0.015,0.0123333,0.012,0.011,0.0106667 ,0.01,0.00966667,0.00866667}
| {0000,0507,0733,0878,1203,1336,14AG,1633,1971,3705 ,YVJO}
|
-0.0179665
(2 rows)


> Also, try to force it to generate the plan you want, so we can see

what
> it thinks the cost is for that. If you temporarily drop the wrong

index
> you should be able to get there:
>
> begin;
> drop index "Cal_CalDate";
> explain analyze select ... ;
> -- repeat as needed if it chooses some other wrong index
> rollback;


Sort (cost=4.03..4.03 rows=1 width=12) (actual time=48.484..48.486
rows=4 loops=1)
Sort Key: "calDate", "startTime"
-> Index Scan using "Cal_CtofcNo" on "Cal" "CA" (cost=0.00..4.02
rows=1 width=12) (actual time=36.750..48.228 rows=4 loops=1)
Index Cond: ((("ctofcNo")::bpchar = '2192'::bpchar) AND
(("calDate")::date >= '2006-03-15'::date) AND (("calDate")::date <=
'2006-03-15'::date))
Total runtime: 56.616 ms


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 08:17 AM
Simon Riggs
 
Posts: n/a
Default Re: BETWEEN optimizer problems with single-value

On Wed, 2006-03-15 at 14:17 -0500, Tom Lane wrote:

> It looks to me like this is a matter of bad cost estimation, ie, it's
> thinking the other index is cheaper to use. Why that is is not clear.
> Can we see the pg_stats rows for ctofcNo and calDate?


ISTM that when the BETWEEN constants match we end up in this part of
clauselist_selectivity()...



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 08:17 AM
Simon Riggs
 
Posts: n/a
Default Re: BETWEEN optimizer problems with single-value

On Thu, 2006-03-16 at 00:07 +0000, Simon Riggs wrote:
> On Wed, 2006-03-15 at 14:17 -0500, Tom Lane wrote:
>
> > It looks to me like this is a matter of bad cost estimation, ie, it's
> > thinking the other index is cheaper to use. Why that is is not clear.
> > Can we see the pg_stats rows for ctofcNo and calDate?

>
> ISTM that when the BETWEEN constants match we end up in this part of
> clauselist_selectivity()...


(and now for the whole email...)

/*
* It's just roundoff error; use a small positive
* value
*/
s2 = 1.0e-10;

so that the planner underestimates the cost of using "Cal_CalDate" so
that it ends up the same as "Cal_CtofcNo", and then we pick
"Cal_CalDate" because it was created first.

Using 1.0e-10 isn't very useful... the selectivity for a range should
never be less than the selectivity for an equality, so we should simply
put in a test against one of the pseudo constants and use that as the
minimal value. That should lead to raising the apparent cost of
Cal_CalDate so that Cal_CtofcNo can take precedence.

Best Regards, Simon Riggs






---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 08:17 AM
Tom Lane
 
Posts: n/a
Default Re: BETWEEN optimizer problems with single-value

Simon Riggs <simon@2ndquadrant.com> writes:
>> ISTM that when the BETWEEN constants match we end up in this part of
>> clauselist_selectivity()...


Yeah, I think you are right.

> so that the planner underestimates the cost of using "Cal_CalDate" so
> that it ends up the same as "Cal_CtofcNo", and then we pick
> "Cal_CalDate" because it was created first.


No, it doesn't end up the same --- but the difference is small enough to
be in the roundoff-error regime. The real issue here is that we're
effectively assuming that one row will be fetched from the index in both
cases, and this is clearly not the case for the Cal_CalDate index. So
we need a more accurate estimate for the boundary case.

> Using 1.0e-10 isn't very useful... the selectivity for a range should
> never be less than the selectivity for an equality, so we should simply
> put in a test against one of the pseudo constants and use that as the
> minimal value.


That's easier said than done, because you'd first have to find the
appropriate equality operator to use (ie, one having semantics that
agree with the inequality operators). Another point is that the above
statement is simply wrong, consider
calDate BETWEEN '2006-03-15' AND '2006-03-14'
for which an estimate of zero really is correct.

Possibly we could drop this code's reliance on seeing
SCALARLTSEL/SCALARGTSEL as the estimators, and instead try to locate a
common btree opclass for the operators --- which would then let us
identify the right equality operator to use, and also let us distinguish
> from >= etc. If we're trying to get the boundary cases right I

suspect we have to account for that. I could see such an approach being
tremendously slow though :-(, because we'd go looking for btree
opclasses even for operators that have nothing to do with < or >.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 08:18 AM
Simon Riggs
 
Posts: n/a
Default Re: BETWEEN optimizer problems with single-value

On Wed, 2006-03-15 at 21:05 -0500, Tom Lane wrote:
> So we need a more accurate estimate for the boundary case.


Agreed.

> > Using 1.0e-10 isn't very useful... the selectivity for a range should
> > never be less than the selectivity for an equality, so we should simply
> > put in a test against one of the pseudo constants and use that as the
> > minimal value.

>
> That's easier said than done, because you'd first have to find the
> appropriate equality operator to use (ie, one having semantics that
> agree with the inequality operators).

....

Kevin: this is also the reason we can't simply transform the WHERE
clause into a more appropriate form...

> Possibly we could drop this code's reliance on seeing
> SCALARLTSEL/SCALARGTSEL as the estimators, and instead try to locate a
> common btree opclass for the operators --- which would then let us
> identify the right equality operator to use, and also let us distinguish
> > from >= etc. If we're trying to get the boundary cases right I

> suspect we have to account for that. I could see such an approach being
> tremendously slow though :-(, because we'd go looking for btree
> opclasses even for operators that have nothing to do with < or >.


Trying to get the information in the wrong place would be very
expensive, I agree. But preparing that information when we have access
to it and passing it through the plan would be much cheaper. Relating
op->opclass will be very useful in other places in planning, even if any
one case seems not to justify the work to record it. (This case feels
like deja vu, all over again.)

The operator and the opclass are only connected via an index access
method, but for a particular index each column has only one opclass. So
the opclass will have a 1-1 correspondence with the operator for *that*
plan only, realising that other plans might have different
correspondences. find_usable_indexes() or thereabouts could annotate a
restriction OpExpr with the opclass it will use.

Once we have the link, clauselist_selectivity() can trivially compare
opclasses for both OpExprs, then retrieve other information for that
opclass for various purposes.

Seems lots of work for such a corner case, but would be worth it if this
solves other problems as well.

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 08:18 AM
Tom Lane
 
Posts: n/a
Default Re: BETWEEN optimizer problems with single-value

Simon Riggs <simon@2ndquadrant.com> writes:
> Trying to get the information in the wrong place would be very
> expensive, I agree. But preparing that information when we have access
> to it and passing it through the plan would be much cheaper.


Where would that be?

> The operator and the opclass are only connected via an index access
> method, but for a particular index each column has only one opclass.


If you're proposing making clauselist_selectivity depend on what indexes
exist, I think that's very much the wrong approach. In the first place,
it still has to give usable answers for unindexed columns, and in the
second place there might be multiple indexes with different opclasses
for the same column, so the ambiguity problem still exists.

I have been wondering if we shouldn't add some more indexes on pg_amop
or something to make it easier to do this sort of lookup --- we
definitely seem to be finding multiple reasons to want to look up
which opclasses contain a given operator.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 08:18 AM
Simon Riggs
 
Posts: n/a
Default Re: BETWEEN optimizer problems with single-value

On Thu, 2006-03-16 at 10:57 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Trying to get the information in the wrong place would be very
> > expensive, I agree. But preparing that information when we have access
> > to it and passing it through the plan would be much cheaper.

>
> Where would that be?
>
> > The operator and the opclass are only connected via an index access
> > method, but for a particular index each column has only one opclass.

>
> If you're proposing making clauselist_selectivity depend on what indexes
> exist, I think that's very much the wrong approach.


Using available information sounds OK to me. Guess you're thinking of
the lack of plan invalidation?

> In the first place,
> it still has to give usable answers for unindexed columns, and in the
> second place there might be multiple indexes with different opclasses
> for the same column, so the ambiguity problem still exists.


I was thinking that we would fill out the OpExpr with different
opclasses for each plan, so each one sees a different story. (I was
thinking there was a clauselist for each plan; if not, there could be.)
So the multiple index problem shouldn't exist.

Non-indexed cases still cause the problem, true.

> I have been wondering if we shouldn't add some more indexes on pg_amop
> or something to make it easier to do this sort of lookup --- we
> definitely seem to be finding multiple reasons to want to look up
> which opclasses contain a given operator.


Agreed, but still looking for better way than that.

[BTW how do you add new indexes to system tables? I want to add one to
pg_inherits but not sure where to look.]

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 04:52 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com