Unix Technical Forum

LIKE search and performance

This is a discussion on LIKE search and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> PFC wrote: > >> OK - any application that allows user-built queries: <choose column: >> foo> <choose filter: contains> ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #21 (permalink)  
Old 04-19-2008, 10:51 AM
Richard Huxton
 
Posts: n/a
Default Re: LIKE search and performance

PFC wrote:
>
>> OK - any application that allows user-built queries: <choose column:
>> foo> <choose filter: contains> <choose target: "bar">
>>
>> Want another? Any application that has a "search by name" box - users
>> can (and do) put one letter in and hit enter.
>>
>> Unfortunately you don't always have control over the selectivity of
>> queries issued.

>
> -*- HOW TO MAKE A SEARCH FORM -*-
>
> Imagine you have to code the search on IMDB.
>
> This is what a smart developer would do


All good domain-specific tips to provide users with a satisfying
search-experience.

None of which address the question of what plan PG should produce for:
SELECT * FROM bigtable WHERE foo LIKE 's%'

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #22 (permalink)  
Old 04-19-2008, 10:51 AM
PFC
 
Posts: n/a
Default Re: LIKE search and performance

> None of which address the question of what plan PG should produce for:
> SELECT * FROM bigtable WHERE foo LIKE 's%'


Ah, this one already uses the btree since the '%' is at the end.
My point is that a search like this will yield too many results to be
useful to the user anyway, so optimizing its performance is a kind of red
herring.


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #23 (permalink)  
Old 04-19-2008, 10:51 AM
Richard Huxton
 
Posts: n/a
Default Re: LIKE search and performance

mark@mark.mielke.cc wrote:
> On Fri, May 25, 2007 at 04:35:22PM +0100, Richard Huxton wrote:
>>> I notice you did not provide a real life example as requested. :-)

>> OK - any application that allows user-built queries: <choose column:
>> foo> <choose filter: contains> <choose target: "bar">
>> Want another? Any application that has a "search by name" box - users
>> can (and do) put one letter in and hit enter.
>> Unfortunately you don't always have control over the selectivity of
>> queries issued.

>
> The database has 10 million records. The user enters "bar" and it
> translates to "%bar%". You are suggesting that we expect bar to match
> 1 million+ records? :-)


I was saying that you don't know. At least, I don't know of any cheap
way of gathering full substring stats or doing a full substring
indexing. Even tsearch2 can't do that.

> I hope not. I would define this as bad process. I would also use "LIMIT"
> to something like "100".


Yes, but that's not the query we're talking about is it? If possible you
don't do '%bar%' searches at all. If you do, you try to restrict it
further or LIMIT the results. There's nothing to discuss in these cases.

>>> This seems like an ivory tower restriction. Not allowing best performance
>>> in a common situation vs not allowing worst performance in a not-so-common
>>> situation.

>> What best performance plan are you thinking of? I'm assuming we're
>> talking about trailing-wildcard matches here, rather than "contains"
>> style matches.

>
> "Trailing-wildcard" already uses B-Tree index, does it not?


Yes, it searches the btree and then checks the data for visibility. I
thought that was what you felt could be worked around. It appears I was
wrong.

> I am speaking of contains, as contains is the one that was said to
> require a seqscan. I am questioning why it requires a seqscan.


Well, you seemed to be suggesting you had something better in mind. At
least, that was my reading of your original post.

> The
> claim was made that with MVCC, the index is insufficient to check
> for visibility


True, for PG's implementation of MVCC. You *could* have visibility in
each index, but that obviously would take more space. For a table with
many indexes, that could be a *lot* more space. You also have to update
all that visibilty information too.

> and that the table would need to be accessed anyways,
> therefore a seqscan is required. I question whether a like '%bar%'
> should be considered a high selectivity query in the general case.
> I question whether a worst case should be assumed.


Well, the general rule-of-thumb is only about 10% for the changeover
between index & seq-scan. That is, once you are reading 10% of the rows
on disk (to check visibility) you might as well read them all (since
you'll be reading most of the blocks anyway if the rows are randomly
distributed). If you are doing SELECT * from that table then you'll want
all that data you read. If you are doing SELECT count(*) then you only
wanted the visibility :-(

Now you and I can look at a substring and probably make a good guess how
common it is (assuming we know the targets are British surnames or
Japanese towns). PG needs one number - or rather, it picks one number
for each length of search-string (afaik).

> Perhaps I question too much? :-)


Not sure it's possible to question too much :-)
However, you need to provide answers occasionally too - what numbers
would you pick? :-)

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #24 (permalink)  
Old 04-19-2008, 10:51 AM
Richard Huxton
 
Posts: n/a
Default Re: LIKE search and performance

PFC wrote:
>> None of which address the question of what plan PG should produce for:
>> SELECT * FROM bigtable WHERE foo LIKE 's%'

>
> Ah, this one already uses the btree since the '%' is at the end.
> My point is that a search like this will yield too many results to
> be useful to the user anyway, so optimizing its performance is a kind of
> red herring.


At the *application level* yes.
At the *query planner* level no.

At the query planner level I just want it to come up with the best plan
it can. The original argument was that PG's estimate of the number of
matching rows was too optimistic (or pessimistic) in the case where we
are doing a contains substring-search.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #25 (permalink)  
Old 04-19-2008, 10:51 AM
Gregory Stark
 
Posts: n/a
Default Re: LIKE search and performance

"Richard Huxton" <dev@archonet.com> writes:

> Now you and I can look at a substring and probably make a good guess how common
> it is (assuming we know the targets are British surnames or Japanese towns). PG
> needs one number - or rather, it picks one number for each length of
> search-string (afaik).


I don't think that's true. Postgres calculates the lower and upper bound
implied by the search pattern and then uses the histogram to estimate how
selective that range is. It's sometimes surprisingly good but obviously it's
not perfect.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #26 (permalink)  
Old 04-19-2008, 10:52 AM
Richard Huxton
 
Posts: n/a
Default Re: LIKE search and performance

Gregory Stark wrote:
> "Richard Huxton" <dev@archonet.com> writes:
>
>> Now you and I can look at a substring and probably make a good guess how common
>> it is (assuming we know the targets are British surnames or Japanese towns). PG
>> needs one number - or rather, it picks one number for each length of
>> search-string (afaik).

>
> I don't think that's true. Postgres calculates the lower and upper bound
> implied by the search pattern and then uses the histogram to estimate how
> selective that range is. It's sometimes surprisingly good but obviously it's
> not perfect.


Sorry - I'm obviously picking my words badly today.

I meant for the "contains" substring match. It gives different (goes
away and checks...yes) predictions based on string length. So it guesses
that LIKE '%aaa%' will match more than LIKE '%aaaa%'. Of course, if we
were matching surnames you and I could say that this is very unlikely,
but without some big statistics table I guess there's not much more PG
can do.

For a trailing wildcard LIKE 'aaa%' it can and does as you say convert
this into something along the lines of (>= 'aaa' AND < 'aab'). Although
IIRC that depends if your locale allows such (not sure, I don't really
use non-C/non-English locales enough).

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #27 (permalink)  
Old 04-19-2008, 10:57 AM
James Mansion
 
Posts: n/a
Default Re: LIKE search and performance

mark@mark.mielke.cc wrote:
> What is a real life example where an intelligent and researched
> database application would issue a like or ilike query as their
> primary condition in a situation where they expected very high
> selectivity?
>

In my case the canonical example is to search against textual keys where
the search is
performed automatically if the user hs typed enough data and paused. In
almost all
cases the '%' trails, and I'm looking for 'starts with' in effect.
usually the search will have
a specified upper number of returned rows, if that's an available
facility. I realise in this
case that matching against the index does not allow the match count
unless we check
MVCC as we go, but I don't see why another thread can't be doing that.

James


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #28 (permalink)  
Old 04-19-2008, 10:57 AM
mark@mark.mielke.cc
 
Posts: n/a
Default Re: LIKE search and performance

On Wed, Jun 06, 2007 at 11:23:13PM +0100, James Mansion wrote:
> mark@mark.mielke.cc wrote:
> >What is a real life example where an intelligent and researched
> >database application would issue a like or ilike query as their
> >primary condition in a situation where they expected very high
> >selectivity?

> In my case the canonical example is to search against textual keys
> where the search is performed automatically if the user hs typed
> enough data and paused. In almost all cases the '%' trails, and I'm
> looking for 'starts with' in effect. usually the search will have a
> specified upper number of returned rows, if that's an available
> facility. I realise in this case that matching against the index
> does not allow the match count unless we check MVCC as we go, but I
> don't see why another thread can't be doing that.


I believe PostgreSQL already considers using the index for "starts
with", so this wasn't part of the discussion for me. Sorry that this
wasn't clear.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
.. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:31 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com