Unix Technical Forum

LIKE search and performance

This is a discussion on LIKE search and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> Mark Lewis wrote: > PG could scan the index looking for matches first and only load the > actual ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read

Reply

 

LinkBack Thread Tools Display Modes
  #11 (permalink)  
Old 04-19-2008, 10:50 AM
Craig James
 
Posts: n/a
Default Re: LIKE search and performance

Mark Lewis wrote:

> PG could scan the index looking for matches first and only load the
> actual rows if it found a match, but that could only be a possible win
> if there were very few matches, because the difference in cost between a
> full index scan and a sequential scan would need to be greater than the
> cost of randomly fetching all of the matching data rows from the table
> to look up the visibility information.


Just out of curiosity: Does Postgress store a duplicate of the data in the index, even for long strings? I thought indexes only had to store the string up to the point where there was no ambiguity, for example, if I have "missing", "mississippi" and "misty", the index only needs "missin", "missis" and "mist" in the actual index. This would make it impossible to use a full index scan for a LIKE query.

Craig

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #12 (permalink)  
Old 04-19-2008, 10:50 AM
Alvaro Herrera
 
Posts: n/a
Default Re: LIKE search and performance

Craig James wrote:
> Mark Lewis wrote:
>
> >PG could scan the index looking for matches first and only load the
> >actual rows if it found a match, but that could only be a possible win
> >if there were very few matches, because the difference in cost between a
> >full index scan and a sequential scan would need to be greater than the
> >cost of randomly fetching all of the matching data rows from the table
> >to look up the visibility information.

>
> Just out of curiosity: Does Postgress store a duplicate of the data in the
> index, even for long strings? I thought indexes only had to store the
> string up to the point where there was no ambiguity, for example, if I have
> "missing", "mississippi" and "misty", the index only needs "missin",
> "missis" and "mist" in the actual index.


What would happen when you inserted a new tuple with just "miss"? You
would need to expand all the other tuples in the index.

--
Alvaro Herrera http://www.flickr.com/photos/alvherre/
"Puedes vivir solo una vez, pero si lo haces bien, una vez es suficiente"

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #13 (permalink)  
Old 04-19-2008, 10:50 AM
Craig James
 
Posts: n/a
Default Re: LIKE search and performance

Alvaro Herrera wrote:
>> Just out of curiosity: Does Postgress store a duplicate of the data in the
>> index, even for long strings? I thought indexes only had to store the
>> string up to the point where there was no ambiguity, for example, if I have
>> "missing", "mississippi" and "misty", the index only needs "missin",
>> "missis" and "mist" in the actual index.

>
> What would happen when you inserted a new tuple with just "miss"? You
> would need to expand all the other tuples in the index.


That's right. This technique used by some index implementations is a tradeoff between size and update speed. Most words in most natural languages can be distinguished by the first few characters. The chances of having to modify more than a few surrounding nodes when you insert "miss" is small, so some implementations choose this method. Other implementations choose to store the full string. I was just curious which method Postgres uses.

Craig


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #14 (permalink)  
Old 04-19-2008, 10:50 AM
PFC
 
Posts: n/a
Default Re: LIKE search and performance


> PG could scan the index looking for matches first and only load the
> actual rows if it found a match, but that could only be a possible win
> if there were very few matches, because the difference in cost between a
> full index scan and a sequential scan would need to be greater than the
> cost of randomly fetching all of the matching data rows from the table
> to look up the visibility information.


If you need to do that kind of thing, ie. seq scanning a table checking
only one column among a large table of many columns, then don't use an
index. An index, being a btree, needs to be traversed in order (or else, a
lot of locking problems come up) which means some random accesses.

So, you could make a table, with 2 columns, updated via triggers : your
text field, and the primary key of your main table. Scanning that would be
faster.

Still, a better solution for searching in text is :

- tsearch2 if you need whole words
- trigrams for any substring match
- xapian for full text search with wildcards (ie. John* = Johnny)

Speed-wise those three will beat any seq scan on a large table by a huge
margin.

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #15 (permalink)  
Old 04-19-2008, 10:51 AM
Richard Huxton
 
Posts: n/a
Default Re: LIKE search and performance

mark@mark.mielke.cc wrote:
>> And since it's basically impossible to know the selectivity of this kind
>> of where condition, I doubt the planner would ever realistically want to
>> choose that plan anyway because of its poor worst-case behavior.

>
> What is a real life example where an intelligent and researched
> database application would issue a like or ilike query as their
> primary condition in a situation where they expected very high
> selectivity?
>
> Avoiding a poor worst-case behaviour for a worst-case behaviour that
> won't happen doesn't seem practical.


But if you are also filtering on e.g. date, and that has an index with
good selectivity, you're never going to use the text index anyway are
you? If you've only got a dozen rows to check against, might as well
just read them in.

The only time it's worth considering the behaviour at all is *if* the
worst-case is possible.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #16 (permalink)  
Old 04-19-2008, 10:51 AM
mark@mark.mielke.cc
 
Posts: n/a
Default Re: LIKE search and performance

On Fri, May 25, 2007 at 09:13:25AM +0100, Richard Huxton wrote:
> mark@mark.mielke.cc wrote:
> >>And since it's basically impossible to know the selectivity of this kind
> >>of where condition, I doubt the planner would ever realistically want to
> >>choose that plan anyway because of its poor worst-case behavior.

> >What is a real life example where an intelligent and researched
> >database application would issue a like or ilike query as their
> >primary condition in a situation where they expected very high
> >selectivity?
> >Avoiding a poor worst-case behaviour for a worst-case behaviour that
> >won't happen doesn't seem practical.

> But if you are also filtering on e.g. date, and that has an index with
> good selectivity, you're never going to use the text index anyway are
> you? If you've only got a dozen rows to check against, might as well
> just read them in.
> The only time it's worth considering the behaviour at all is *if* the
> worst-case is possible.


I notice you did not provide a real life example as requested. :-)

This seems like an ivory tower restriction. Not allowing best performance
in a common situation vs not allowing worst performance in a not-so-common
situation.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
.. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #17 (permalink)  
Old 04-19-2008, 10:51 AM
Richard Huxton
 
Posts: n/a
Default Re: LIKE search and performance

mark@mark.mielke.cc wrote:
> On Fri, May 25, 2007 at 09:13:25AM +0100, Richard Huxton wrote:
>> mark@mark.mielke.cc wrote:
>>>> And since it's basically impossible to know the selectivity of this kind
>>>> of where condition, I doubt the planner would ever realistically want to
>>>> choose that plan anyway because of its poor worst-case behavior.
>>> What is a real life example where an intelligent and researched
>>> database application would issue a like or ilike query as their
>>> primary condition in a situation where they expected very high
>>> selectivity?
>>> Avoiding a poor worst-case behaviour for a worst-case behaviour that
>>> won't happen doesn't seem practical.

>> But if you are also filtering on e.g. date, and that has an index with
>> good selectivity, you're never going to use the text index anyway are
>> you? If you've only got a dozen rows to check against, might as well
>> just read them in.
>> The only time it's worth considering the behaviour at all is *if* the
>> worst-case is possible.

>
> I notice you did not provide a real life example as requested. :-)


OK - any application that allows user-built queries: <choose column:
foo> <choose filter: contains> <choose target: "bar">

Want another? Any application that has a "search by name" box - users
can (and do) put one letter in and hit enter.

Unfortunately you don't always have control over the selectivity of
queries issued.

> This seems like an ivory tower restriction. Not allowing best performance
> in a common situation vs not allowing worst performance in a not-so-common
> situation.


What best performance plan are you thinking of? I'm assuming we're
talking about trailing-wildcard matches here, rather than "contains"
style matches.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #18 (permalink)  
Old 04-19-2008, 10:51 AM
PFC
 
Posts: n/a
Default Re: LIKE search and performance



> OK - any application that allows user-built queries: <choose column:
> foo> <choose filter: contains> <choose target: "bar">
>
> Want another? Any application that has a "search by name" box - users
> can (and do) put one letter in and hit enter.
>
> Unfortunately you don't always have control over the selectivity of
> queries issued.


-*- HOW TO MAKE A SEARCH FORM -*-

Imagine you have to code the search on IMDB.

This is what a smart developer would do

First, he uses AJAX autocompletion, so the thing is reactive.
Then, he does not bother the user with a many-fields form. Instead of
forcing the user to think (users HATE that), he writes smart code.
Does Google Maps have separate fields for country, city, street, zipcode
? No. Because Google is about as smart as it gets.

So, you parse the user query.

If the user types, for instance, less than 3 letters (say, spi), he
probably wants stuff that *begins* with those letters. There is no point
in searching for the letter "a" in a million movie titles database.
So, if the user types "spi", you display "name LIKE spi%", which is
indexed, very fast. And since you're smart, you use AJAX. And you display
only the most popular results (ie. most clicked on).

http://imdb.com/find?s=all&q=spi

Since 99% of the time the user wanted "spiderman" or "spielberg", you're
done and he's happy. Users like being happy.
If the user just types "a", you display the first 10 things that start
with "a", this is useless but the user will marvel at your AJAX skillz.
Then he will probably type in a few other letters.

Then, if the user uses his space bar and types "spi 1980" you'll
recognize a year and display spielberg's movies in 1980.
Converting your strings to phonetics is also a good idea since about 0.7%
of the l33T teenagers can spell stuff especially spiElberg.

Only the guy who wants to know who had sex with marilyn monroe on the
17th day of the shooting of Basic Instinct will need to use the Advanced
search.

If you detect several words, then switch to a prefix-based fulltext
search like Xapian which utterly rocks.
Example : the user types "savin priv", you search for "savin*" NEAR
"priv*" and you display "saving private ryan" before he has even finished
typing the second word of his query. Users love that, they feel
understood, they will click on your ads and buy your products.

In all cases, search results should be limited to less than 100 to be
easy on the database. The user doesn't care about a search returning more
than 10-20 results, he will just rephrase the query, and the time taken to
fetch those thousands of records with name LIKE '%a%' will have been
utterly lost. Who goes to page 2 in google results ?

BOTTOM LINE : databases don't think, you do.

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #19 (permalink)  
Old 04-19-2008, 10:51 AM
mark@mark.mielke.cc
 
Posts: n/a
Default Re: LIKE search and performance

On Fri, May 25, 2007 at 04:35:22PM +0100, Richard Huxton wrote:
> >I notice you did not provide a real life example as requested. :-)

> OK - any application that allows user-built queries: <choose column:
> foo> <choose filter: contains> <choose target: "bar">
> Want another? Any application that has a "search by name" box - users
> can (and do) put one letter in and hit enter.
> Unfortunately you don't always have control over the selectivity of
> queries issued.


The database has 10 million records. The user enters "bar" and it
translates to "%bar%". You are suggesting that we expect bar to match
1 million+ records? :-)

I hope not. I would define this as bad process. I would also use "LIMIT"
to something like "100".

> >This seems like an ivory tower restriction. Not allowing best performance
> >in a common situation vs not allowing worst performance in a not-so-common
> >situation.

> What best performance plan are you thinking of? I'm assuming we're
> talking about trailing-wildcard matches here, rather than "contains"
> style matches.


"Trailing-wildcard" already uses B-Tree index, does it not?

I am speaking of contains, as contains is the one that was said to
require a seqscan. I am questioning why it requires a seqscan. The
claim was made that with MVCC, the index is insufficient to check
for visibility and that the table would need to be accessed anyways,
therefore a seqscan is required. I question whether a like '%bar%'
should be considered a high selectivity query in the general case.
I question whether a worst case should be assumed.

Perhaps I question too much? :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
.. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #20 (permalink)  
Old 04-19-2008, 10:51 AM
Joshua D. Drake
 
Posts: n/a
Default Re: LIKE search and performance

mark@mark.mielke.cc wrote:

> I am speaking of contains, as contains is the one that was said to
> require a seqscan. I am questioning why it requires a seqscan. The
> claim was made that with MVCC, the index is insufficient to check
> for visibility and that the table would need to be accessed anyways,
> therefore a seqscan is required. I question whether a like '%bar%'
> should be considered a high selectivity query in the general case.
> I question whether a worst case should be assumed.


If you are doing %bar% you should be using pg_tgrm or tsearch2.

J


>
> Perhaps I question too much? :-)
>
> Cheers,
> mark
>



---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 08:38 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com