This is a discussion on LIKE search and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> Mark Lewis wrote: > PG could scan the index looking for matches first and only load the > actual ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Mark Lewis wrote: > PG could scan the index looking for matches first and only load the > actual rows if it found a match, but that could only be a possible win > if there were very few matches, because the difference in cost between a > full index scan and a sequential scan would need to be greater than the > cost of randomly fetching all of the matching data rows from the table > to look up the visibility information. Just out of curiosity: Does Postgress store a duplicate of the data in the index, even for long strings? I thought indexes only had to store the string up to the point where there was no ambiguity, for example, if I have "missing", "mississippi" and "misty", the index only needs "missin", "missis" and "mist" in the actual index. This would make it impossible to use a full index scan for a LIKE query. Craig ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Craig James wrote: > Mark Lewis wrote: > > >PG could scan the index looking for matches first and only load the > >actual rows if it found a match, but that could only be a possible win > >if there were very few matches, because the difference in cost between a > >full index scan and a sequential scan would need to be greater than the > >cost of randomly fetching all of the matching data rows from the table > >to look up the visibility information. > > Just out of curiosity: Does Postgress store a duplicate of the data in the > index, even for long strings? I thought indexes only had to store the > string up to the point where there was no ambiguity, for example, if I have > "missing", "mississippi" and "misty", the index only needs "missin", > "missis" and "mist" in the actual index. What would happen when you inserted a new tuple with just "miss"? You would need to expand all the other tuples in the index. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ "Puedes vivir solo una vez, pero si lo haces bien, una vez es suficiente" ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| Alvaro Herrera wrote: >> Just out of curiosity: Does Postgress store a duplicate of the data in the >> index, even for long strings? I thought indexes only had to store the >> string up to the point where there was no ambiguity, for example, if I have >> "missing", "mississippi" and "misty", the index only needs "missin", >> "missis" and "mist" in the actual index. > > What would happen when you inserted a new tuple with just "miss"? You > would need to expand all the other tuples in the index. That's right. This technique used by some index implementations is a tradeoff between size and update speed. Most words in most natural languages can be distinguished by the first few characters. The chances of having to modify more than a few surrounding nodes when you insert "miss" is small, so some implementations choose this method. Other implementations choose to store the full string. I was just curious which method Postgres uses. Craig ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| > PG could scan the index looking for matches first and only load the > actual rows if it found a match, but that could only be a possible win > if there were very few matches, because the difference in cost between a > full index scan and a sequential scan would need to be greater than the > cost of randomly fetching all of the matching data rows from the table > to look up the visibility information. If you need to do that kind of thing, ie. seq scanning a table checking only one column among a large table of many columns, then don't use an index. An index, being a btree, needs to be traversed in order (or else, a lot of locking problems come up) which means some random accesses. So, you could make a table, with 2 columns, updated via triggers : your text field, and the primary key of your main table. Scanning that would be faster. Still, a better solution for searching in text is : - tsearch2 if you need whole words - trigrams for any substring match - xapian for full text search with wildcards (ie. John* = Johnny) Speed-wise those three will beat any seq scan on a large table by a huge margin. ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| mark@mark.mielke.cc wrote: >> And since it's basically impossible to know the selectivity of this kind >> of where condition, I doubt the planner would ever realistically want to >> choose that plan anyway because of its poor worst-case behavior. > > What is a real life example where an intelligent and researched > database application would issue a like or ilike query as their > primary condition in a situation where they expected very high > selectivity? > > Avoiding a poor worst-case behaviour for a worst-case behaviour that > won't happen doesn't seem practical. But if you are also filtering on e.g. date, and that has an index with good selectivity, you're never going to use the text index anyway are you? If you've only got a dozen rows to check against, might as well just read them in. The only time it's worth considering the behaviour at all is *if* the worst-case is possible. -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Fri, May 25, 2007 at 09:13:25AM +0100, Richard Huxton wrote: > mark@mark.mielke.cc wrote: > >>And since it's basically impossible to know the selectivity of this kind > >>of where condition, I doubt the planner would ever realistically want to > >>choose that plan anyway because of its poor worst-case behavior. > >What is a real life example where an intelligent and researched > >database application would issue a like or ilike query as their > >primary condition in a situation where they expected very high > >selectivity? > >Avoiding a poor worst-case behaviour for a worst-case behaviour that > >won't happen doesn't seem practical. > But if you are also filtering on e.g. date, and that has an index with > good selectivity, you're never going to use the text index anyway are > you? If you've only got a dozen rows to check against, might as well > just read them in. > The only time it's worth considering the behaviour at all is *if* the > worst-case is possible. I notice you did not provide a real life example as requested. :-) This seems like an ivory tower restriction. Not allowing best performance in a common situation vs not allowing worst performance in a not-so-common situation. Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ .. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| mark@mark.mielke.cc wrote: > On Fri, May 25, 2007 at 09:13:25AM +0100, Richard Huxton wrote: >> mark@mark.mielke.cc wrote: >>>> And since it's basically impossible to know the selectivity of this kind >>>> of where condition, I doubt the planner would ever realistically want to >>>> choose that plan anyway because of its poor worst-case behavior. >>> What is a real life example where an intelligent and researched >>> database application would issue a like or ilike query as their >>> primary condition in a situation where they expected very high >>> selectivity? >>> Avoiding a poor worst-case behaviour for a worst-case behaviour that >>> won't happen doesn't seem practical. >> But if you are also filtering on e.g. date, and that has an index with >> good selectivity, you're never going to use the text index anyway are >> you? If you've only got a dozen rows to check against, might as well >> just read them in. >> The only time it's worth considering the behaviour at all is *if* the >> worst-case is possible. > > I notice you did not provide a real life example as requested. :-) OK - any application that allows user-built queries: <choose column: foo> <choose filter: contains> <choose target: "bar"> Want another? Any application that has a "search by name" box - users can (and do) put one letter in and hit enter. Unfortunately you don't always have control over the selectivity of queries issued. > This seems like an ivory tower restriction. Not allowing best performance > in a common situation vs not allowing worst performance in a not-so-common > situation. What best performance plan are you thinking of? I'm assuming we're talking about trailing-wildcard matches here, rather than "contains" style matches. -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| > OK - any application that allows user-built queries: <choose column: > foo> <choose filter: contains> <choose target: "bar"> > > Want another? Any application that has a "search by name" box - users > can (and do) put one letter in and hit enter. > > Unfortunately you don't always have control over the selectivity of > queries issued. -*- HOW TO MAKE A SEARCH FORM -*- Imagine you have to code the search on IMDB. This is what a smart developer would do First, he uses AJAX autocompletion, so the thing is reactive. Then, he does not bother the user with a many-fields form. Instead of forcing the user to think (users HATE that), he writes smart code. Does Google Maps have separate fields for country, city, street, zipcode ? No. Because Google is about as smart as it gets. So, you parse the user query. If the user types, for instance, less than 3 letters (say, spi), he probably wants stuff that *begins* with those letters. There is no point in searching for the letter "a" in a million movie titles database. So, if the user types "spi", you display "name LIKE spi%", which is indexed, very fast. And since you're smart, you use AJAX. And you display only the most popular results (ie. most clicked on). http://imdb.com/find?s=all&q=spi Since 99% of the time the user wanted "spiderman" or "spielberg", you're done and he's happy. Users like being happy. If the user just types "a", you display the first 10 things that start with "a", this is useless but the user will marvel at your AJAX skillz. Then he will probably type in a few other letters. Then, if the user uses his space bar and types "spi 1980" you'll recognize a year and display spielberg's movies in 1980. Converting your strings to phonetics is also a good idea since about 0.7% of the l33T teenagers can spell stuff especially spiElberg. Only the guy who wants to know who had sex with marilyn monroe on the 17th day of the shooting of Basic Instinct will need to use the Advanced search. If you detect several words, then switch to a prefix-based fulltext search like Xapian which utterly rocks. Example : the user types "savin priv", you search for "savin*" NEAR "priv*" and you display "saving private ryan" before he has even finished typing the second word of his query. Users love that, they feel understood, they will click on your ads and buy your products. In all cases, search results should be limited to less than 100 to be easy on the database. The user doesn't care about a search returning more than 10-20 results, he will just rephrase the query, and the time taken to fetch those thousands of records with name LIKE '%a%' will have been utterly lost. Who goes to page 2 in google results ? BOTTOM LINE : databases don't think, you do. ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Fri, May 25, 2007 at 04:35:22PM +0100, Richard Huxton wrote: > >I notice you did not provide a real life example as requested. :-) > OK - any application that allows user-built queries: <choose column: > foo> <choose filter: contains> <choose target: "bar"> > Want another? Any application that has a "search by name" box - users > can (and do) put one letter in and hit enter. > Unfortunately you don't always have control over the selectivity of > queries issued. The database has 10 million records. The user enters "bar" and it translates to "%bar%". You are suggesting that we expect bar to match 1 million+ records? :-) I hope not. I would define this as bad process. I would also use "LIMIT" to something like "100". > >This seems like an ivory tower restriction. Not allowing best performance > >in a common situation vs not allowing worst performance in a not-so-common > >situation. > What best performance plan are you thinking of? I'm assuming we're > talking about trailing-wildcard matches here, rather than "contains" > style matches. "Trailing-wildcard" already uses B-Tree index, does it not? I am speaking of contains, as contains is the one that was said to require a seqscan. I am questioning why it requires a seqscan. The claim was made that with MVCC, the index is insufficient to check for visibility and that the table would need to be accessed anyways, therefore a seqscan is required. I question whether a like '%bar%' should be considered a high selectivity query in the general case. I question whether a worst case should be assumed. Perhaps I question too much? :-) Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ .. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| ||||
| mark@mark.mielke.cc wrote: > I am speaking of contains, as contains is the one that was said to > require a seqscan. I am questioning why it requires a seqscan. The > claim was made that with MVCC, the index is insufficient to check > for visibility and that the table would need to be accessed anyways, > therefore a seqscan is required. I question whether a like '%bar%' > should be considered a high selectivity query in the general case. > I question whether a worst case should be assumed. If you are doing %bar% you should be using pg_tgrm or tsearch2. J > > Perhaps I question too much? :-) > > Cheers, > mark > ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |