This is a discussion on LIKE search and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> PFC wrote: > >> OK - any application that allows user-built queries: <choose column: >> foo> <choose filter: contains> ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| PFC wrote: > >> OK - any application that allows user-built queries: <choose column: >> foo> <choose filter: contains> <choose target: "bar"> >> >> Want another? Any application that has a "search by name" box - users >> can (and do) put one letter in and hit enter. >> >> Unfortunately you don't always have control over the selectivity of >> queries issued. > > -*- HOW TO MAKE A SEARCH FORM -*- > > Imagine you have to code the search on IMDB. > > This is what a smart developer would do All good domain-specific tips to provide users with a satisfying search-experience. None of which address the question of what plan PG should produce for: SELECT * FROM bigtable WHERE foo LIKE 's%' -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| > None of which address the question of what plan PG should produce for: > SELECT * FROM bigtable WHERE foo LIKE 's%' Ah, this one already uses the btree since the '%' is at the end. My point is that a search like this will yield too many results to be useful to the user anyway, so optimizing its performance is a kind of red herring. ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| mark@mark.mielke.cc wrote: > On Fri, May 25, 2007 at 04:35:22PM +0100, Richard Huxton wrote: >>> I notice you did not provide a real life example as requested. :-) >> OK - any application that allows user-built queries: <choose column: >> foo> <choose filter: contains> <choose target: "bar"> >> Want another? Any application that has a "search by name" box - users >> can (and do) put one letter in and hit enter. >> Unfortunately you don't always have control over the selectivity of >> queries issued. > > The database has 10 million records. The user enters "bar" and it > translates to "%bar%". You are suggesting that we expect bar to match > 1 million+ records? :-) I was saying that you don't know. At least, I don't know of any cheap way of gathering full substring stats or doing a full substring indexing. Even tsearch2 can't do that. > I hope not. I would define this as bad process. I would also use "LIMIT" > to something like "100". Yes, but that's not the query we're talking about is it? If possible you don't do '%bar%' searches at all. If you do, you try to restrict it further or LIMIT the results. There's nothing to discuss in these cases. >>> This seems like an ivory tower restriction. Not allowing best performance >>> in a common situation vs not allowing worst performance in a not-so-common >>> situation. >> What best performance plan are you thinking of? I'm assuming we're >> talking about trailing-wildcard matches here, rather than "contains" >> style matches. > > "Trailing-wildcard" already uses B-Tree index, does it not? Yes, it searches the btree and then checks the data for visibility. I thought that was what you felt could be worked around. It appears I was wrong. > I am speaking of contains, as contains is the one that was said to > require a seqscan. I am questioning why it requires a seqscan. Well, you seemed to be suggesting you had something better in mind. At least, that was my reading of your original post. > The > claim was made that with MVCC, the index is insufficient to check > for visibility True, for PG's implementation of MVCC. You *could* have visibility in each index, but that obviously would take more space. For a table with many indexes, that could be a *lot* more space. You also have to update all that visibilty information too. > and that the table would need to be accessed anyways, > therefore a seqscan is required. I question whether a like '%bar%' > should be considered a high selectivity query in the general case. > I question whether a worst case should be assumed. Well, the general rule-of-thumb is only about 10% for the changeover between index & seq-scan. That is, once you are reading 10% of the rows on disk (to check visibility) you might as well read them all (since you'll be reading most of the blocks anyway if the rows are randomly distributed). If you are doing SELECT * from that table then you'll want all that data you read. If you are doing SELECT count(*) then you only wanted the visibility :-( Now you and I can look at a substring and probably make a good guess how common it is (assuming we know the targets are British surnames or Japanese towns). PG needs one number - or rather, it picks one number for each length of search-string (afaik). > Perhaps I question too much? :-) Not sure it's possible to question too much :-) However, you need to provide answers occasionally too - what numbers would you pick? :-) -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| PFC wrote: >> None of which address the question of what plan PG should produce for: >> SELECT * FROM bigtable WHERE foo LIKE 's%' > > Ah, this one already uses the btree since the '%' is at the end. > My point is that a search like this will yield too many results to > be useful to the user anyway, so optimizing its performance is a kind of > red herring. At the *application level* yes. At the *query planner* level no. At the query planner level I just want it to come up with the best plan it can. The original argument was that PG's estimate of the number of matching rows was too optimistic (or pessimistic) in the case where we are doing a contains substring-search. -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| "Richard Huxton" <dev@archonet.com> writes: > Now you and I can look at a substring and probably make a good guess how common > it is (assuming we know the targets are British surnames or Japanese towns). PG > needs one number - or rather, it picks one number for each length of > search-string (afaik). I don't think that's true. Postgres calculates the lower and upper bound implied by the search pattern and then uses the histogram to estimate how selective that range is. It's sometimes surprisingly good but obviously it's not perfect. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| Gregory Stark wrote: > "Richard Huxton" <dev@archonet.com> writes: > >> Now you and I can look at a substring and probably make a good guess how common >> it is (assuming we know the targets are British surnames or Japanese towns). PG >> needs one number - or rather, it picks one number for each length of >> search-string (afaik). > > I don't think that's true. Postgres calculates the lower and upper bound > implied by the search pattern and then uses the histogram to estimate how > selective that range is. It's sometimes surprisingly good but obviously it's > not perfect. Sorry - I'm obviously picking my words badly today. I meant for the "contains" substring match. It gives different (goes away and checks...yes) predictions based on string length. So it guesses that LIKE '%aaa%' will match more than LIKE '%aaaa%'. Of course, if we were matching surnames you and I could say that this is very unlikely, but without some big statistics table I guess there's not much more PG can do. For a trailing wildcard LIKE 'aaa%' it can and does as you say convert this into something along the lines of (>= 'aaa' AND < 'aab'). Although IIRC that depends if your locale allows such (not sure, I don't really use non-C/non-English locales enough). -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| mark@mark.mielke.cc wrote: > What is a real life example where an intelligent and researched > database application would issue a like or ilike query as their > primary condition in a situation where they expected very high > selectivity? > In my case the canonical example is to search against textual keys where the search is performed automatically if the user hs typed enough data and paused. In almost all cases the '%' trails, and I'm looking for 'starts with' in effect. usually the search will have a specified upper number of returned rows, if that's an available facility. I realise in this case that matching against the index does not allow the match count unless we check MVCC as we go, but I don't see why another thread can't be doing that. James ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| ||||
| On Wed, Jun 06, 2007 at 11:23:13PM +0100, James Mansion wrote: > mark@mark.mielke.cc wrote: > >What is a real life example where an intelligent and researched > >database application would issue a like or ilike query as their > >primary condition in a situation where they expected very high > >selectivity? > In my case the canonical example is to search against textual keys > where the search is performed automatically if the user hs typed > enough data and paused. In almost all cases the '%' trails, and I'm > looking for 'starts with' in effect. usually the search will have a > specified upper number of returned rows, if that's an available > facility. I realise in this case that matching against the index > does not allow the match count unless we check MVCC as we go, but I > don't see why another thread can't be doing that. I believe PostgreSQL already considers using the index for "starts with", so this wasn't part of the discussion for me. Sorry that this wasn't clear. Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ .. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |