This is a discussion on Re: Regexps - never completing join. within the Pgsql Performance forums, part of the PostgreSQL category; --> On May 13, 2008, at 11:45 PM, Rusty Conover wrote: > Hi Guys, > > I'm using postgresql 8.3.1 ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| On May 13, 2008, at 11:45 PM, Rusty Conover wrote: > Hi Guys, > > I'm using postgresql 8.3.1 and I'm seeing weird behavior between > what I expect and what's happening when the query is executed > > I'm trying to match a table that contains regexps against another > table that is full of the text to match against so my query is: > > select wc_rule.id from classifications, wc_rule where > classifications.classification ~* wc_rule.regexp; > > When I run that the query takes a very very long time (never ending > so far 20 minutes or so) to execute. > > But if I loop through all of the rules and a query for each rule: > > select wc_rule.id from classifications, wc_rule where > classifications.classification ~* wc_rule.regexp and wc_rule.id = ? > > All of the rules when run individually can be matched in a little > under then 3 minutes. I'd assume postgres would be equal to or > faster with the single row execution method. > > The table schema: > > CREATE TABLE wc_rule ( > id integer NOT NULL, > regexp text, > ); > > CREATE TABLE classifications ( > id integer NOT NULL, > classification text NOT NULL > ); > > gb_render_1_db=# explain select wc_rule.id from classifications, > wc_rule where classifications.classification ~* wc_rule.regexp; > QUERY PLAN > ----------------------------------------------------------------------------- > Nested Loop (cost=13.71..891401.71 rows=197843 width=4) > Join Filter: (classifications.classification ~* wc_rule.regexp) > -> Seq Scan on classifications (cost=0.00..1093.46 rows=56446 > width=42) > -> Materialize (cost=13.71..20.72 rows=701 width=22) > -> Seq Scan on wc_rule (cost=0.00..13.01 rows=701 width=22) > (5 rows) > > As a followup I did some digging: by editing: src/backend/utils/adt/regexp.c and increasing the cache size for regular expressions to an arbitrarily large number #define MAX_CACHED_RES 3200 Rather then the default of #define MAX_CACHED_RES 32 I was able to get the query to complete in a respectable amount of time: gb_render_1_db=# explain analyze select wc_rule.id from classifications, wc_rule where classifications.classification ~* wc_rule.regexp; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Nested Loop (cost=13.71..891401.71 rows=197843 width=4) (actual time=72.714..366899.913 rows=55052 loops=1) Join Filter: (classifications.classification ~* wc_rule.regexp) -> Seq Scan on classifications (cost=0.00..1093.46 rows=56446 width=42) (actual time=28.820..109.895 rows=56446 loops=1) -> Materialize (cost=13.71..20.72 rows=701 width=22) (actual time=0.000..0.193 rows=701 loops=56446) -> Seq Scan on wc_rule (cost=0.00..13.01 rows=701 width=22) (actual time=0.030..0.593 rows=701 loops=1) Total runtime: 366916.632 ms (6 rows) Which is still > 6 minutes, but at least it completed. I'll keep digging into what is causing this bad performance. Thanks, Rusty -- Rusty Conover InfoGears Inc. http://www.infogears.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| On Wed, May 14, 2008 at 9:33 AM, Rusty Conover <rconover@infogears.com> wrote: > Returning to this problem this morning, I made some more insight. > > One way I did find that worked to control the loop (but doesn't yield the > same results because its a left join) > > select wc_rule.id from wc_rule left join classifications on > classifications.classification ~* wc_rule.regexp; If you do that and exclude the extra rows added to the right with somthing like and wc_rule.somefield IS NOT NULL does it run fast and give you the same answers as the regular join? I'm guessing that this could be optimized to use a hash agg method of joining for text, but I'm no expert on the subject. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| On May 16, 2008, at 2:35 PM, Scott Marlowe wrote: > On Wed, May 14, 2008 at 9:33 AM, Rusty Conover > <rconover@infogears.com> wrote: >> Returning to this problem this morning, I made some more insight. >> >> One way I did find that worked to control the loop (but doesn't >> yield the >> same results because its a left join) >> >> select wc_rule.id from wc_rule left join classifications on >> classifications.classification ~* wc_rule.regexp; > > If you do that and exclude the extra rows added to the right with > somthing like > > and wc_rule.somefield IS NOT NULL > > does it run fast and give you the same answers as the regular join? > > I'm guessing that this could be optimized to use a hash agg method of > joining for text, but I'm no expert on the subject. Hi Scott, It's not really a hash agg problem really just a looping inside/ outside table selection problem. The slowdown is really the compilation of the regexp repeatedly by RE_compile_and_cache() because the regexps are being run on the inside of the loop rather then the outside. And since the regexp cache is only 32 items big, the every match is resulting in a recompilation of the regexp since I have about 700 regexps. Thanks, Rusty -- Rusty Conover InfoGears Inc. http://www.infogears.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| ||||
| On Fri, May 16, 2008 at 3:37 PM, Rusty Conover <rconover@infogears.com> wrote: > > On May 16, 2008, at 2:35 PM, Scott Marlowe wrote: > >> On Wed, May 14, 2008 at 9:33 AM, Rusty Conover <rconover@infogears.com> >> wrote: >>> >>> Returning to this problem this morning, I made some more insight. >>> >>> One way I did find that worked to control the loop (but doesn't yield the >>> same results because its a left join) >>> >>> select wc_rule.id from wc_rule left join classifications on >>> classifications.classification ~* wc_rule.regexp; >> >> If you do that and exclude the extra rows added to the right with somthing >> like >> >> and wc_rule.somefield IS NOT NULL >> >> does it run fast and give you the same answers as the regular join? >> >> I'm guessing that this could be optimized to use a hash agg method of >> joining for text, but I'm no expert on the subject. > > Hi Scott, > > It's not really a hash agg problem really just a looping inside/outside > table selection problem. > > The slowdown is really the compilation of the regexp repeatedly by > RE_compile_and_cache() because the regexps are being run on the inside of > the loop rather then the outside. And since the regexp cache is only 32 > items big, the every match is resulting in a recompilation of the regexp > since I have about 700 regexps. That's not what I meant. What I meant was it seems like a good candidate for a hash aggregate solution. I'm pretty sure pgsql can't use hashagg for something like this right now. If you hashagged each regexp and each column fed through it, you could probably get good performance. but that's a backend hacker thing, not something I'd know how to do. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |