Unix Technical Forum

Re: Regexps - never completing join.

This is a discussion on Re: Regexps - never completing join. within the Pgsql Performance forums, part of the PostgreSQL category; --> On May 13, 2008, at 11:45 PM, Rusty Conover wrote: > Hi Guys, > > I'm using postgresql 8.3.1 ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 05-16-2008, 02:44 PM
Rusty Conover
 
Posts: n/a
Default Re: Regexps - never completing join.




On May 13, 2008, at 11:45 PM, Rusty Conover wrote:

> Hi Guys,
>
> I'm using postgresql 8.3.1 and I'm seeing weird behavior between
> what I expect and what's happening when the query is executed
>
> I'm trying to match a table that contains regexps against another
> table that is full of the text to match against so my query is:
>
> select wc_rule.id from classifications, wc_rule where
> classifications.classification ~* wc_rule.regexp;
>
> When I run that the query takes a very very long time (never ending
> so far 20 minutes or so) to execute.
>
> But if I loop through all of the rules and a query for each rule:
>
> select wc_rule.id from classifications, wc_rule where
> classifications.classification ~* wc_rule.regexp and wc_rule.id = ?
>
> All of the rules when run individually can be matched in a little
> under then 3 minutes. I'd assume postgres would be equal to or
> faster with the single row execution method.
>
> The table schema:
>
> CREATE TABLE wc_rule (
> id integer NOT NULL,
> regexp text,
> );
>
> CREATE TABLE classifications (
> id integer NOT NULL,
> classification text NOT NULL
> );
>
> gb_render_1_db=# explain select wc_rule.id from classifications,
> wc_rule where classifications.classification ~* wc_rule.regexp;
> QUERY PLAN
> -----------------------------------------------------------------------------
> Nested Loop (cost=13.71..891401.71 rows=197843 width=4)
> Join Filter: (classifications.classification ~* wc_rule.regexp)
> -> Seq Scan on classifications (cost=0.00..1093.46 rows=56446
> width=42)
> -> Materialize (cost=13.71..20.72 rows=701 width=22)
> -> Seq Scan on wc_rule (cost=0.00..13.01 rows=701 width=22)
> (5 rows)
>
>



As a followup I did some digging:

by editing:

src/backend/utils/adt/regexp.c

and increasing the cache size for regular expressions to an
arbitrarily large number

#define MAX_CACHED_RES 3200

Rather then the default of

#define MAX_CACHED_RES 32

I was able to get the query to complete in a respectable amount of time:

gb_render_1_db=# explain analyze select wc_rule.id from
classifications, wc_rule where classifications.classification ~*
wc_rule.regexp;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=13.71..891401.71 rows=197843 width=4) (actual
time=72.714..366899.913 rows=55052 loops=1)
Join Filter: (classifications.classification ~* wc_rule.regexp)
-> Seq Scan on classifications (cost=0.00..1093.46 rows=56446
width=42) (actual time=28.820..109.895 rows=56446 loops=1)
-> Materialize (cost=13.71..20.72 rows=701 width=22) (actual
time=0.000..0.193 rows=701 loops=56446)
-> Seq Scan on wc_rule (cost=0.00..13.01 rows=701
width=22) (actual time=0.030..0.593 rows=701 loops=1)
Total runtime: 366916.632 ms
(6 rows)

Which is still > 6 minutes, but at least it completed.

I'll keep digging into what is causing this bad performance.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
http://www.infogears.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 05-18-2008, 11:04 PM
Scott Marlowe
 
Posts: n/a
Default Re: Regexps - never completing join.

On Wed, May 14, 2008 at 9:33 AM, Rusty Conover <rconover@infogears.com> wrote:
> Returning to this problem this morning, I made some more insight.
>
> One way I did find that worked to control the loop (but doesn't yield the
> same results because its a left join)
>
> select wc_rule.id from wc_rule left join classifications on
> classifications.classification ~* wc_rule.regexp;


If you do that and exclude the extra rows added to the right with somthing like

and wc_rule.somefield IS NOT NULL

does it run fast and give you the same answers as the regular join?

I'm guessing that this could be optimized to use a hash agg method of
joining for text, but I'm no expert on the subject.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 05-18-2008, 11:04 PM
Rusty Conover
 
Posts: n/a
Default Re: Regexps - never completing join.


On May 16, 2008, at 2:35 PM, Scott Marlowe wrote:

> On Wed, May 14, 2008 at 9:33 AM, Rusty Conover
> <rconover@infogears.com> wrote:
>> Returning to this problem this morning, I made some more insight.
>>
>> One way I did find that worked to control the loop (but doesn't
>> yield the
>> same results because its a left join)
>>
>> select wc_rule.id from wc_rule left join classifications on
>> classifications.classification ~* wc_rule.regexp;

>
> If you do that and exclude the extra rows added to the right with
> somthing like
>
> and wc_rule.somefield IS NOT NULL
>
> does it run fast and give you the same answers as the regular join?
>
> I'm guessing that this could be optimized to use a hash agg method of
> joining for text, but I'm no expert on the subject.


Hi Scott,

It's not really a hash agg problem really just a looping inside/
outside table selection problem.

The slowdown is really the compilation of the regexp repeatedly by
RE_compile_and_cache() because the regexps are being run on the inside
of the loop rather then the outside. And since the regexp cache is
only 32 items big, the every match is resulting in a recompilation of
the regexp since I have about 700 regexps.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
http://www.infogears.com



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 05-18-2008, 11:04 PM
Scott Marlowe
 
Posts: n/a
Default Re: Regexps - never completing join.

On Fri, May 16, 2008 at 3:37 PM, Rusty Conover <rconover@infogears.com> wrote:
>
> On May 16, 2008, at 2:35 PM, Scott Marlowe wrote:
>
>> On Wed, May 14, 2008 at 9:33 AM, Rusty Conover <rconover@infogears.com>
>> wrote:
>>>
>>> Returning to this problem this morning, I made some more insight.
>>>
>>> One way I did find that worked to control the loop (but doesn't yield the
>>> same results because its a left join)
>>>
>>> select wc_rule.id from wc_rule left join classifications on
>>> classifications.classification ~* wc_rule.regexp;

>>
>> If you do that and exclude the extra rows added to the right with somthing
>> like
>>
>> and wc_rule.somefield IS NOT NULL
>>
>> does it run fast and give you the same answers as the regular join?
>>
>> I'm guessing that this could be optimized to use a hash agg method of
>> joining for text, but I'm no expert on the subject.

>
> Hi Scott,
>
> It's not really a hash agg problem really just a looping inside/outside
> table selection problem.
>
> The slowdown is really the compilation of the regexp repeatedly by
> RE_compile_and_cache() because the regexps are being run on the inside of
> the loop rather then the outside. And since the regexp cache is only 32
> items big, the every match is resulting in a recompilation of the regexp
> since I have about 700 regexps.


That's not what I meant. What I meant was it seems like a good
candidate for a hash aggregate solution. I'm pretty sure pgsql can't
use hashagg for something like this right now.

If you hashagged each regexp and each column fed through it, you could
probably get good performance. but that's a backend hacker thing, not
something I'd know how to do.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 07:23 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com