Unix Technical Forum

tsearch2 and hyphenated terms

#1  04-12-2008, 02:05 AM  Reece Hart

I'd like to use tsearch2 to index protein and gene names. Unfortunately,
such names are written inconsistently, sometimes with hyphens. For
example, MCL-1 and MCL1 are semantically equivalent, but with the default
parser and to_tsvector I see this:

unison@u8.3=> select to_tsvector('MCL1 MCL-1');
       to_tsvector
-------------------------
 '-1':3 'mcl':2 'mcl1':1

For the purposes of indexing these names, I suspect I'd get the majority
of cases by removing a hyphen when it's followed by 1 or 2 chars from
[a-zA-Z0-9]. Does that require a custom parser?
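
A minimal sketch of that rule, assuming the text is normalized before it ever reaches to_tsvector rather than inside the parser. It is written in Python purely for illustration; the function name strip_short_suffix_hyphens and the exact pattern are assumptions, not part of tsearch2:

```python
import re

# Hypothetical rule: drop a hyphen that sits between an alphanumeric
# character and a final run of 1 or 2 alphanumerics that ends the word.
_SHORT_SUFFIX_HYPHEN = re.compile(
    r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9]{1,2}(?![A-Za-z0-9]))"
)


def strip_short_suffix_hyphens(text: str) -> str:
    """Return text with hyphens removed before a 1- or 2-character suffix."""
    return _SHORT_SUFFIX_HYPHEN.sub("", text)


if __name__ == "__main__":
    for sample in ("MCL1 MCL-1", "bcl-w bcl-2", "well-known"):
        print(sample, "->", strip_short_suffix_hyphens(sample))
```

For searches to match, the same normalization would have to be applied to query strings as well as to the indexed text.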

Thanks,
Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2  04-12-2008, 02:05 AM  Tom Lane

Reece Hart <reece@harts.net> writes:
> For the purposes of indexing these names, I suspect I'd get the majority
> of cases by removing a hyphen when it's followed by 1 or 2 chars from
> [a-zA-Z0-9]. Does that require a custom parser?


Yeah, looks like it:

regression=# select * from ts_debug('MCL1 MCL-1');
   alias   |       description        | token |  dictionaries  |  dictionary  | lexemes
-----------+--------------------------+-------+----------------+--------------+---------
 numword   | Word, letters and digits | MCL1  | {simple}       | simple       | {mcl1}
 blank     | Space symbols            |       | {}             |              |
 asciiword | Word, all ASCII          | MCL   | {english_stem} | english_stem | {mcl}
 int       | Signed integer           | -1    | {simple}       | simple       | {-1}
(4 rows)

I had thought you might get a "numhword" output, but that only seems to
happen if there's at least one letter after the dash:

regression=# select * from ts_debug('MCL1 MCL-X1');
      alias      |               description                | token  |  dictionaries  |  dictionary  | lexemes
-----------------+------------------------------------------+--------+----------------+--------------+----------
 numword         | Word, letters and digits                 | MCL1   | {simple}       | simple       | {mcl1}
 blank           | Space symbols                            |        | {}             |              |
 numhword        | Hyphenated word, letters and digits      | MCL-X1 | {simple}       | simple       | {mcl-x1}
 hword_asciipart | Hyphenated word part, all ASCII          | MCL    | {english_stem} | english_stem | {mcl}
 blank           | Space symbols                            | -      | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | X1     | {simple}       | simple       | {x1}
(6 rows)

regards, tom lane

#3  04-12-2008, 02:05 AM  Oleg Bartunov

We have the same problem with names in astronomy, so we implemented
dict_regex: http://vo.astronet.ru/arxiv/dict_regex.html
Check it out!

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#4  04-12-2008, 02:05 AM  Reece Hart

On Fri, 2008-04-11 at 22:07 +0400, Oleg Bartunov wrote:
> We have the same problem with names in astronomy, so we implemented
> dict_regex http://vo.astronet.ru/arxiv/dict_regex.html
> Check it out !


Oleg-

This gets me a lot closer. Thank you. I have two remaining problems.


The first problem is that 'bcl-w' and 'bcl-2' are parsed differently,
like so:

unison@u8.3=> select * from ts_debug('english','bcl-w');
      alias      |           description           | token |  dictionaries  |  dictionary  | lexemes
-----------------+---------------------------------+-------+----------------+--------------+---------
 asciihword      | Hyphenated word, all ASCII      | bcl-w | {english_stem} | english_stem | {bcl-w}
 hword_asciipart | Hyphenated word part, all ASCII | bcl   | {english_stem} | english_stem | {bcl}
 blank           | Space symbols                   | -     | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | w     | {english_stem} | english_stem | {w}

unison@u8.3=> select * from ts_debug('english','bcl-2');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | bcl   | {english_stem} | english_stem | {bcl}
 int       | Signed integer  | -2    | {simple}       | simple       | {-2}

One option would be to write a new parser, or to modify wparser_def.c, so
that the InHyphenWordFirst state accepts p_isdigit or p_isalnum on the
first character after the hyphen (I think I have this right). This would
match Tom's initial expectation that Bcl-2 might be parsed as a numhword,
and (to me) it seems more consistent with the asciihword class.

Perhaps a more broadly useful modification would be for the lexer to also
emit whitespace-delimited tokens (period). asciihword almost does the
trick, but it too requires an alphabetic character after the hyphen.
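
To make the whitespace-delimited idea concrete, here is a rough Python sketch of the extra tokens such a lexer might emit alongside its usual output. This is only an illustration of the proposal, not how the tsearch2 parser works; the function name whitespace_tokens is hypothetical, and lowercasing is an assumption that mirrors the lexemes shown above:

```python
from typing import List, Tuple


def whitespace_tokens(text: str) -> List[Tuple[int, str]]:
    """Return (position, token) pairs for whitespace-delimited tokens.

    Positions start at 1. Tokens are lowercased (an assumption) and
    hyphens are left untouched, so 'Bcl-2' stays one token.
    """
    return [(pos, tok.lower()) for pos, tok in enumerate(text.split(), start=1)]


if __name__ == "__main__":
    print(whitespace_tokens("Bcl-2 binds bcl-w"))
```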



The second problem is with quantifiers in PCRE regexps. I initially
set up dict_regex with a conf line like
(\w+)-(\w{1,2}) $1$2
That rule does not work for me, although simpler expressions do
(e.g., (bcl)-(\w)). I think it must be related to the README caveat
regarding PCRE partial matching mode, which I didn't understand initially.

However, I don't see how to write a general regexp like
the one I initially tried. Do you have any suggestions?
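
For reference, this is what the rule is meant to do when applied to one complete token at a time. The sketch uses Python's re module as a stand-in for PCRE; re has no partial matching mode, so it says nothing about how dict_regex itself behaves. The name rewrite_token is hypothetical, and lowercasing the result is an assumption:

```python
import re
from typing import Optional

# Same pattern as the conf line above, applied with ordinary (complete)
# matching. Python's re is a stand-in for PCRE and has no partial mode.
RULE = re.compile(r"(\w+)-(\w{1,2})")


def rewrite_token(token: str) -> Optional[str]:
    """Join the two captured groups if the whole token matches the rule.

    Returns None when the rule does not apply to the token.
    """
    match = RULE.fullmatch(token)
    if match is None:
        return None
    # Lowercasing is an assumption, mirroring the lexemes shown earlier.
    return (match.group(1) + match.group(2)).lower()


if __name__ == "__main__":
    for tok in ("bcl-2", "bcl-w", "MCL-X1", "well-known", "MCL1"):
        print(tok, "->", rewrite_token(tok))
```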


Thanks again. I'm very impressed with tsearch2.

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0

