This is a discussion on regular expressions stranges within the pgsql Hackers forums, part of the PostgreSQL category.
Regexp works differently with non-ASCII characters depending on the server encoding (bug.sql contains non-ASCII characters):

% initdb -E KOI8-R --locale ru_RU.KOI8-R
% psql postgres < bug.sql
 true
------
 t
(1 row)

 true | true
------+------
 t    | t
(1 row)

% initdb -E UTF8 --locale ru_RU.UTF-8
% psql postgres < bug.sql
 true
------
 f
(1 row)

 true | true
------+------
 f    | t
(1 row)

As far as I can see, this is because isalpha (and the other is* functions), tolower and toupper are used instead of the isw* and tow* functions. Is there any reason to use them? If not, I can modify regc_locale.c similarly to the tsearch2 locale part.

--
Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/

bug.sql:

set client_encoding='KOI8';
SELECT 'д' ~* '[[:alpha:]]' as "true";
SELECT 'Дорога' ~* 'дорога' as "true", 'дорога' ~* 'дорога' as "true";

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
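One way to read the two results in the report: 'д' is the single byte 0xC4 in KOI8-R, but the code point U+0434 (decimal 1076) in Unicode, so a check that only consults isalpha() for values up to UCHAR_MAX can never match the latter. The sketch below assumes such a byte-range-only check. `sketch_wchar` and `narrow_only_isalpha` are hypothetical names for illustration, not PostgreSQL's, and no locale is set, so the default "C" locale applies.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Hypothetical stand-in for a wide character value such as pg_wchar. */
typedef unsigned int sketch_wchar;

/*
 * Byte-range-only classification: values above UCHAR_MAX are never
 * reported as alphabetic, whatever the locale says about them.
 * Returns 1 or 0.
 */
static int
narrow_only_isalpha(sketch_wchar c)
{
    return c <= UCHAR_MAX && isalpha((unsigned char) c);
}

/* Self-check under the default "C" locale (setlocale is never called). */
static void
narrow_only_selfcheck(void)
{
    /* ASCII letters are handled by isalpha() as usual. */
    assert(narrow_only_isalpha('a'));
    /* U+0434 (Cyrillic 'д') is above UCHAR_MAX, so it is always rejected. */
    assert(!narrow_only_isalpha(0x0434));
}
```

The KOI8-R byte 0xC4 takes the isalpha() path, so its result depends on the process locale. That would be consistent with the KOI8-R database returning t while the UTF8 database returns f.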
Teodor Sigaev <teodor@sigaev.ru> writes:
> As I can see, that is because of using isalpha (and other is*), tolower &
> toupper instead of isw* and tow* functions. Is any reason to use them? If not, I
> can modify regc_locale.c similarly to tsearch2 locale part.

The regex code is working with pg_wchar strings, which aren't necessarily the same representation that the OS' wide-char functions expect. If we could guarantee compatibility then the above plan would make sense ...

regards, tom lane
> The regex code is working with pg_wchar strings, which aren't
> necessarily the same representation that the OS' wide-char functions
> expect. If we could guarantee compatibility then the above plan
> would make sense ...

It seems to me that this is possible for the UTF8 encoding. So the isalpha() function may be defined as:

    static int
    pg_wc_isalpha(pg_wchar c)
    {
        /* Values that fit in one byte: use the narrow-character test. */
        if (c >= 0 && c <= UCHAR_MAX)
            return isalpha((unsigned char) c);
    #ifdef HAVE_WCSTOMBS
        /* Wider values: use the wide-character test, UTF8 databases only. */
        else if (GetDatabaseEncoding() == PG_UTF8)
            return iswalpha((wint_t) c);
    #endif
        return 0;
    }

--
Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
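The proposal hinges on what numeric value a UTF8 character has once it is widened. As a rough illustration, the sketch below decodes the first UTF-8 sequence of a string into its Unicode code point. It is not PostgreSQL's own conversion routine. `utf8_first_codepoint` is a hypothetical helper, and it is simplified: overlong forms and surrogates are not rejected.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode the first UTF-8 sequence (1 to 4 bytes) of a NUL-terminated
 * string. Returns the code point, or -1 if the sequence is malformed
 * or truncated.
 */
static long
utf8_first_codepoint(const char *str)
{
    const unsigned char *s = (const unsigned char *) str;
    long        cp;
    int         extra;      /* number of continuation bytes expected */
    int         i;

    assert(str != NULL);

    if (s[0] < 0x80)
        return s[0];        /* plain ASCII, one byte */
    else if ((s[0] & 0xE0) == 0xC0)
    {
        cp = s[0] & 0x1F;
        extra = 1;
    }
    else if ((s[0] & 0xF0) == 0xE0)
    {
        cp = s[0] & 0x0F;
        extra = 2;
    }
    else if ((s[0] & 0xF8) == 0xF0)
    {
        cp = s[0] & 0x07;
        extra = 3;
    }
    else
        return -1;          /* stray continuation or invalid lead byte */

    for (i = 1; i <= extra; i++)
    {
        /* A NUL terminator fails this test, so truncation is caught here. */
        if ((s[i] & 0xC0) != 0x80)
            return -1;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}
```

For example, the two bytes 0xD0 0xB4 (the UTF-8 form of 'д') decode to 0x0434. Passing that value to iswalpha() is only meaningful if the platform's wchar_t uses the same numbering.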
Teodor Sigaev <teodor@sigaev.ru> writes:
>> The regex code is working with pg_wchar strings, which aren't
>> necessarily the same representation that the OS' wide-char functions
>> expect. If we could guarantee compatibility then the above plan
>> would make sense ...

> it seems to me, that is possible for UTF8 encoding.

Why? The one thing that a wchar certainly is not is UTF8. It might be that the <wctype.h> functions are expecting UTF16 or UTF32, but we don't know which, and really we can hardly even be sure they're expecting Unicode at all.

regards, tom lane
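The uncertainty described here can be partly probed at compile time. Standard C defines the macro `__STDC_ISO_10646__` on implementations whose wchar_t values are ISO 10646 (Unicode) code points, and `WCHAR_MAX` gives the largest value wchar_t can hold. The sketch below only calls iswalpha() when that guarantee is present. The function names are hypothetical, and this is one possible guard under those assumptions, not what the thread settled on.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* 1 if the implementation promises that wchar_t holds ISO 10646 values. */
static int
wchar_values_are_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* 1 if the code point fits in wchar_t (not true for all of Unicode
 * on platforms where wchar_t is 16 bits wide). */
static int
wchar_can_hold(unsigned long cp)
{
    return cp <= (unsigned long) WCHAR_MAX;
}

/*
 * Classify a Unicode code point with iswalpha(), but only when the
 * platform guarantees a compatible wchar_t. Returns 1 or 0 for a
 * usable answer, and -1 when no answer can be trusted.
 */
static int
iswalpha_if_unicode(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);   /* must be within the Unicode range */

    if (!wchar_values_are_iso10646() || !wchar_can_hold(cp))
        return -1;
    return iswalpha((wint_t) cp) != 0;
}
```

Even when the guard passes, the answer for non-ASCII code points still depends on the current LC_CTYPE locale, so a caller would have to ensure the locale matches the database's.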