Re: Unicode problems on IRC (pgsql Hackers forum, PostgreSQL category)
>On 2005-04-10, Tom Lane <tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot ) us> wrote:
>> Andrew - Supernews <andrew+nonews ( at ) supernews ( dot ) com> writes:
>>> I think you will find that this impression is actually false. Or that at
>>> the very least, _correct_ verification of UTF-8 sequences will still
>>> catch essentially all cases of non-utf-8 input mislabelled as utf-8
>>> while allowing the full range of Unicode codepoints.
>>
>> Yeah? Cool. Does John's proposed patch do it "correctly"?
>>
>> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html
>
>It looks correct to me. The only thing I think that code will let through
>incorrectly are encoded surrogates; those could be fixed by adding one line:
>
>   switch (*source) {
>       /* no fall-through in this inner switch */
>       case 0xE0: if (a < 0xA0) return false; break;
>+      case 0xED: if (a > 0x9F) return false; break;
>       case 0xF0: if (a < 0x90) return false; break;
>       case 0xF4: if (a > 0x8F) return false; break;
>

That's right, dunno how I missed that one, but it looks correct to me, and
is in line with the code in ConvertUTF.c from unicode.org, on which I
based the patch, extended to support 6 byte utf8 characters.

>(Accepting encoded surrogates in utf-8 was always forbidden by most
>specifications that used utf-8, though the Unicode specs originally were
>not absolute about it (but forbade generating them). Current Unicode
>specifications define those sequences as malformed. Surrogates are the
>code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
>characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
>such characters are encoded directly rather than via surrogate pairs.)
>
>--
>Andrew, Supernews
>http://www.supernews.com - individual and corporate NNTP services

....

John

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

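For readers following along, the second-byte check discussed above can be sketched as a standalone function. This is only an illustration, not the code from John's patch or from ConvertUTF.c: the name `utf8_second_byte_ok` and its signature are made up for this example. The per-lead-byte ranges are the ones in the quoted switch (including the proposed 0xED line); the generic 0x80..0xBF continuation range is standard UTF-8.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not from the patch): given the lead byte of a
 * multi-byte UTF-8 sequence and the byte that follows it, report whether
 * that second byte is in the range allowed for this lead byte.
 */
bool utf8_second_byte_ok(unsigned char lead, unsigned char a)
{
    /* Every continuation byte must look like 10xxxxxx, i.e. 0x80..0xBF. */
    if (a < 0x80 || a > 0xBF)
        return false;

    switch (lead) {
        /* no fall-through in this switch */
        case 0xE0: return a >= 0xA0;  /* rejects overlong 3-byte forms */
        case 0xED: return a <= 0x9F;  /* rejects surrogates 0xD800 - 0xDFFF */
        case 0xF0: return a >= 0x90;  /* rejects overlong 4-byte forms */
        case 0xF4: return a <= 0x8F;  /* rejects values above 0x10FFFF */
        default:   return true;
    }
}
```

The 0xED case is the one that matters for surrogates: the bytes ED A0 80 would decode to 0xD800, the first surrogate, while ED 9F BF decodes to 0xD7FF, the last code point below the surrogate range.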
On 2005-04-10, "John Hansen" <john@geeknet.com.au> wrote:
> That's right, dunno how I missed that one, but looks correct to me, and
> is in line with the code in ConvertUTF.c from unicode.org, on which I
> based the patch, extended to support 6 byte utf8 characters.

Frankly, you should probably de-extend it back down to 4 bytes. That's
enough to encode the Unicode range of 0x000000 - 0x10FFFF, and enough
other stuff would break if anyone allocated a character outside that range
that I don't think it is worth worrying about. (Even the ISO people have
agreed to conform to that limitation.) Even if insanity struck
simultaneously at both standards bodies, 4 bytes is enough to go to
0x1FFFFF so there is still substantial slack.

(A number of other specifications based on utf-8 have removed the 5 and 6
byte sequences too, so there is substantial precedent for this.)

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
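The 0x1FFFFF figure follows from the UTF-8 bit layout: a 4-byte sequence has 3 payload bits in the lead byte and 6 in each of the three continuation bytes, 21 bits in total. The sketch below works that out for any length and shows what a lead-byte classifier limited to 4 bytes might look like. Both function names are invented for this illustration and are not taken from the patch. Rejecting 0xC0, 0xC1 and 0xF5 and above as lead bytes is an assumption that matches current UTF-8 definitions such as RFC 3629, not something stated in the thread.

```c
#include <stdint.h>

/*
 * Largest value an n-byte sequence can carry under the original UTF-8
 * bit layout (n = 1..6). Returns 0 for any other n.
 * Hypothetical helper for illustration only.
 */
uint32_t utf8_max_for_length(int n)
{
    if (n == 1)
        return 0x7F;                 /* 0xxxxxxx: 7 payload bits */
    if (n < 2 || n > 6)
        return 0;

    /* The lead byte carries (7 - n) bits; each continuation byte carries 6. */
    int bits = (7 - n) + 6 * (n - 1);
    return (UINT32_C(1) << bits) - 1;
}

/*
 * Sequence length implied by a lead byte when sequences are limited to
 * 4 bytes. Returns 0 for bytes that cannot start a sequence.
 */
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)                 return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* 0xC0, 0xC1 are overlong */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* 0xF5+ exceeds 0x10FFFF */
    return 0;                        /* continuation byte or 5/6-byte lead */
}
```

With these definitions, `utf8_max_for_length(4)` gives 0x1FFFFF, comfortably above the 0x10FFFF Unicode limit, while the 5 and 6 byte lead bytes (0xF8 and up) are simply treated as invalid.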