Unix Technical Forum

Three-byte Unicode characters

This is a discussion on Three-byte Unicode characters within the pgsql Hackers forums, part of the PostgreSQL category.


04-11-2008, 04:22 AM
Bruce Momjian


[ This email to hackers from last night got lost so I am remailing.]

Tom Lane wrote:
> "John Hansen" <john@geeknet.com.au> writes:
> >> That is backpatched to 8.0.X. Does that not fix the problem reported?
>
> > No, as andrew said, what this patch does, is allow values > 0xffff and
> > at the same time validates the input to make sure it's valid utf8.
>
> The impression I get is that most of the 'Unicode characters above
> 0x10000' reports we've seen did not come from people who actually needed
> more-than-16-bit Unicode codepoints, but from people who had screwed up
> their encoding settings and were trying to tell the backend that Latin1
> was Unicode or some such. So I'm a bit worried that extending the
> backend support to full 32-bit Unicode will do more to mask encoding
> mistakes than it will do to create needed functionality.
>
> Not that I'm against adding the functionality. I'm just doubtful that
> the reports we've seen really indicate that we need it, or that adding
> it will cut down on the incidence of complaints :-(
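
For illustration only (this is not from the original message, and it is not PostgreSQL code): a minimal Python sketch of one plausible way Latin1 data labelled as Unicode can look like a character above 0xffff. In Latin1 the byte 0xF1 is "ñ", but read as UTF-8 a byte in the range 0xF0-0xF4 announces a four-byte sequence, which is the form used only for code points above 0xffff. The helper names `claimed_sequence_length` and `is_valid_utf8` are hypothetical.

```python
def claimed_sequence_length(lead: int) -> int:
    """Length that a UTF-8 lead byte announces, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1          # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4          # only code points above 0xFFFF use this form
    return 0              # continuation byte or never-valid lead byte


def is_valid_utf8(data: bytes) -> bool:
    """Strict check using Python's own UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "manana" with n-tilde (U+00F1), encoded as Latin1: the n-tilde becomes byte 0xF1.
latin1_bytes = "ma\u00f1ana".encode("latin-1")

print(claimed_sequence_length(latin1_bytes[2]))        # 4
print(is_valid_utf8(latin1_bytes))                     # False
print(is_valid_utf8("ma\u00f1ana".encode("utf-8")))    # True
```

Under this assumption, strict validation still rejects the mislabelled Latin1 bytes, because the bytes following 0xF1 are not continuation bytes.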


OK, I got on the IRC server and talked to folks who actually understand
this. They say there are Chinese who are reporting this problem, so I
Googled and found this:

http://www.yale.edu/chinesemac/pages...g.html#Unicode

See the paragraph with "Supplementary Ideographic Plane". You will see
that paragraph says:

    The Supplementary Ideographic Plane (SIP) currently contains 42,711
    additional characters in "CJK Unified Ideographs Extension B"
    (U+20000-2A6D6). The PDF chart for this is available at:
    http://www.unicode.org/charts/PDF/U20000.pdf

I assume it is that U+20000-2A6D6 range that people are complaining
about.
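
To make that range concrete, here is a minimal Python sketch (an illustration, not PostgreSQL code; the helper name `utf8_length` is hypothetical). It shows that the first code point of the range, U+20000, lies above 0xFFFF and needs four bytes in UTF-8, so any check limited to 16-bit code points would reject it.

```python
def utf8_length(codepoint: int) -> int:
    """Number of bytes UTF-8 uses for a code point.

    Note: this sketch does not reject the surrogate range U+D800-U+DFFF.
    """
    if codepoint < 0:
        raise ValueError("negative code point")
    if codepoint < 0x80:
        return 1
    if codepoint < 0x800:
        return 2
    if codepoint < 0x10000:
        return 3
    if codepoint <= 0x10FFFF:
        return 4
    raise ValueError("beyond the Unicode range")


first_ext_b = 0x20000                        # start of CJK Extension B
encoded = chr(first_ext_b).encode("utf-8")   # Python's built-in encoder

print(first_ext_b > 0xFFFF)       # True
print(utf8_length(first_ext_b))   # 4
print(encoded.hex())              # f0a08080
```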

So, we do have a bug, and we are probably going to need to fix it in
8.0.X.

I apologize to the people who reported this problem; I wasn't attentive
to the seriousness of it.

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

04-11-2008, 04:22 AM
Peter Eisentraut

Bruce Momjian wrote:
> So, we do have a bug, and we are probably going to need to fix it in
> 8.0.X.


This has never worked in all the years we have had Unicode
functionality, so I don't understand why we have to rush to fix it now.
Certainly, it ought to be fixed, but not in a minor release.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

04-11-2008, 04:22 AM
Tom Lane

Peter Eisentraut <peter_e@gmx.net> writes:
> Bruce Momjian wrote:
>> So, we do have a bug, and we are probably going to need to fix it in
>> 8.0.X.


> This has never worked in all the years we have had Unicode
> functionality, so I don't understand why we have to rush to fix it now.
> Certainly, it ought to be fixed, but not in a minor release.


The reasons why we rejected applying John's patch at the tail end
of the 8.0 cycle are still valid: it is a new feature and there
is nontrivial risk of introducing new bugs (more specifically,
exposing bits of the system that aren't prepared for more-than-16-bit
characters).
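
As a purely hypothetical illustration of that risk (it does not describe any specific PostgreSQL code path), the following Python sketch shows what happens if some component keeps a code point in a 16-bit field: characters up to 0xFFFF survive, while characters above 0xFFFF are silently changed. The function name `store_as_uint16` is an assumption made for this example.

```python
def store_as_uint16(codepoint: int) -> int:
    """Mimic hypothetical code that keeps only the low 16 bits of a code point."""
    return codepoint & 0xFFFF


bmp_char = 0x4E2D     # a CJK ideograph that fits in 16 bits
sip_char = 0x20000    # first code point of CJK Extension B

print(hex(store_as_uint16(bmp_char)))   # 0x4e2d (unchanged)
print(hex(store_as_uint16(sip_char)))   # 0x0 (collides with NUL)
```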

I'm fine with changing it in the 8.1 cycle, but I think a back-patch
would be folly.

regards, tom lane

04-11-2008, 04:22 AM
Marc G. Fournier

On Sun, 10 Apr 2005, Peter Eisentraut wrote:

> Bruce Momjian wrote:
>> So, we do have a bug, and we are probably going to need to fix it in
>> 8.0.X.
>
> This has never worked in all the years we have had Unicode
> functionality, so I don't understand why we have to rush to fix it now.
> Certainly, it ought to be fixed, but not in a minor release.


Agreed ... this is extending an existing feature to include a broader
charset, not fixing a bug ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
