Unix Technical Forum

utf8 issue

This is a discussion on utf8 issue within the Pgsql General forums, part of the PostgreSQL category; --> Hello everybody. I recently converted my db from ASCII encoding to UTF8 (we have a lot of spanish-speaking members, ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql General

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-10-2008, 12:20 AM
Tom Hart
 
Posts: n/a
Default utf8 issue

Hello everybody. I recently converted my db from ASCII encoding to UTF8
(we have a lot of spanish-speaking members, and need the extra character
support). Everything was working great, but I noticed this error, while
trying to COPY one of our tables from a csv.

ERROR: invalid byte sequence for encoding "UTF8": 0xb9
SQL state: 22021
Hint: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
Context: COPY transaction_import, line 59358

Is there anything I can do to look for these sorts of errors and fix
them? I'm really not great with character encodings and I'm not sure
where to go on this. Any help?

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-10-2008, 12:20 AM
Richard Huxton
 
Posts: n/a
Default Re: utf8 issue

Tom Hart wrote:
> Hello everybody. I recently converted my db from ASCII encoding to UTF8
> (we have a lot of spanish-speaking members, and need the extra character
> support). Everything was working great, but I noticed this error, while
> trying to COPY one of our tables from a csv.
>
> ERROR: invalid byte sequence for encoding "UTF8": 0xb9


Not converted, I fear.

Why not use latin9 rather than utf8 - that should cover everything and
is probably what it's in anyway.

> Is there anything I can do to look for these sorts of errors and fix
> them? I'm really not great with character encodings and I'm not sure
> where to go on this. Any help?


Google a bit for iconv (and postgresql) to get your characters valid.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-10-2008, 12:20 AM
Alvaro Herrera
 
Posts: n/a
Default Re: utf8 issue

Tom Hart wrote:
> Hello everybody. I recently converted my db from ASCII encoding to UTF8
> (we have a lot of spanish-speaking members, and need the extra character
> support). Everything was working great, but I noticed this error, while
> trying to COPY one of our tables from a csv.
>
> ERROR: invalid byte sequence for encoding "UTF8": 0xb9
> SQL state: 22021
> Hint: This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by
> "client_encoding".
> Context: COPY transaction_import, line 59358


Did you try setting client_encoding beforehand? I think a simple

PGOPTIONS="client_encoding=latin9" pg_restore ...

should suffice.

Now, if you have mixed UTF8 and Latin1/Latin9 data, you are hosed
(meaning you'll have to wade through the stuff and figure out what data
is in what encoding, and fix it).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-10-2008, 12:20 AM
Tom Hart
 
Posts: n/a
Default Re: utf8 issue

Richard Huxton wrote:
> Tom Hart wrote:
>> Hello everybody. I recently converted my db from ASCII encoding to
>> UTF8 (we have a lot of spanish-speaking members, and need the extra
>> character support). Everything was working great, but I noticed this
>> error, while trying to COPY one of our tables from a csv.
>>
>> ERROR: invalid byte sequence for encoding "UTF8": 0xb9

>
> Not converted, I fear.
>
> Why not use latin9 rather than utf8 - that should cover everything and
> is probably what it's in anyway.

I was specifically requested to use UTF-8, and the work's already been
done, so converting again is not my favorite option.
>
>> Is there anything I can do to look for these sorts of errors and fix
>> them? I'm really not great with character encodings and I'm not sure
>> where to go on this. Any help?

>
> Google a bit for iconv (and postgresql) to get your characters valid.
>

I already have a php script that does some data scrubbing before the
copy. I added this line to the script and things seem to be working
better now

$line = iconv("ISO-8859-1", "UTF-8", $line);

Thanks for the help guys :-)

--
Tom Hart
IT Specialist
Cooperative Federal
723 Westcott St.
Syracuse, NY 13210
(315) 471-1116 ext. 202
(315) 476-0567 (fax)


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-10-2008, 12:20 AM
Dean Gibson
 
Posts: n/a
Default Re: utf8 issue

On 2008-02-26 13:04, Tom Hart wrote:
>>

> I already have a php script that does some data scrubbing before the
> copy. I added this line to the script and things seem to be working
> better now
>
> $line = iconv("ISO-8859-1", "UTF-8", $line);
>
> Thanks for the help guys :-)
>


Read up on the difference between PostgreSQL's server_encoding and
client_encoding.

The "server_encoding" is how the data is stored in the server, and can
be anything compatible (UTF-8, ISO-8859-1, whatever will hold your
character set).

The "client_encoding" is how the incoming (or outgoing) data is
treated/assumed. PostgreSQL does the necessary conversion for you.

You can set/change the "client_encoding" in so many ways, it gives you
total flexibility, in order of increasing priority:

1. You can set it as the default for any database (see ALTER DATABASE ...).
2. You can set it in an environment variable, which means the client
utilities (and I believe the libraries) use that.
3. In PSQL, you can set it with the "\encoding" statement (which applies
to the session or until changed), or the "SET [SESSION | LOCAL ]
client_encoding TO ...", which will set it for the session or just the
current transaction.

I just went through this, and while I initially used "iconv" to get up
and running, I've removed most of those in my scripts and just use the
PostgreSQL conversion instead.

--
Mail to my list address MUST be sent via the mailing list.
All other mail to my list address will bounce.


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 10:19 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com