RE: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Thu Sep 12 2002 - 12:33:18 EDT

Previous message: Peter_Constable@sil.org: "Re: ISRI SoEuro has just been created!!"
In reply to: Marco Cimarosti: "RE: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1"
Next in thread: Philippe de Rochambeau: "Re: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1"
Reply: Philippe de Rochambeau: "Re: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1"

For some reason I don't see the original email, so I'm going to guess based on Marco's response below.

The code below is nearly correct, assuming the starting point is a String in which each UTF-8 byte was converted into a single char. That is, if the String contains the sequence U+00E8 U+00AA U+009E..., the code would be:

byte[] byt = myString.getBytes("ISO8859_1"); // get the original UTF-8 bytes back
String ucs2 = new String(byt, "UTF-8"); // turn them into a real UCS-2 string
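
To make the two lines above concrete, here is a minimal, self-contained sketch. It assumes a modern JDK (Java 7 or later) so that java.nio.charset.StandardCharsets can be used instead of encoding names; the class and method names are made up for illustration and are not from the original post.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Recovery {
    /** Re-reads a String in which every char holds one raw UTF-8 byte (0x00-0xFF). */
    static String recover(String oneCharPerByte) {
        // ISO-8859-1 maps U+0000..U+00FF to bytes 0x00..0xFF one-to-one,
        // so this step gets the original UTF-8 bytes back unchanged.
        byte[] raw = oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes as what they really are: UTF-8.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The six UTF-8 bytes E8 AA 9E E5 90 B9, each stored in its own char.
        String broken = "\u00E8\u00AA\u009E\u00E5\u0090\u00B9";
        String fixed = recover(broken);
        System.out.println(fixed.length());                  // prints 2
        System.out.println(fixed.equals("\u8A9E\u5439"));    // prints true
    }
}
```

This only works if nothing has already replaced a byte with "?" on the way in; once a byte has been lost, no re-decoding can bring it back.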

It is very important to name the encoding in the String constructor; otherwise the constructor assumes the JVM's default encoding (file.encoding), most of the time.

There is an annoying bug/feature in some JVMs on real Asian Windows (including 2K and XP) in which file.encoding is ignored in favor of the actual System Active code page (SYS_ACP), and setting -Dfile.encoding="someEncoding" doesn't change the String constructor's default behavior. You have to be careful to always name the encoding, not just rely on the system to provide it.
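
Here is a rough sketch of what relying on a default can do to the bytes from this thread. It assumes the implicit default was windows-1252 and names that charset explicitly as a stand-in, so the result does not depend on the machine it runs on; the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultEncodingPitfall {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes and re-encodes with one charset, as code using an implicit default would. */
    static byte[] roundTrip(byte[] bytes, Charset cs) {
        return new String(bytes, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u8A9E\u5439".getBytes(StandardCharsets.UTF_8);

        // Assumption: windows-1252 plays the role of the platform default here.
        Charset cp1252 = Charset.forName("windows-1252");

        // Byte 0x90 has no windows-1252 character, so it comes back as '?' (0x3F).
        System.out.println(hex(roundTrip(utf8, cp1252)));                  // prints E8 AA 9E E5 3F B9
        // Naming the real encoding keeps every byte.
        System.out.println(hex(roundTrip(utf8, StandardCharsets.UTF_8)));  // prints E8 AA 9E E5 90 B9
        // What this JVM would use if no encoding were named.
        System.out.println(Charset.defaultCharset());
    }
}
```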

If your original byte[] is in a real CJK encoding, then you need to name that encoding instead of UTF-8 above (and you can do that by getting the file.encoding system property if you are running on the same platform), like so:

byte[] byt = myString.getBytes("ISO8859_1");
String ucs2 = new String(byt, System.getProperty("file.encoding"));
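
The same idea with the encoding passed in by name might look like the sketch below. Shift_JIS is used purely as an example of a "real CJK encoding" and is an assumption on my part; it ships with full JDKs but is not guaranteed on every runtime, so the sketch falls back to UTF-8 when it is missing. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class NamedLegacyEncoding {
    /** Re-decodes a one-char-per-byte String using the named charset. */
    static String redecode(String oneCharPerByte, String charsetName) {
        byte[] raw = oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(charsetName));
    }

    /** Builds the "one char per byte" form of a text, for demonstration only. */
    static String widen(String text, String charsetName) {
        byte[] encoded = text.getBytes(Charset.forName(charsetName));
        return new String(encoded, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Pick the example charset only if this runtime actually provides it.
        String name = Charset.isSupported("Shift_JIS") ? "Shift_JIS" : "UTF-8";

        String original = "\u65E5\u672C\u8A9E";
        String widened = widen(original, name);
        System.out.println(name + " round trip ok: " + redecode(widened, name).equals(original));

        // The property mentioned above; only meaningful if the bytes came from this platform.
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
    }
}
```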

If the original String is actually correctly formed and you want to get UTF-8 bytes, Marco's code is correct:

byte[] utf8bytes = myString.getBytes("UTF-8");

Note that I have omitted try/catch blocks for clarity, but the compiler will insist on them...
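
For completeness, here is one way the try/catch could look when the encoding is given by name, as in the snippets above. This is only a sketch; the wrapper class and method names are made up, and rethrowing as IllegalStateException is just one reasonable choice.

```java
import java.io.UnsupportedEncodingException;

final class NamedEncodingWithCatch {
    /** Encodes a well-formed String as UTF-8 bytes. */
    static byte[] toUtf8(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8, so this should not happen.
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    /** Decodes UTF-8 bytes into a String, naming the encoding explicitly. */
    static String fromUtf8(byte[] bytes) {
        try {
            return new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = toUtf8("\u8A9E\u5439");
        System.out.println(utf8.length);                               // prints 6
        System.out.println(fromUtf8(utf8).equals("\u8A9E\u5439"));     // prints true
    }
}
```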

Hope that helps.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-------------------------------------------------
Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Marco Cimarosti
> Sent: Thursday, September 12, 2002 4:51 AM
> To: 'pr1@club-internet.fr'; unicode@unicode.org
> Subject: RE: Problems converting from UTF-8 to UCS-2 and vice-versa
> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
>
>
> Philippe de Rochambeau wrote:
> > On the other hand, if I store the previous "go" character plus an unusual
> > CJK ideogram whose Unicode equivalent is \u5439 (E5 90 B9 in UTF-8)
> > in the DB and retrieve the data, JRun 3.1 will only display the first
> > character in my form's textarea, plus a few invisible characters, and the
> > database will contain the following hex values:
> >
> > E8 AA 9E E5 3F B9 20 20 20 20 20 20 0D 0A 0A
> >
> > As you can see, "go" is still there, but the following character (E5 3F B9)
> > is not \u5439 (E5 90 B9). I cannot figure out how to fix this problem.
> >
> > Any help with this problem would be much appreciated.
>
> I see what the problem is. As usual, it's all the fault of Bill Gate$. :-)
>
> If you interpret <E5, 90, B9> according to Windows-1252, you see that E5 is
> "å", B9 is "¹", but 90 is an unassigned slot! Unassigned characters are
> normally turned into question marks, and "?"'s code is (guess what) 3F...
>
> <E8, AA, 9E> this works only by chance, because all three bytes are valid
> Windows-1252 characters: "è", "ª", and "ž", respectively.
>
> I guess that the problem starts when you try to fool the system into
> thinking that the text is ISO 8859-1:
>
> byte[] byt = (newQfLibelleArray[i]).getBytes( "ISO8859_1" );
> String tempUtf16 = new String( byt );
>
> But, sorry. I can't help with a fix, because I don't know Java API's well
> enough.
>
> Can't you do something like <.getBytes("UTF-8")>? Or, even better, doesn't
> (newQfLibelleArray[i]) have a method to return a <String> object directly?
>
> _ Marco
>
>
>
>

This archive was generated by hypermail 2.1.2 : Thu Sep 12 2002 - 13:22:14 EDT