RE: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1

From: Addison Phillips [wM]
Date: Thu Sep 12 2002 - 12:33:18 EDT

For some reason I don't see the original email, so I'm going to guess based on Marco's response below.

The code below is nearly correct, assuming that the starting point was that each UTF-8 byte was converted into a single char in the String. That is, if the String contained the sequence U+00E8 U+00AA U+009E..., the code would be:

byte[] byt = myString.getBytes("ISO8859_1"); // get the original UTF-8 bytes back
String ucs2 = new String(byt, "UTF-8"); // turn them into a real UCS-2 string
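
As a self-contained sketch of the same round trip: the class and method names here are made up for illustration, and it uses java.nio.charset.StandardCharsets, which only exists in later Java versions (7 and up), so it is not something the original poster's JVM would have had. The three chars U+00E8 U+00AA U+009E should come back as the single character U+8A9E.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Recovery {
    // Hypothetical helper: each char of the input is assumed to hold exactly
    // one raw UTF-8 byte (values 0x00..0xFF).
    static String recover(String oneBytePerChar) {
        // ISO 8859-1 maps chars U+0000..U+00FF to bytes 0x00..0xFF one to one,
        // so this gives the original UTF-8 bytes back.
        byte[] raw = oneBytePerChar.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes as what they really are.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String damaged = "\u00E8\u00AA\u009E";
        String fixed = recover(damaged);
        System.out.println(fixed.length());                        // 1
        System.out.println(Integer.toHexString(fixed.charAt(0)));  // 8a9e
    }
}
```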

It is very important to name the encoding in the String constructor; otherwise the constructor assumes the JVM's file.encoding (most of the time).

There is an annoying bug/feature in some JVMs on real Asian Windows (including 2K and XP) in which the file.encoding is ignored in favor of the actual System Active code page (SYS_ACP), and setting -Dfile.encoding="someEncoding" doesn't change the String constructor's default behavior. You have to be careful to always name the encoding, not just rely on the system to provide it.
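
A minimal sketch of the "always name the encoding" habit, with hypothetical class and method names. It prints the file.encoding property next to Charset.defaultCharset() (Java 5 and later) so that any mismatch on a given machine is visible; what those two lines print depends entirely on the platform and JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Hypothetical helper: decoding never depends on the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xE8, (byte) 0xAA, (byte) 0x9E};

        // What the JVM was told versus what it reports as its default.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        String explicit = decodeUtf8(bytes);   // same result everywhere
        String implicit = new String(bytes);   // result varies by platform
        System.out.println("same result here? " + explicit.equals(implicit));
    }
}
```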

If your original byte[] is in a real CJK encoding, then you need to name that encoding instead of UTF-8 above (and you can do that by getting the file.encoding system property if you are running on the same platform), like so:

byte[] byt = myString.getBytes("ISO8859_1");
String ucs2 = new String(byt, System.getProperty("file.encoding")); // decode using the platform's encoding
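
The same recovery with a legacy CJK encoding might look like the sketch below. The names are hypothetical, Shift_JIS is only an example (it is not guaranteed to be present in every JVM, hence the Charset.isSupported check), and the fallback to the file.encoding property only makes sense when the code runs on the platform that produced the bytes.

```java
import java.nio.charset.Charset;

public class LegacyRecovery {
    private static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    // Simulate the damage: encode correctly, then decode one byte per char.
    static String damage(String text, String encodingName) {
        byte[] raw = text.getBytes(Charset.forName(encodingName));
        return new String(raw, LATIN1);
    }

    // Undo it: get the raw bytes back, then decode with the real encoding.
    static String recover(String oneBytePerChar, String encodingName) {
        byte[] raw = oneBytePerChar.getBytes(LATIN1);
        return new String(raw, Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        // Assumption: use Shift_JIS when this JVM has it, else the platform encoding.
        String name = Charset.isSupported("Shift_JIS")
                ? "Shift_JIS"
                : System.getProperty("file.encoding");
        String damaged = damage("\u8A9E", name);
        System.out.println(name + ": recovered correctly? "
                + recover(damaged, name).equals("\u8A9E"));
    }
}
```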

If the original String is actually correctly formed and you want to get UTF-8 bytes, Marco's code is correct:

byte[] utf8bytes = myString.getBytes("UTF-8");

Note that I have omitted try/catch blocks for clarity, but the compiler will insist on them...
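
For completeness, here is roughly what that looks like with the try/catch in place. The class and helper names are invented for the example; the test string is the two characters from Philippe's message (U+8A9E followed by U+5439), and the hex dump should match the bytes he expected to find in the database.

```java
import java.io.UnsupportedEncodingException;

public class Utf8Bytes {
    // Hypothetical wrapper around getBytes("UTF-8") with the required try/catch.
    static byte[] toUtf8(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8,
            // so this branch should never be taken.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    // Format bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(toUtf8("\u8A9E\u5439"))); // E8 AA 9E E5 90 B9
    }
}
```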

Hope that helps.

Best Regards,


Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: []On
> Behalf Of Marco Cimarosti
> Sent: Thursday, September 12, 2002 4:51 AM
> To: '';
> Subject: RE: Problems converting from UTF-8 to UCS-2 and vice-versa
> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
> Philippe de Rochambeau wrote:
> > On the other hand, if I store the previous "go" character
> > plus an unusual
> > CJK ideogram whose Unicode equivalent is \u5439 (E5 90 B9 in UTF-8)
> > in the DB and retrieve the data, JRun 3.1 will only display the first
> > character in my form's textarea, plus a few invisible
> > characters, and the
> > database will contain the following hex values:
> >
> > E8 AA 9E E5 3F B9 20 20 20 20 20 20 0D 0A 0A
> >
> > As you can see, "go" is still there, but the following
> > character (E5 3F B9)
> > is not \u5439 (E5 90 B9). I cannot figure out how to fix this problem.
> >
> > Any help with this problem would be much appreciated.
> I see what the problem is. As usual, it's all the fault of Bill Gate$. :-)
> If you interpret <E5, 90, B9> according to Windows-1252, you see
> that E5 is
> "å", B9 is "¹", but 90 is an unassigned slot! Unassigned characters are
> normally turned into question marks, and "?"'s code is (guess
> what) 3F...
> <E8, AA, 9E> this works only by chance, because all three bytes are valid
> Windows-1252 characters: "è", "ª", and "ž", respectively.
> I guess that the problem starts when you try to fool the system into
> thinking that the text is ISO 8859-1:
> byte[] byt = (newQfLibelleArray[i]).getBytes( "ISO8859_1" );
> String tempUtf16 = new String( byt );
> But, sorry. I can't help with a fix, because I don't know Java API's well
> enough.
> Can't you do something like <.getBytes("UTF-8")>? Or, even better, doesn't
> (newQfLibelleArray[i]) have a method to return a <String> object directly?
> _ Marco
