Re: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1

From: Philippe de Rochambeau (pr1@club-internet.fr)
Date: Thu Sep 12 2002 - 16:08:54 EDT


Hello,

> // turn them into a real UCS-2 string
> String ucs2 = new String(byt, "UTF-8");

Isn't UCS-2 the same as UTF-16?
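
For what it's worth, the two only differ outside the Basic Multilingual Plane: UTF-16 stores those code points as surrogate pairs, while UCS-2 cannot represent them at all. Here is a minimal sketch of that difference, assuming a JVM that has Character.toChars (Java 5 or later, so newer than the JVMs discussed in this thread). The class name Utf16Demo is made up for illustration.

```java
public class Utf16Demo {
    // Number of 16-bit code units (Java chars) needed to store one code point.
    public static int codeUnits(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    public static void main(String[] args) {
        // U+5439 is inside the BMP: one code unit in both UCS-2 and UTF-16.
        System.out.println(codeUnits(0x5439));
        // U+1D11E is outside the BMP: UTF-16 needs a surrogate pair for it.
        System.out.println(codeUnits(0x1D11E));
    }
}
```

For the CJK characters in this thread, which are all in the BMP, the distinction should not matter.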

> // get the original UTF-8 bytes back
> byte[] byt = myString.getBytes("ISO8859_1");
> // turn them into a real UCS-2 string
> String ucs2 = new String(byt, "UTF-8");

If I do the above, I still get question marks back, whether I display
the data this way

out.println( rfLibelle );
                
or that way

out.println( new String( rfLibelle.getBytes(), "UTF-8" ) );

I think there is something wrong with either JRun 3.1, Windows 2000 or
SQL Server 2000 (or a combination of them).

I don't have any problems with Tomcat 4 + PostgreSQL on Mac OS X.

Best regards,

Philippe de Rochambeau

On Thursday, 12 Sep 2002, at 18:33 Europe/Paris, Addison Phillips [wM]
wrote:

> For some reason I don't see the original email, so I'm going to
> guess based on Marco's response below.
>
> The code below is nearly correct, assuming that the starting point was
> that each UTF-8 byte was converted into a single char in the String.
> That is, if the String contained the sequence
> U+00E8 U+00AA U+009E..., the code would be:
>
> // get the original UTF-8 bytes back
> byte[] byt = myString.getBytes("ISO8859_1");
> // turn them into a real UCS-2 string
> String ucs2 = new String(byt, "UTF-8");
>
> It is very important to name the encoding in the String constructor,
> otherwise the constructor assumes the JVM's file.encoding, at least
> most of the time.
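
A minimal sketch of the repair described above, assuming the String really does hold one char per original UTF-8 byte. It uses java.nio.charset.StandardCharsets, which only exists in Java 7 and later (an assumption on my part; the JVMs in this thread predate it), so no checked exception is involved. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Repair {
    // Undo a mis-decoding in which each UTF-8 byte became one char (U+0000..U+00FF).
    public static String repair(String misdecoded) {
        // ISO-8859-1 maps chars U+0000..U+00FF one-to-one onto bytes 00..FF,
        // so this recovers the original UTF-8 byte sequence unchanged.
        byte[] utf8Bytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes with the encoding they were actually written in.
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String repaired = repair("\u00E8\u00AA\u009E");
        // Print the code point in hex so the result does not depend on the console encoding.
        System.out.println(Integer.toHexString(repaired.codePointAt(0)));
    }
}
```

Both charsets are named explicitly, so the result does not depend on file.encoding or on the system code page.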
>
> There is an annoying bug/feature in some JVMs on real Asian Windows
> (including 2K and XP) in which the file.encoding is ignored in favor
> of the actual System Active code page (SYS_ACP), and setting
> -Dfile.encoding="someEncoding" doesn't work to change the String
> constructor's default behavior. You have to be careful to always name
> the encoding, not just rely on the system to provide it.
>
> If your original byte[] is in a real CJK encoding, then you need to
> name that encoding instead of UTF-8 above. You can do that by reading
> the file.encoding system property if you are running on the same
> platform, like so:
>
> byte[] byt = myString.getBytes("ISO8859_1");
> String ucs2 = new String(byt, System.getProperty("file.encoding"));
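
A slightly more defensive sketch of the same idea, assuming the bytes really were written in the platform encoding. The property is read with System.getProperty, and the fallback to Charset.defaultCharset() is my own addition in case the property is missing or names a charset this JVM does not know. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PlatformDecode {
    // Look up the charset named by file.encoding, falling back to the JVM default.
    public static Charset platformCharset() {
        String name = System.getProperty("file.encoding");
        try {
            if (name != null) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Unknown or malformed charset name: fall through to the default.
        }
        return Charset.defaultCharset();
    }

    // Re-decode a String whose chars each hold one byte of platform-encoded text.
    public static String redecode(String oneCharPerByte) {
        byte[] raw = oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, platformCharset());
    }

    public static void main(String[] args) {
        System.out.println(platformCharset().name());
        // Pure ASCII text should come back unchanged on any ASCII-compatible platform charset.
        System.out.println(redecode("plain ASCII"));
    }
}
```

Given the caveat about JVMs that ignore file.encoding, naming the real encoding explicitly is probably still the safer route whenever it is known.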
>
> If the original String is actually correctly formed and you want to
> get UTF-8, Marco's code is correct:
>
> byte[] utf8bytes = myString.getBytes("UTF-8");
>
> Note that I have omitted try/catch blocks for clarity, but the
> compiler will insist on them...
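
For completeness, a sketch of what the version with the try/catch block might look like, using the name-based getBytes call quoted in this thread. The class name, the hex helper and the choice to rethrow as IllegalStateException are illustrative assumptions, not something from the original posts.

```java
import java.io.UnsupportedEncodingException;

public class Utf8Bytes {
    // Encode a String as UTF-8 using the name-based API quoted in this thread.
    public static byte[] toUtf8(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8,
            // so reaching this branch would indicate a broken JVM.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    // Format bytes as space-separated uppercase hex, e.g. "E5 90 B9".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(toUtf8("\u5439"))); // prints "E5 90 B9"
    }
}
```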
>
> Hope that helps.
>
> Best Regards,
>
> Addison
>
> Addison P. Phillips
> Director, Globalization Architecture
> webMethods, Inc.
> 432 Lakeside Drive
> Sunnyvale, California, USA
> +1 408.962.5487 (phone)
> +1 408.210.3569 (mobile)
> -------------------------------------------------
> Internationalization is an architecture.
> It is not a feature.
>
>> -----Original Message-----
>> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
>> Behalf Of Marco Cimarosti
>> Sent: Thursday, September 12, 2002 4:51 AM
>> To: 'pr1@club-internet.fr'; unicode@unicode.org
>> Subject: RE: Problems converting from UTF-8 to UCS-2 and vice-versa
>> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
>>
>>
>> Philippe de Rochambeau wrote:
>>> On the other hand, if I store the previous "go" character
>>> plus an unusual
>>> CJK ideogram whose Unicode equivalent is \u5439 (E5 90 B9 in UTF-8)
>>> in the DB and retrieve the data, JRun 3.1 will only display the first
>>> character in my form's textarea, plus a few invisible
>>> characters, and the
>>> database will contain the following hex values:
>>>
>>> E8 AA 9E E5 3F B9 20 20 20 20 20 20 0D 0A 0A
>>>
>>> As you can see, "go" is still there, but the following
>>> character (E5 3F B9)
>>> is not \u5439 (E5 90 B9). I cannot figure out how to fix this
>>> problem.
>>>
>>> Any help with this problem would be much appreciated.
>>
>> I see what the problem is. As usual, it's all the fault of Bill
>> Gate$. :-)
>>
>> If you interpret <E5, 90, B9> according to Windows-1252, you see that
>> E5 is "å", B9 is "¹", but 90 is an unassigned slot! Unassigned
>> characters are normally turned into question marks, and "?"'s code is
>> (guess what) 3F...
>>
>> <E8, AA, 9E> works only by chance, because all three bytes are valid
>> Windows-1252 characters: "è", "ª", and "ž", respectively.
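
The effect described above can be reproduced in a few lines. This is only a sketch: it assumes the damage comes from a decode/encode pass through windows-1252 somewhere in the chain (the thread does not establish where), and it relies on Java's windows-1252 decoder treating byte 90 as unmappable, which is the behaviour I know of in Sun/Oracle JDKs. The class and method names are hypothetical, and StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252RoundTrip {
    // Decode bytes as windows-1252, then encode the resulting String back.
    public static byte[] roundTrip(byte[] input) {
        Charset cp1252 = Charset.forName("windows-1252");
        // Bytes with no windows-1252 assignment become a replacement character here...
        String decoded = new String(input, cp1252);
        // ...and that replacement character cannot be encoded, so it comes out as "?".
        return decoded.getBytes(cp1252);
    }

    // Format bytes as space-separated uppercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "go" (U+8A9E) followed by U+5439, as in the original report.
        byte[] utf8 = "\u8A9E\u5439".getBytes(StandardCharsets.UTF_8);
        System.out.println("before: " + hex(utf8));
        System.out.println("after:  " + hex(roundTrip(utf8)));
    }
}
```

If the "after" line shows 3F where 90 used to be, that matches the hex dump from the database and suggests the bytes were treated as windows-1252 text at some point.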
>>
>> I guess that the problem starts when you try to fool the system into
>> thinking that the text is ISO 8859-1:
>>
>> byte[] byt = (newQfLibelleArray[i]).getBytes( "ISO8859_1" );
>> String tempUtf16 = new String( byt );
>>
>> But, sorry, I can't help with a fix, because I don't know the Java
>> APIs well enough.
>>
>> Can't you do something like <.getBytes("UTF-8")>? Or, even better,
>> doesn't (newQfLibelleArray[i]) have a method to return a <String>
>> object directly?
>>
>> _ Marco
>>
>>
>>
>>
>
>



This archive was generated by hypermail 2.1.2 : Thu Sep 12 2002 - 16:54:22 EDT