Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1

From: pr1@club-internet.fr
Date: Thu Sep 12 2002 - 04:14:34 EDT


Hello,

I am having problems converting from UTF-8 to UCS-2 and vice-versa
using JRun 3.1 as servlet/JSP engine, SQL Server 2000 as database,
Windows 2000 as OS, and Java 3.1 as programming language.

Some Asian characters are correctly stored in the database and
displayed onscreen while others are not.

For instance, if I copy/paste the Japanese Kanji "go", whose Unicode
value is \u8A9E, from an MS-Word document in Japanese, into an
HTML form that is displayed using the UTF-8 character set, and save it
to my database using the following Java code:

byte[] byt = (newQfLibelleArray[i]).getBytes( "ISO8859_1" );
String tempUtf16 = new String( byt );

<where newQfLibelleArray[i] contains the "go" character>
<ISO8859_1 is JRun 3.1's default encoding>
                                                
reponsefaq.updateReponseFaq( rfReponseFaqIdArray[i].intValue(),
rfQuestionFaqIdArray[i].intValue(), tempUtf16,
newRfActiveArray[i].intValue() );

the database will contain the following characters in hexadecimal:

E8 AA 9E 20 20 0D 0A 0A

which begin with \u8A9E's value in UTF-8, E8 AA 9E, followed by spaces
and line breaks. I am not sure why the character was stored as UTF-8
bytes, given that SQL Server 2000's native Unicode encoding is UCS-2
and that I passed it as a UTF-16 Java string
("tempUtf16 = new String( byt )").
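A minimal sketch of what may be happening, under two assumptions that
the post does not confirm: that JRun decoded the UTF-8 form bytes as
ISO-8859-1 (one char per byte), and that the platform default encoding
used by new String( byt ) is windows-1252. The class and method names
are hypothetical, and the java.nio.charset API used here is newer than
the Java version described above; it only serves to make the charsets
explicit.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FormDecodeSketch {
    // Assumed platform default on a Western-locale Windows machine.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates the string the container is assumed to hand over:
    // UTF-8 form bytes decoded as ISO-8859-1, one char per byte.
    static String asReceivedFromContainer(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Recovers the raw form bytes, then decodes them as UTF-8.
    static String decodeAsUtf8(String received) {
        byte[] raw = received.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Recovers the raw form bytes, then decodes them as windows-1252,
    // which is what new String(byt) would do under the assumption above.
    static String decodeAsCp1252(String received) {
        byte[] raw = received.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP1252);
    }

    public static void main(String[] args) {
        String received = asReceivedFromContainer("\u8A9E");
        // One char: U+8A9E, the "go" character itself.
        System.out.println(decodeAsUtf8(received).length());
        // Three chars: U+00E8, U+00AA, U+017E, one per UTF-8 byte.
        System.out.println(decodeAsCp1252(received).length());
    }
}
```

If these assumptions hold, the string passed to updateReponseFaq holds
three Western characters rather than one Kanji, and writing them to a
code page 1252 column would produce the bytes E8 AA 9E.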

The character will be displayed correctly when read back if I convert
it to UTF-8 using the following code:

out.println( new String( rfLibelle.getBytes(), "UTF-8" ) );
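Note that getBytes() with no argument uses the platform default
encoding, so the line above behaves differently from machine to
machine. A hedged sketch of the same read-back step with both charsets
named explicitly, assuming the default was windows-1252 (the class and
method names are hypothetical, and the Charset overloads are from a
newer Java than the one used here):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReadBackSketch {
    // Assumption: the encoding that getBytes() picked up by default.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Turns the characters read from the database back into bytes,
    // then interprets those bytes as UTF-8.
    static String redecode(String fromDatabase) {
        byte[] raw = fromDatabase.getBytes(CP1252);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E8 U+00AA U+017E are the windows-1252 characters for
        // the bytes E8 AA 9E.
        String fromDatabase = "\u00E8\u00AA\u017E";
        System.out.println(redecode(fromDatabase).equals("\u8A9E")); // true
    }
}
```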

On the other hand, if I store the previous "go" character plus an unusual
CJK ideogram whose Unicode equivalent is \u5439 (E5 90 B9 in UTF-8)
in the DB and retrieve the data, JRun 3.1 will only display the first
character in my form's textarea, plus a few invisible characters, and the
database will contain the following hex values:

E8 AA 9E E5 3F B9 20 20 20 20 20 20 0D 0A 0A

As you can see, "go" is still there, but the following character (E5 3F B9)
is not \u5439 (E5 90 B9). I cannot figure out how to fix this problem.
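One possible explanation, sketched below, is that the byte 0x90 has no
character assigned in windows-1252, so decoding it loses the value and
re-encoding writes a question mark (0x3F). The sketch assumes that
windows-1252 is the encoding involved on both sides, which the post
does not confirm; the class and helper names are hypothetical, and the
Charset overloads are from a newer Java than the one used here.

```java
import java.nio.charset.Charset;

final class Cp1252LossSketch {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes the bytes as windows-1252, then encodes them again.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, CP1252);
        return decoded.getBytes(CP1252);
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] go = {(byte) 0xE8, (byte) 0xAA, (byte) 0x9E};    // U+8A9E
        byte[] other = {(byte) 0xE5, (byte) 0x90, (byte) 0xB9}; // U+5439
        System.out.println(hex(roundTrip(go)));    // E8 AA 9E
        System.out.println(hex(roundTrip(other))); // E5 3F B9
    }
}
```

Under that assumption, "go" survives only because all three of its
UTF-8 bytes happen to be assigned characters in windows-1252.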

Any help with this problem would be much appreciated.

Best regards,

Philippe de Rochambeau


