RE: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Thu Sep 12 2002 - 22:12:02 EDT

Previous message: Stefan Persson: "UCS-2 and UTF-16"
In reply to: Philippe de Rochambeau: "Re: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1"

Hi Philippe,

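UTF-16 is (kind of) UCS-2: the two are identical for characters in the
Basic Multilingual Plane, and UTF-16 adds surrogate pairs for everything
beyond it.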

What's your system code page? System.out.println uses your system code page
to display characters--it does an implicit conversion. To check your code,
try this:

char[] c = myUCSString.toCharArray();
for (int x = 0; x < c.length; x++) {
    System.out.print(Integer.toHexString((int) c[x]) + " ");
}

This will show you the actual hex values of the characters (as a string).
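
As a fuller sketch of that check, the class below dumps the UTF-16 code
units of a string before and after the ISO8859_1/UTF-8 repair quoted further
down in this thread. The names HexDump, toHex and repair are made up for
illustration, and the sample input assumes the failure mode where each UTF-8
byte ended up as one char in the String.

```java
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Returns the UTF-16 code units of s as space-separated hex values.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(c));
        }
        return sb.toString();
    }

    // Assumption: every char of s holds exactly one UTF-8 byte (0x00-0xFF).
    // ISO-8859-1 maps those chars back to the same byte values, which are
    // then decoded as the UTF-8 they really were.
    static String repair(String s) {
        byte[] raw = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String broken = "\u00E8\u00AA\u009E";       // UTF-8 bytes of U+8A9E, one per char
        System.out.println(toHex(broken));          // e8 aa 9e
        System.out.println(toHex(repair(broken)));  // 8a9e
    }
}
```

Naming the charset explicitly on both calls keeps the result independent of
the platform default encoding.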

I should note that UTF-8 isn't a valid character set for SQL Server. You
need to use the nvarchar/nchar data type for your database to store Unicode.
You can't choose UTF-8 as the code page for SQL Server 2000. Storing UTF-8
in your SQL Server is a recipe for problems (especially since you MUST not
use code page 1252 to lie to the database).
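
To illustrate why pushing UTF-8 bytes through code page 1252 is risky, here
is a minimal sketch that round-trips the UTF-8 bytes of U+5439 through
ISO-8859-1 and through windows-1252. The class and method names are
hypothetical, and it assumes the JVM ships the windows-1252 charset. Every
byte value is defined in ISO-8859-1, so that round trip is lossless; byte 90
has no assignment in windows-1252, so it typically comes back as 3F ("?").

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252RoundTrip {
    // Decodes raw bytes as if they were text in cs, then encodes them back.
    static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    // Formats bytes as space-separated two-digit upper-case hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u5439".getBytes(StandardCharsets.UTF_8); // E5 90 B9
        // Lossless: all 256 byte values map to a character and back.
        System.out.println(hex(roundTrip(utf8, StandardCharsets.ISO_8859_1)));
        // Lossy: the unassigned byte 90 does not survive the round trip.
        System.out.println(hex(roundTrip(utf8, Charset.forName("windows-1252"))));
    }
}
```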

I have more information on encodings in databases in this whitepaper (from
Unicode Conference 19): http://www.inter-locale.com/IUC19.pdf

Hope that helps.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-------------------------------------------------
Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Philippe de Rochambeau
> Sent: Thursday, September 12, 2002 1:09 PM
> To: Addison Phillips [wM]
> Cc: marco.cimarosti@essetre.it; unicode@unicode.org
> Subject: Re: Problems converting from UTF-8 to UCS-2 and vice-versa
> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
>
>
> Hello,
>
> > // turn them into a real UCS-2 string
> > String ucs2 = new String(byt, "UTF-8");
>
> Isn't UCS-2, UTF-16?
>
> > // get the original UTF-8 bytes back
> > byte[] byt = myString.getBytes("ISO8859_1");
> > // turn them into a real UCS-2 string
> > String ucs2 = new String(byt, "UTF-8");
>
> If I do the above, I get the question marks back, whether I display
> the data this way
>
> out.println( rfLibelle );
>
> or that way
>
> out.println( new String( rfLibelle.getBytes(), "UTF-8" ) );
>
> I think there is something wrong with either JRun 3.1, Windows 2000 or
> SQL Server 2000 (or a combination of them).
>
> I don't have any problems with Tomcat 4 + PostgreSQL on MacOSX.
>
> Best regards,
>
> Philippe de Rochambeau
>
> Le jeudi, 12 sep 2002, à 18:33 Europe/Paris, Addison Phillips [wM] a
> écrit :
>
> > For some reason I don't see the original email, so I'm going to
> > guess based on Marco's response below.
> >
> > The code below is nearly correct, assuming that the starting point was
> > that each UTF-8 byte was converted into a single java.lang.Character
> > object in the String. That is, if the String contained the sequence
> > U+00E8 U+00AA U+009E..., the code would be:
> >
> > // get the original UTF-8 bytes back
> > byte[] byt = myString.getBytes("ISO8859_1");
> > // turn them into a real UCS-2 string
> > String ucs2 = new String(byt, "UTF-8");
> >
> > It is very important to name the encoding in the String constructor;
> > otherwise the constructor assumes the JVM's file.encoding, most of the
> > time.
> >
> > There is an annoying bug/feature in some JVMs on real Asian Windows
> > (including 2K and XP) in which the file.encoding is ignored in favor
> > of the actual System Active code page (SYS_ACP), and setting
> > -Dfile.encoding="someEncoding" doesn't change the String
> > constructor's default behavior. You have to be careful to always name
> > the encoding, not just rely on the system to provide it.
> >
> > If your original byte[] is in a real CJK encoding, then you need to
> > name that encoding instead of UTF-8 above (and you can do that by
> > getting the file.encoding system property if you are running on the
> > same platform), like so:
> >
> > byte[] byt = myString.getBytes("ISO8859_1");
> > String ucs2 = new String(byt, System.getProperty("file.encoding"));
> >
> > If the original byte[] is actually correctly formed and you want to
> > get UTF-8, Marco's code is correct:
> >
> > byte[] utf8bytes = myString.getBytes("UTF-8");
> >
> > Note that I have omitted try/catch blocks for clarity, but the
> > compiler will insist on them...
> >
> > Hope that helps.
> >
> > Best Regards,
> >
> > Addison
> >
> > Addison P. Phillips
> > Director, Globalization Architecture
> > webMethods, Inc.
> > 432 Lakeside Drive
> > Sunnyvale, California, USA
> > +1 408.962.5487 (phone)
> > +1 408.210.3569 (mobile)
> > -------------------------------------------------
> > Internationalization is an architecture.
> > It is not a feature.
> >
> >> -----Original Message-----
> >> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> >> Behalf Of Marco Cimarosti
> >> Sent: Thursday, September 12, 2002 4:51 AM
> >> To: 'pr1@club-internet.fr'; unicode@unicode.org
> >> Subject: RE: Problems converting from UTF-8 to UCS-2 and vice-versa
> >> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
> >>
> >>
> >> Philippe de Rochambeau wrote:
> >>> On the other hand, if I store the previous "go" character plus an
> >>> unusual CJK ideogram whose Unicode equivalent is \u5439 (E5 90 B9 in
> >>> UTF-8) in the DB and retrieve the data, JRun 3.1 will only display the
> >>> first character in my form's textarea, plus a few invisible characters,
> >>> and the database will contain the following hex values:
> >>>
> >>> E8 AA 9E E5 3F B9 20 20 20 20 20 20 0D 0A 0A
> >>>
> >>> As you can see, "go" is still there, but the following character
> >>> (E5 3F B9) is not \u5439 (E5 90 B9). I cannot figure out how to fix
> >>> this problem.
> >>>
> >>> Any help with this problem would be much appreciated.
> >>
> >> I see what the problem is. As usual, it's all the fault of Bill
> >> Gate$. :-)
> >>
> >> If you interpret <E5, 90, B9> according to Windows-1252, you see that
> >> E5 is "å", B9 is "¹", but 90 is an unassigned slot! Unassigned
> >> characters are normally turned into question marks, and "?"'s code is
> >> (guess what) 3F...
> >>
> >> <E8, AA, 9E> works only by chance, because all three bytes are valid
> >> Windows-1252 characters: "è", "ª", and "ž", respectively.
> >>
> >> I guess that the problem starts when you try to fool the system into
> >> thinking that the text is ISO 8859-1:
> >>
> >> byte[] byt = (newQfLibelleArray[i]).getBytes( "ISO8859_1" );
> >> String tempUtf16 = new String( byt );
> >>
> >> But, sorry, I can't help with a fix, because I don't know the Java
> >> APIs well enough.
> >>
> >> Can't you do something like <.getBytes("UTF-8")>? Or, even better,
> >> doesn't (newQfLibelleArray[i]) have a method to return a <String>
> >> object directly?
> >>
> >> _ Marco


This archive was generated by hypermail 2.1.2 : Thu Sep 12 2002 - 22:51:43 EDT