RE: utf-8 to ucs-2

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Wed Jun 19 2002 - 11:26:31 EDT

Previous message: Stefan Persson: "Re: What are the default CJK encodings for Windows?"
In reply to: Michael (michka) Kaplan: "Re: utf-8 to ucs-2"

I can't second MichKa's advice enough: don't lie to your database!

As for your problem with Java, it's pretty straightforward.

The issue here is that your UTF-8 bytes are probably stored as separate characters in the nvarchar column. Let's take an example:

The Euro symbol (U+20AC) is 0xE2 0x82 0xAC in UTF-8. When these bytes were stored in the database, they were probably stored as separate characters, e.g. U+00E2 U+0082 U+00AC.
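
To make the example concrete, here is a small sketch that reproduces the damage: it encodes the Euro symbol as UTF-8 and then widens each byte into its own character, which is what probably happened on the way into the database. The class and method names are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

final class EuroBytesDemo {
    // Encode the text as UTF-8, then turn every byte into a separate char,
    // mimicking how the bytes were probably stored in the nvarchar column.
    static String widenUtf8Bytes(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(utf8.length);
        for (byte b : utf8) {
            sb.append((char) (b & 0xFF)); // 0xE2 becomes U+00E2, and so on
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String stored = widenUtf8Bytes("\u20AC");
        for (char c : stored.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println(); // prints: U+00E2 U+0082 U+00AC
    }
}
```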

When you retrieve these using Java via JDBC, you will get a String object. A String object is always encoded as UCS-2 (UTF-16), so the characters are still U+00E2 U+0082 U+00AC. SQL Server nvarchar uses the same representation as a Java String, which is lucky, because the bytes won't have "moved around" due to differences in encoding between Java and the database.

In Java, UTF-8, like all non-UCS-2 encodings, is treated as a byte[]. What you want to do is turn the characters in your String back into the original UTF-8 byte[] and then decode that byte[] into a new String. For the first step you need a transformation that won't move any of the bytes around. There is an encoding that maps the Unicode characters U+0000 through U+00FF one-to-one onto the bytes 0x00 through 0xFF. It's ISO-8859-1 (Latin-1).
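
If you want to convince yourself of that linear mapping, a quick check along these lines should do it. This is only a sketch: the class name is invented, and StandardCharsets assumes Java 7 or later. It also shows the catch, namely that anything above U+00FF cannot be represented and is replaced with '?'.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Identity {
    // True if every char U+0000..U+00FF encodes to the single byte of the same value.
    static boolean isIdentityOnFirst256() {
        for (int cp = 0; cp <= 0xFF; cp++) {
            byte[] encoded = String.valueOf((char) cp).getBytes(StandardCharsets.ISO_8859_1);
            if (encoded.length != 1 || (encoded[0] & 0xFF) != cp) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentityOnFirst256()); // prints: true

        // Characters above U+00FF have no Latin-1 byte; getBytes substitutes '?' (0x3F).
        byte[] euro = "\u20AC".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(euro.length == 1 && euro[0] == 0x3F); // prints: true
    }
}
```

So the trick only works while every character in the column is in the range U+0000 to U+00FF.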

So your code should look something like this:

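String ucs2String = null;
try {
   // ISO8859_1 turns each char U+0000..U+00FF back into the byte of the same value,
   // recovering the original UTF-8 bytes; the constructor then decodes them as UTF-8
   ucs2String = new String(dbString.getBytes("ISO8859_1"), "UTF-8");
} catch (UnsupportedEncodingException uex) {
   // encoding not supported, should never get here (needs java.io.UnsupportedEncodingException)
}
// no exception is thrown if the original sequence wasn't UTF-8 or ASCII:
// malformed bytes are silently decoded as U+FFFD, so check the result for it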

The String getBytes method uses the supplied encoding to convert the String to a byte[]. In this case, the byte[] contains 0xE2 0x82 0xAC (the original UTF-8 bytes). The string constructor shown then uses the UTF-8 encoding to interpret the bytes, returning a new String containing just one character (U+20AC in this case).
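
One caveat: the String constructor does not complain about malformed input, it quietly substitutes U+FFFD, and getBytes quietly substitutes '?' for characters it cannot encode. If you would rather be told when a value wasn't really byte-widened UTF-8, something like the following sketch might help. It uses the java.nio.charset encoder and decoder (Java 1.4 and later; StandardCharsets needs Java 7) set to report errors. The class and method names are hypothetical, and returning null for bad rows is just one possible policy.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Repair {
    // Recover the original text from a String whose chars are really UTF-8 bytes.
    // Throws CharacterCodingException (a subclass of IOException) if the String
    // contains chars above U+00FF or the recovered bytes are not valid UTF-8.
    static String repair(String dbString) throws CharacterCodingException {
        ByteBuffer utf8Bytes = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(dbString));
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(utf8Bytes)
                .toString();
    }

    // Convenience wrapper: null means "this value needs a human to look at it".
    static String repairOrNull(String dbString) {
        try {
            return repair(dbString);
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String fixed = repairOrNull("\u00E2\u0082\u00AC");
        System.out.println("\u20AC".equals(fixed));               // prints: true
        System.out.println(repairOrNull("\u00E2\u0082") == null); // prints: true (truncated sequence)
        System.out.println(repairOrNull("\u20AC") == null);       // prints: true (already a real Euro sign)
    }
}
```

Plain ASCII passes through unchanged, since ASCII is valid UTF-8.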

Hope that helps.

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc. 432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone) +1 408.210.3659 (mobile)
-------------------------------------------------
Internationalization is an architecture. It is not a feature.

> -----Original Message-----
> From: unicode-bounce@unicode.org
> [mailto:unicode-bounce@unicode.org]On Behalf Of Michael (michka) Kaplan
> Sent: June 18, 2002 23:40
> To: unicode@unicode.org
> Subject: Re: utf-8 to ucs-2
>
>
> On the whole, it is really a *bad* idea to store UTF-8 data in an MSSQLS
> nvarchar column. The only thing I would really suggest is that you get the
> data out via whatever means you used to get it in, then make some quick
> MultiByteToWideChar calls to convert the data.
>
> SQL Server itself does not provide tools here -- this is the disadvantage of
> lying to a database engine (you can't expect much help from it later!).
>
>
> MichKa
>
> Michael Kaplan
> Trigeminal Software, Inc. -- http://www.trigeminal.com/
>
>
> ----- Original Message -----
> From: "Paul Hastings" <paul@tei.or.th>
> To: <unicode@unicode.org>
> Sent: Tuesday, June 18, 2002 10:55 PM
> Subject: utf-8 to ucs-2
>
>
> > since there were so many translation questions this week i
> > guess one more won't hurt. i have a bunch of text data,
> > utf-8 encoded, stored in sql server nvarchar columns.
> > data was inserted using coldfusion 5 which really didn't
> > support unicode (hence the utf-8 encoding). i now need
> > to xfer this data to a cfmx (coldfusion 6) system that
> > wants ucs-2 encoding (cf is now java based & uses
> > merant jdbc drivers).
> >
> > i've been playing around with java string class getBytes
> > method but i can't seem to get it to understand that the
> > input really is utf-8 (yes, i'm a java novice).
> >
> > would anyone point me towards some info/resources
> > that might help? any advice/suggestions also welcome.
> >
> > thanks.
> > ----------------------------------------------------
> > Paul Hastings paul@tei.or.th
> > Director Environmental Information Center
> > Thailand Environment Institute
> > Member Team Macromedia (Allaire)
> > http://www.tei.or.th/eic ---------------------------


This archive was generated by hypermail 2.1.2 : Wed Jun 19 2002 - 10:51:00 EDT