Re: Unicode character transformation through XSLT

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Mar 12 2003 - 17:38:42 EST

Next message: jameskass@att.net: "Re: Ligatures fj etc (from Re: Ligatures (qj) )"

Previous message: Jain, Pankaj (MED, TCS): "RE: Unicode character transformation through XSLT"
In reply to: Jain, Pankaj (MED, TCS): "RE: Unicode character transformation through XSLT"
Next in thread: Yung-Fong Tang: "Re: Unicode character transformation through XSLT"

Generally, try instantiating an InputStreamReader or similar on your input, with an explicit
encoding name such as "UTF-8" (Java also accepts the alias "UTF8"). That performs the conversion
from UTF-8 to the 16-bit UTF-16 code units that Java uses internally for char and String.
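
A minimal sketch of that suggestion, not taken from the original code: the class and method names
are made up for illustration, and StandardCharsets requires Java 7 or later (on older JVMs, pass
the string "UTF-8" to the InputStreamReader constructor instead).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8ReaderSketch {
    // Hypothetical helper: decodes a UTF-8 byte stream into a Java String.
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        // The Reader does the UTF-8 to UTF-16 conversion; no manual casts needed.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 }; // U+00E9 encoded as UTF-8
        String s = readUtf8(new ByteArrayInputStream(utf8));
        System.out.println(s.length());                        // 1
        System.out.println(Integer.toHexString(s.charAt(0)));  // e9
    }
}
```

The two input bytes come out as a single char, which is what you want for any stream that you know
contains UTF-8 text.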

Always use XYZReader classes for text input and XYZWriter classes for text output.
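
For the output side, the same idea can be sketched with an OutputStreamWriter. Again, this is only
an illustration under assumptions: the helper name is hypothetical and StandardCharsets needs
Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8WriterSketch {
    // Hypothetical helper: encodes a Java String as UTF-8 bytes through a Writer.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the Writer (via try-with-resources) flushes the encoded bytes.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = writeUtf8("\u00e9");
        System.out.println(bytes.length);                            // 2
        System.out.println(Integer.toHexString(bytes[0] & 0xff));    // c3
        System.out.println(Integer.toHexString(bytes[1] & 0xff));    // a9
    }
}
```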

java.sun.com has tutorials on Internationalization etc. that I recommend.
See also http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/

Your code takes UTF-8 byte values, converts them to signed 8-bit values, sign-extends and casts
those to unsigned 16-bit values, and then treats these mistreated UTF-8 byte values as if they
were 16-bit UTF-16 code units.

Let's take this line by line to see what happens:

Jain, Pankaj (MED, TCS) wrote:
> Here is my code..
>
> while(rsResult.next())
> {
> /*Get the file contents from the value column*/
> ipStream = rsResult.getBinaryStream("VALUE");

This is the source of the problem. You read the input as binary instead of as UTF-8 text.

> strBuf = new StringBuffer();
> while((chunk = ipStream.read())!=-1)
> {
> byte byChunk = new Integer(chunk).byteValue();

Now you get one byte at a time. In Java, byte is a signed type, so 0x80..0xff are actually negative
values: 0x80=-128 .. 0xff=-1.
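
The signedness is easy to check in isolation. A small sketch (the class and method names are
hypothetical; the plain cast is used here in place of new Integer(chunk).byteValue(), which
yields the same byte):

```java
public class SignedByteSketch {
    // Narrows an int in 0..255, as returned by InputStream.read(), to a Java byte.
    static byte toByte(int chunk) {
        return (byte) chunk;
    }

    public static void main(String[] args) {
        System.out.println(toByte(0x7f));          // 127
        System.out.println(toByte(0x80));          // -128
        System.out.println(toByte(0xff));          // -1
        // Masking with 0xff recovers the unsigned value as an int.
        System.out.println(toByte(0x80) & 0xff);   // 128
    }
}
```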

> strBuf.append((char) byChunk);

This sign-extends the signed byte value and then casts it to an unsigned 16-bit unit (Java char
is 16 bits wide). For example, 0x80 is negative (-128) as a byte; sign-extended and cast to
unsigned, it becomes 0xff80. You append this mistreated value to a StringBuffer, which treats it
as a UTF-16 code unit.
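
The effect of the cast can be reproduced on its own. In this sketch the names are hypothetical;
note that the masked variant only avoids the sign extension and is still not a UTF-8 decoder,
since it maps each byte to the code point with the same number.

```java
public class ByteToCharSketch {
    // The cast from the quoted code: the byte is sign-extended, then truncated to 16 bits.
    static char castDirect(byte b) {
        return (char) b;
    }

    // Masking first keeps the value in 0..255; this treats bytes as ISO-8859-1, not UTF-8.
    static char castMasked(byte b) {
        return (char) (b & 0xff);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;
        System.out.println(Integer.toHexString(castDirect(b)));  // ff80
        System.out.println(Integer.toHexString(castMasked(b)));  // 80
    }
}
```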

> }
> prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
> }
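
To see the whole effect end to end, here is a sketch that runs the per-byte loop from the quoted
code and a real UTF-8 decode over the same two bytes. It works on a byte array rather than a
database column so that it is self-contained; the class and method names are hypothetical, and
StandardCharsets needs Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class CastVersusDecodeSketch {
    // Mirrors the quoted loop: read one byte at a time and cast it straight to char.
    static String perByteCast(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        StringBuilder sb = new StringBuilder();
        int chunk;
        while ((chunk = in.read()) != -1) {
            byte byChunk = (byte) chunk;   // same value as new Integer(chunk).byteValue()
            sb.append((char) byChunk);     // sign extension corrupts bytes 0x80..0xff
        }
        return sb.toString();
    }

    // Decodes the same bytes as UTF-8 text.
    static String decodeUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 }; // U+00E9 encoded as UTF-8
        String bad = perByteCast(utf8);
        String good = decodeUtf8(utf8);
        System.out.println(Integer.toHexString(bad.charAt(0)));   // ffc3
        System.out.println(Integer.toHexString(bad.charAt(1)));   // ffa9
        System.out.println(Integer.toHexString(good.charAt(0)));  // e9
    }
}
```

The cast version yields two meaningless code units where the decoded version yields the one
character that the bytes actually encode.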

markus

This archive was generated by hypermail 2.1.5 : Wed Mar 12 2003 - 18:32:52 EST