Re: Unicode character transformation through XSLT

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Mar 12 2003 - 17:38:42 EST


    Generally, try constructing an InputStreamReader (or a similar Reader) on your input, with an
    explicit encoding="UTF8". The reader then converts the UTF-8 bytes to the 16-bit UTF-16 code
    units that Java uses internally.
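
    As a minimal sketch of that advice (not the original poster's code): the class and method
    names below are made up for illustration, and StandardCharsets assumes Java 7 or later. On
    older JDKs the encoding name "UTF-8" can be passed as a String instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8ReadSketch {

    // Decodes every byte of the stream as UTF-8 and returns the text.
    static String readUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        // The Reader does the UTF-8 to UTF-16 conversion; no manual casts.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9.
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};
        String s = readUtf8(new ByteArrayInputStream(utf8));
        // Two input bytes decode to a single UTF-16 code unit.
        System.out.println(s.length() + " " + Integer.toHexString(s.charAt(0)));
    }
}
```

    In the quoted loop, the stream returned by getBinaryStream("VALUE") could be passed to such a
    helper in place of the byte-by-byte casts, assuming the column really holds UTF-8 bytes.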

    Always use XYZReader classes for text input and XYZWriter classes for text output.

    java.sun.com has tutorials on Internationalization etc. that I recommend.
    See also http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/

    Your code takes UTF-8 byte values, mis-casts them first to signed 8-bit and then to unsigned
    16-bit values, and re-interprets these mistreated UTF-8 byte values as if they were 16-bit
    UTF-16 code units.

    Let's take this line by line to see what happens:

    Jain, Pankaj (MED, TCS) wrote:
    > Here is my code..
    >
    > while(rsResult.next())
    > {
    > /*Get the file contents from the value column*/
    > ipStream = rsResult.getBinaryStream("VALUE");

    This is the source of the problem. You read the input as binary instead of as UTF-8 text.

    > strBuf = new StringBuffer();
    > while((chunk = ipStream.read())!=-1)
    > {
    > byte byChunk = new Integer(chunk).byteValue();

    Now you get one byte at a time. In Java, byte is a signed type, so 0x80..0xff are actually negative
    values: 0x80=-128 .. 0xff=-1.
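
    The signedness can be checked with a small sketch. The class and method names are
    hypothetical, and Integer.valueOf stands in for the quoted new Integer(chunk), which
    behaves the same way here.

```java
// Hypothetical demo class, for illustration only.
class SignedByteSketch {

    // Mirrors the quoted conversion: an int in 0..255 narrowed to a Java byte.
    static byte toByte(int chunk) {
        return Integer.valueOf(chunk).byteValue();
    }

    public static void main(String[] args) {
        System.out.println(toByte(0x7f)); // largest value that stays positive
        System.out.println(toByte(0x80)); // first value that turns negative
        System.out.println(toByte(0xff)); // all bits set
    }
}
```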

    > strBuf.append((char) byChunk);

    This sign-extends the byte and then casts it to an unsigned 16-bit unit (Java char is 16 bits
    wide). For example, 0x80 is -128 as a byte; sign-extended and cast to char, it becomes 0xff80.
    You append this mistreated value to a StringBuffer, which treats it as a UTF-16 code unit.
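
    To make the effect concrete, here is a small sketch with hypothetical names that applies
    the same casts to the two UTF-8 bytes of U+00E9 and compares the result with decoding the
    bytes as UTF-8. It assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class ByteToCharSketch {

    // Reproduces the quoted casts: int -> signed byte -> 16-bit char.
    static char miscast(int chunk) {
        byte b = (byte) chunk;
        return (char) b; // sign-extends first, then keeps the low 16 bits
    }

    public static void main(String[] args) {
        int[] utf8 = {0xC3, 0xA9}; // UTF-8 encoding of U+00E9

        // The casts yield one bogus code unit per input byte.
        for (int chunk : utf8) {
            System.out.println(Integer.toHexString(miscast(chunk)));
        }

        // Decoding the same bytes as UTF-8 yields a single code unit.
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(decoded.charAt(0)));
    }
}
```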

    > }
    > prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
    > }

    markus


