Re: Unicode character transformation through XSLT

From: Yung-Fong Tang (ftang@netscape.com)
Date: Thu Mar 13 2003 - 19:54:29 EST


    I have not touched Java for years (probably 5 years), so I could be wrong.

    Jain, Pankaj (MED, TCS) wrote:

    > Hi ftang/james,
    >
    > Thanks for the detailed explanation; now I know the root problem of my
    > error.
    >
    > I have the following string in the database as a Long, in which the special
    > character (?) is equivalent to ndash (-):
    >
    > E8C ? 6 to 10
    >
    > I am using the following code to write the string from the database to
    > a property file, and in the property file I am getting the following string:
    >
    > value= E8C \uFFE2\uFF80\uFF93 6 to 10
    >
    > As \uFFE2\uFF80\uFF93 is not equivalent to ndash, I am not able to
    > figure out why it appears in the property file.
    >
    > Do I need to specify an encoding such as UTF-8 in my Java program?
    >
    > Please let me know where the problem is.
    >
    > Here is my code..
    >
    > while(rsResult.next())
    >
    > {
    >
    > /*Get the file contents from the value column*/
    >
    > ipStream = rsResult.getBinaryStream("VALUE");
    >
    What is rsResult? A Blob?
    You probably need to wrap the InputStream in a BufferedInputStream and
    a DataInputStream, and use readChar or readUTF from the DataInput
    interface instead. Note that readUTF expects the length-prefixed format
    written by DataOutputStream.writeUTF, so it only fits if the data was
    stored that way.
    See http://www.webdeveloper.com/java/java_jj_read_write.html and
    http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF()
    for more info.
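
    If the VALUE column holds plain UTF-8 bytes (an assumption; the thread
    does not confirm how the data was stored), a Reader-based alternative is
    to let InputStreamReader do the decoding. The following is only a sketch:
    the class name Utf8StreamReader and the helper readUtf8 are hypothetical,
    and a ByteArrayInputStream stands in for the database stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8StreamReader {

    /**
     * Reads the whole stream and decodes it as UTF-8.
     * Assumption: the bytes in the stream really are UTF-8.
     */
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        // InputStreamReader turns multi-byte UTF-8 sequences into Java chars.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for rsResult.getBinaryStream("VALUE") in the quoted code.
        byte[] stored = "E8C \u2013 6 to 10".getBytes(StandardCharsets.UTF_8);
        String value = readUtf8(new ByteArrayInputStream(stored));
        // The en dash arrives as one char rather than three.
        System.out.println(Integer.toHexString(value.charAt(4))); // prints 2013
    }
}
```

    With the value decoded this way, the string handed to setProperty
    contains U+2013 itself rather than three separate chars.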

    > strBuf = new StringBuffer();
    >
    > while((chunk = ipStream.read())!=-1)
    >
    > {
    >
    > byte byChunk = new Integer(chunk).byteValue();
    >
    > strBuf.append((char) byChunk);
    >
    > }
    >
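    Here is your problem: you read it byte by byte. Each byte of the
    UTF-8 sequence is appended as a separate char instead of the three
    bytes being decoded into one Java char. Because byte is signed, casting
    0xE2 to char sign-extends it to \uFFE2, and likewise for 0x80 and 0x93.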
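
    To make the effect of that loop concrete, here is a small sketch that
    feeds the three UTF-8 bytes of the en dash through the same kind of cast
    and compares the result with a proper UTF-8 decode. The class and method
    names (ByteToCharDemo, castEachByte, codeUnits) are made up for
    illustration and are not part of the original program.

```java
import java.nio.charset.StandardCharsets;

final class ByteToCharDemo {

    /** Mirrors the quoted loop: every byte is cast directly to a char. */
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // byte is signed, so 0xE2 is -30; the cast sign-extends it to 0xFFE2.
            sb.append((char) b);
        }
        return sb.toString();
    }

    /** Lists the UTF-16 code units of a string in U+XXXX notation. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] enDash = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};
        System.out.println(codeUnits(castEachByte(enDash)));
        // prints U+FFE2 U+FF80 U+FF93
        System.out.println(codeUnits(new String(enDash, StandardCharsets.UTF_8)));
        // prints U+2013
    }
}
```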

    > prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
    >
    > }
    >
    > /*Write to o/p stream*/
    >
    > //opFile = new FileOutputStream(strFileName+".properties");
    >
    > opFile = new FileOutputStream(strFileName);
    >
    > /*Store the Properties files*/
    >
    > prop.store(opFile, "Resource Bundle created from Database View
    > "+vctView.get(i));
    >
    >
    >
    > Thanks
    >
    > -Pankaj
    >
    >
    > -----Original Message-----
    > From: ftang@netscape.com [mailto:ftang@netscape.com]
    > Sent: Tuesday, March 11, 2003 6:09 PM
    > To: Jain, Pankaj (MED, TCS)
    > Cc: 'jameskass@att.net'; 'unicode@unicode.org'
    > Subject: Re: Unicode character transformation through XSLT
    >
    >
    > Because the following steps got applied to your Unicode data:
    >
    > 1. Convert \u escapes to Unicode -
    >
    >\uFFE2\uFF80\uFF93
    >
    > becomes
    > three Unicode characters -
    >
    >U+FFE2, U+FF80, U+FF93
    >
    > This is OK.
    > 2. A "throw away the high 8 bits" step got applied to your data, so
    > it became 3 bytes
    > E2 80 93
    >
    > 3. And some code treats it as UTF-8 and tries to convert it to UCS2
    > again, so
    >
    > E2 = 1110 0010 and the right most 4 bits 0010 will be used for UCS2
    > 80 = 1000 0000 and the right most 6 bits 00 0000 will be used for UCS2
    > 93 = 1001 0011 and the right most 6 bits 01 0011 will be used for UCS2
    >
    > [0010] [00 0000] [01 0011] = 0010 0000 0001 0011 = 2013
    > U+2013 is EN DASH
    >
    > So... in your code there is something very bad which will
    > corrupt your data.
    > Steps 2 and 3 are very bad. You probably need to find out where
    > they are and remove that code.
    >
    > Read my paper at
    > http://people.netscape.com/ftang/paper/textintegrity.html
    > Your Java code probably has one or two of the bugs listed in my
    > paper.
    >
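
    The bit arithmetic in steps 2 and 3 of the quoted message can be checked
    with a few lines of Java. This is only a sketch of the arithmetic, not
    the code that was actually running in the pipeline; the names
    EnDashRoundTrip, dropHighBits and decodeThreeBytes are hypothetical.

```java
final class EnDashRoundTrip {

    /** Step 2: keep only the low 8 bits of each char. */
    static int[] dropHighBits(String s) {
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = s.charAt(i) & 0xFF;
        }
        return out;
    }

    /** Step 3: combine a three-byte UTF-8 sequence 1110xxxx 10xxxxxx 10xxxxxx. */
    static int decodeThreeBytes(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12)   // low 4 bits of the lead byte
             | ((b1 & 0x3F) << 6)    // low 6 bits of the first continuation byte
             | (b2 & 0x3F);          // low 6 bits of the second continuation byte
    }

    public static void main(String[] args) {
        int[] bytes = dropHighBits("\uFFE2\uFF80\uFF93");      // E2 80 93
        int codePoint = decodeThreeBytes(bytes[0], bytes[1], bytes[2]);
        System.out.println(Integer.toHexString(codePoint));    // prints 2013
    }
}
```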
    > Jain, Pankaj (MED, TCS) wrote:
    >
    >>James,
    >>thanks, it's working for me now.
    >>But I still wonder why \uFFE2\uFF80\uFF93 gives an ndash in
    >>HTML.
    >>If you have any information on this, then please let me know.
    >>
    >>Thanks
    >>-Pankaj
    >>
    >>-----Original Message-----
    >>From: jameskass@att.net [mailto:jameskass@att.net]
    >>Sent: Monday, March 10, 2003 7:59 PM
    >>To: Jain, Pankaj (MED, TCS)
    >>Cc: 'unicode@unicode.org'
    >>Subject: Re: Unicode character transformation through XSLT
    >>
    >>
    >>.
    >>Pankaj Jain wrote,
    >>
    >>
    >>
    >>>My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
    >>>from resource bundle property file which is equivalent to ndash(-) and
    >>>its
    >>>
    >>>
    >>
    >>U+2013 is the ndash (–). It is represented in UTF-8 by three
    >>hex bytes: E2 80 93.
    >>
    >>But, \uFFE2 is fullwidth not sign
    >>\uFF80 is halfwidth katakana letter ta
    >>and \uFF93 is halfwidth katakana letter mo.
    >>
    >>Perhaps the reason you see three question marks is that the font
    >>you are using doesn't support full width and half width characters.
    >>
    >>What happens if you replace your string \uFFE2\uFF80\uFF93 with
    >>\u2013 ?
    >>
    >>Best regards,
    >>
    >>James Kass
    >>.
    >>
    >>
    >>
    >



    This archive was generated by hypermail 2.1.5 : Thu Mar 13 2003 - 20:37:47 EST