RE: Unicode character transformation through XSLT

From: Jain, Pankaj (MED, TCS) (Pankaj.Jain@med.ge.com)
Date: Wed Mar 12 2003 - 12:23:48 EST

    Hi ftang/James,

    Thanks for the detailed explanation. Now I can see the root of my
    error.

    I have the following string in the database, stored as a LONG, in which
    the special character (?) is equivalent to ndash (–):

    E8C ? 6 to 10

    I am using the code below to write the string from the database to a
    property file, and in the property file I get the following string:

    value= E8C \uFFE2\uFF80\uFF93 6 to 10

    Since \uFFE2\uFF80\uFF93 is not equivalent to ndash, I cannot figure out
    why it appears in the property file.

    Do I need to specify an encoding such as UTF-8 in my Java program?

    Please let me know where the problem is.

    Here is my code:

    while (rsResult.next())
    {
        /* Get the file contents from the value column */
        ipStream = rsResult.getBinaryStream("VALUE");
        strBuf = new StringBuffer();
        while ((chunk = ipStream.read()) != -1)
        {
            byte byChunk = new Integer(chunk).byteValue();
            strBuf.append((char) byChunk);
        }
        prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
    }

    /* Write to o/p stream */
    //opFile = new FileOutputStream(strFileName+".properties");
    opFile = new FileOutputStream(strFileName);

    /* Store the Properties files */
    prop.store(opFile, "Resource Bundle created from Database View " + vctView.get(i));

     

     

     

    Thanks

    -Pankaj
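
    For comparison, a minimal sketch of the read loop in the code above next
    to a variant that decodes the bytes as UTF-8. It assumes the VALUE column
    holds UTF-8 bytes, which is not confirmed in this thread. The class name,
    the method names, and the in-memory stream standing in for
    rsResult.getBinaryStream("VALUE") are hypothetical, and StandardCharsets
    and UncheckedIOException need a newer JDK than the one this code was
    written for.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamDecoding {
    // Read the whole stream into a byte array without interpreting it.
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            int chunk;
            while ((chunk = in.read()) != -1) {
                buf.write(chunk);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Same conversion as the original loop: each byte is cast to char on its
    // own. A Java byte is signed, so 0xE2 is -30 and the cast sign-extends it
    // to the code unit 0xFFE2.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Alternative: decode the complete byte sequence once, as UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the database stream (assumed UTF-8).
        byte[] stored = "E8C \u2013 6 to 10".getBytes(StandardCharsets.UTF_8);
        byte[] read = readAll(new ByteArrayInputStream(stored));
        System.out.println(castEachByte(read).length()); // 15 code units
        System.out.println(decodeUtf8(read).length());   // 13 code units
    }
}
```

    With the per-byte cast, the three UTF-8 bytes of the dash turn into three
    separate characters; decoding the whole sequence keeps it as one.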

     

     

     

     

    -----Original Message-----
    From: ftang@netscape.com [mailto:ftang@netscape.com]
    Sent: Tuesday, March 11, 2003 6:09 PM
    To: Jain, Pankaj (MED, TCS)
    Cc: 'jameskass@att.net'; 'unicode@unicode.org'
    Subject: Re: Unicode character transformation through XSLT

    Because the following steps were applied to your Unicode data:

    1. The \u escapes are converted to Unicode, so
    \uFFE2\uFF80\uFF93
    becomes three Unicode characters:
    U+FFE2, U+FF80, U+FF93
    This is OK.
    2. A "throw away the high 8 bits" step is applied to them, so
    they become 3 bytes:
    E2 80 93

    3. Some code treats those bytes as UTF-8 and converts them to UCS-2
    again, so

    E2 = 1110 0010 and the rightmost 4 bits 0010 are used for UCS-2
    80 = 1000 0000 and the rightmost 6 bits 00 0000 are used for UCS-2
    93 = 1001 0011 and the rightmost 6 bits 01 0011 are used for UCS-2

    [0010] [00 0000] [01 0011] = 0010 0000 0001 0011 = 2013
    U+2013 is EN DASH
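
    The three steps above can be sketched in Java as follows. This is only an
    illustration of the arithmetic, not the code that actually runs in your
    system; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class HighByteLoss {
    // Step 2: keep only the low 8 bits of each UTF-16 code unit.
    static byte[] dropHighBytes(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    // Step 3 by hand: join the payload bits of a three-byte UTF-8 sequence
    // (4 bits from the lead byte, 6 bits from each continuation byte).
    static int decodeThreeByteUtf8(byte b0, byte b1, byte b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        String afterStep1 = "\uFFE2\uFF80\uFF93";      // three code units
        byte[] bytes = dropHighBytes(afterStep1);      // E2 80 93
        int codePoint = decodeThreeByteUtf8(bytes[0], bytes[1], bytes[2]);
        System.out.printf("U+%04X%n", codePoint);      // U+2013

        // The library decoder gives the same single character.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\u2013"));  // true
    }
}
```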

    So somewhere in your code there is something very bad that corrupts
    your data.
    Steps 2 and 3 are very bad. You need to find out where they happen
    and remove that code.

    Read my paper at
    http://people.netscape.com/ftang/paper/textintegrity.html
    Your Java code probably has one or two of the bugs listed in my paper.

    Jain, Pankaj (MED, TCS) wrote:

    James,

    Thanks, it's working for me now.

    But I still have a doubt about why \uFFE2\uFF80\uFF93 gives an ndash in
    HTML.

    If you have any information on this, then please let me know.

    Thanks

    -Pankaj

    -----Original Message-----
    From: jameskass@att.net [mailto:jameskass@att.net]
    Sent: Monday, March 10, 2003 7:59 PM
    To: Jain, Pankaj (MED, TCS)
    Cc: 'unicode@unicode.org'
    Subject: Re: Unicode character transformation through XSLT

    Pankaj Jain wrote,

    My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
    from resource bundle property file which is equivalent to ndash(-) and
    its

    U+2013 is the ndash (–). It is represented in UTF-8 by three
    hex bytes: E2 80 93.

    But \uFFE2 is fullwidth not sign,
    \uFF80 is halfwidth katakana letter ta,
    and \uFF93 is halfwidth katakana letter mo.
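
    These byte values and character names can be checked with a short Java
    sketch. It assumes a JDK that provides Character.getName and
    StandardCharsets (Java 7 or later); the class and helper names are
    hypothetical.

```java
import java.nio.charset.StandardCharsets;

class CodePointNames {
    // Format the UTF-8 encoding of a string as space-separated hex bytes.
    static String hexBytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexBytes("\u2013"));
        // Print the Unicode name of each code point under discussion.
        for (int cp : new int[] {0x2013, 0xFFE2, 0xFF80, 0xFF93}) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```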

    Perhaps the reason you see three question marks is that the font
    you are using doesn't support fullwidth and halfwidth characters.

    What happens if you replace your string \uFFE2\uFF80\uFF93 with
    \u2013?
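
    A small sketch of how java.util.Properties treats the two spellings,
    under the assumption that the property file is read with Properties.load
    and written with Properties.store(OutputStream, ...). The class name, the
    method names, and the key "value" are only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertyEscapes {
    // Parse one line of property-file text and return the "value" entry.
    static String loadValue(String propertyLine) {
        try {
            Properties p = new Properties();
            p.load(new StringReader(propertyLine));
            return p.getProperty("value");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Store a single entry and return the text that store() produced.
    static String storeValue(String value) {
        try {
            Properties p = new Properties();
            p.setProperty("value", value);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            p.store(out, null);
            return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as literal file text.
        String good = loadValue("value=E8C \\u2013 6 to 10");
        String bad = loadValue("value=E8C \\uFFE2\\uFF80\\uFF93 6 to 10");
        System.out.println(good.length()); // 13: the dash is one character
        System.out.println(bad.length());  // 15: three separate characters
        System.out.println(storeValue("E8C \u2013 6 to 10"));
    }
}
```

    When the string in memory holds a real U+2013, store() writes it back as
    a single escape rather than three.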

    Best regards,

    James Kass


    This archive was generated by hypermail 2.1.5 : Wed Mar 12 2003 - 13:10:52 EST