RE: Unicode character transformation through XSLT

From: Jain, Pankaj (MED, TCS) (Pankaj.Jain@med.ge.com)
Date: Wed Mar 12 2003 - 12:23:48 EST

Next message: Dominikus Scherkl: "RE: sorting order between win98/xp"

Previous message: Simon Josefsson: "Unicode library that provides versioned Unicode API?"
Maybe in reply to: Jain, Pankaj (MED, TCS): "Unicode character transformation through XSLT"
Next in thread: Pim Blokland: "Re: Unicode character transformation through XSLT"
Reply: Pim Blokland: "Re: Unicode character transformation through XSLT"
Reply: Markus Scherer: "Re: Unicode character transformation through XSLT"
Reply: Yung-Fong Tang: "Re: Unicode character transformation through XSLT"

Hi ftang/james,

Thanks for the detailed explanation. I now see the root problem behind
my error.

I have the following string in the database as a Long, in which the
special character (?) is equivalent to ndash (-):

E8C ? 6 to 10

I am using the following code to write the string from the database to a
property file, and in the property file I get the following string:

value= E8C \uFFE2\uFF80\uFF93 6 to 10

Since \uFFE2\uFF80\uFF93 is not equivalent to ndash, I am not able to
figure out why it appears in the property file.

Do I need to specify an encoding such as UTF-8 in my Java program?

Please let me know where the problem is.

Here is my code:

while(rsResult.next())
{
    /*Get the file contents from the value column*/
    ipStream = rsResult.getBinaryStream("VALUE");
    strBuf = new StringBuffer();
    while((chunk = ipStream.read())!=-1)
    {
        byte byChunk = new Integer(chunk).byteValue();
        strBuf.append((char) byChunk);
    }
    prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
}

/*Write to o/p stream*/
//opFile = new FileOutputStream(strFileName+".properties");
opFile = new FileOutputStream(strFileName);

/*Store the Properties files*/
prop.store(opFile, "Resource Bundle created from Database View "+vctView.get(i));

Thanks

-Pankaj
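
The loop in the message above casts each byte to char one at a time. The following is a minimal sketch of what that cast may be doing, assuming the VALUE column holds the UTF-8 bytes E2 80 93 for the ndash (the message does not state the column's encoding). The class and method names are illustrative and do not come from the original program.

```java
import java.nio.charset.StandardCharsets;

class SignExtensionDemo {
    // Mimics the loop in the message: each byte is cast straight to char.
    static String castEachByte(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            // A Java byte is signed, so 0xE2 is -30 and the cast sign-extends it to U+FFE2.
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Decodes the whole byte sequence with an explicit charset instead.
    static String decodeUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Escapes non-ASCII characters roughly the way a .properties file shows them.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(c < 0x80 ? String.valueOf(c) : String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed database content: the UTF-8 encoding of U+2013.
        byte[] stored = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};
        System.out.println(escape(castEachByte(stored)));
        System.out.println(escape(decodeUtf8(stored)));
    }
}
```

Under that assumption, the first line printed is the FFE2 FF80 FF93 sequence seen in the property file and the second is the single escape for U+2013. If the column uses a different encoding, the charset passed to the String constructor would have to match it.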

-----Original Message-----
From: ftang@netscape.com [mailto:ftang@netscape.com]
Sent: Tuesday, March 11, 2003 6:09 PM
To: Jain, Pankaj (MED, TCS)
Cc: 'jameskass@att.net'; 'unicode@unicode.org'
Subject: Re: Unicode character transformation through XSLT

Because the following steps were applied to your Unicode data:

1. The \u escapes are converted to Unicode:
\uFFE2\uFF80\uFF93
becomes three Unicode characters:
U+FFE2, U+FF80, U+FF93
This is OK.
2. Some code throws away the high 8 bits of each character, so they
become 3 bytes:
E2 80 93

3. Some code then treats those bytes as UTF-8 and tries to convert them
to UCS2 again, so

E2 = 1110 0010 and the right most 4 bits 0010 will be used for UCS2
80 = 1000 0000 and the right most 6 bits 00 0000 will be used for UCS2
93 = 1001 0011 and the right most 6 bits 01 0011 will be used for UCS2

[0010] [00 0000] [01 0011] = 0010 0000 0001 0011 = 2013
U+2013 is EN DASH
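
The bit arithmetic in steps 2 and 3 can be sketched in Java as follows. This is only an illustration of the arithmetic, not the code that actually runs in the application; the class and method names are made up, and the decoder handles only the three-byte form without the other validity checks a real UTF-8 decoder performs.

```java
class Utf8ThreeByteDemo {
    // Step 2: keep only the low 8 bits of each UTF-16 code unit.
    static int[] dropHighBits(String s) {
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = s.charAt(i) & 0xFF;
        }
        return out;
    }

    // Step 3: combine a three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
    static int decodeThreeByte(int b0, int b1, int b2) {
        if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a three-byte UTF-8 sequence");
        }
        // 4 payload bits from the lead byte, 6 from each continuation byte.
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        int[] bytes = dropHighBits("\uFFE2\uFF80\uFF93");
        int codePoint = decodeThreeByte(bytes[0], bytes[1], bytes[2]);
        // Prints: E2 80 93 -> U+2013
        System.out.printf("%02X %02X %02X -> U+%04X%n", bytes[0], bytes[1], bytes[2], codePoint);
    }
}
```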

So somewhere in your code there is something very bad that will corrupt
your data. Steps 2 and 3 are the bad ones. You probably need to find out
where they happen and remove that code.

Read my paper at
http://people.netscape.com/ftang/paper/textintegrity.html
Your Java code probably has one or two of the bugs listed in my paper.

Jain, Pankaj (MED, TCS) wrote:

James,

Thanks, it's working for me now.

But I still don't understand why \uFFE2\uFF80\uFF93 gives an ndash in
HTML. If you have any information on this, then please let me know.

Thanks
-Pankaj

-----Original Message-----
From: jameskass@att.net [mailto:jameskass@att.net]
Sent: Monday, March 10, 2003 7:59 PM
To: Jain, Pankaj (MED, TCS)
Cc: 'unicode@unicode.org'
Subject: Re: Unicode character transformation through XSLT

Pankaj Jain wrote,

My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
from resource bundle property file which is equivalent to ndash(-) and
its

U+2013 is the ndash (–). It is represented in UTF-8 by three
hex bytes: E2 80 93.

But \uFFE2 is fullwidth not sign,
\uFF80 is halfwidth katakana letter ta,
and \uFF93 is halfwidth katakana letter mo.
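
One way to check these claims, as a rough sketch, is to ask Java for the UTF-8 bytes of U+2013 and for the Unicode names of the three characters. The class name and the helper utf8Hex are illustrative only; Character.getName is a standard JDK method (Java 7 and later).

```java
import java.nio.charset.StandardCharsets;

class EnDashBytesDemo {
    // Encodes a string as UTF-8 and renders the bytes as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The en dash encodes to the three bytes E2 80 93.
        System.out.println(utf8Hex("\u2013"));
        // Print the code point and Unicode name of each character in the property value.
        for (char c : "\uFFE2\uFF80\uFF93".toCharArray()) {
            System.out.printf("U+%04X %s%n", (int) c, Character.getName(c));
        }
    }
}
```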

Perhaps the reason you see three question marks is that the font
you are using doesn't support full width and half width characters.

What happens if you replace your string \uFFE2\uFF80\uFF93 with
\u2013 ?

Best regards,

James Kass


This archive was generated by hypermail 2.1.5 : Wed Mar 12 2003 - 13:10:52 EST