RE: CodePage Information

From: Abdij Bhat (Abdij.Bhat@kshema.com)
Date: Thu May 22 2003 - 06:39:24 EDT

Next message: Abdij Bhat: "RE: CodePage Information"

Previous message: Philippe Verdy: "Re: CodePage Information"
Maybe in reply to: Abdij Bhat: "CodePage Information"
Next in thread: Philippe Verdy: "Re: CodePage Information"
Reply: Philippe Verdy: "Re: CodePage Information"

Hi Philippe,
One more thing...
Can you point me to some UTF-8 to UTF-16 converters (decoders/encoders), and
vice versa?

Thanks and Regards,
Abdij Bhat
Kshema Technologies
mailto:abdij.bhat@kshema.com
www.kshema.com
Phone:+91 80 860 3600 (Extension 2102)
Fax: +91 80 860 3372

-----Original Message-----
From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
Sent: Thursday, May 22, 2003 4:01 PM
To: Abdij Bhat
Cc: unicode@unicode.org
Subject: Re: CodePage Information

I forgot to tell you in my previous message that UTF-8 is FULLY compatible
with ASCII (i.e. it encodes all ASCII strings using single bytes equal to
their ASCII/Unicode codepoint).
This is really the encoding you need to support Unicode in a
language-neutral environment...
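
As a quick illustration of that ASCII compatibility, here is a minimal Python
sketch (the sample string is only an assumption for the example):

```python
# A pure-ASCII string encodes to the same bytes in ASCII and in UTF-8.
text = "CodePage 1252"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

same_bytes = ascii_bytes == utf8_bytes
# Each UTF-8 byte equals the code point of the corresponding character.
bytes_match_codepoints = list(utf8_bytes) == [ord(ch) for ch in text]
print(same_bytes, bytes_match_codepoints)
```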
This encoding is so simple that, for Western European languages using the
Windows ANSI 1252 codepage, which is *mostly* compatible with ISO-8859-1
(with the exception of byte codes 0x80 to 0x9F), the conversion of ANSI
characters in range 0xA0..0xFF to UTF-8 takes only simple shift and mask
operations to produce a byte pair:
- characters in (0xA0..0xBF) will be converted to pairs in (0xC2;
0xA0..0xBF)
- characters in (0xC0..0xFF) will be converted to pairs in (0xC3;
0x80..0xBF)
All other Unicode codepoints (above 0xFF) will use other leading bytes in
(0xC4..0xF4) and one to three trailing bytes in (0x80..0xBF).
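
The shift operations for the 0xA0..0xFF range can be sketched in Python as
follows. This is only an illustration; the function name is hypothetical, and
bytes 0x80..0x9F are deliberately rejected because Windows 1252 and ISO-8859-1
differ there:

```python
def latin1_high_to_utf8(byte):
    """Convert one byte in 0xA0..0xFF to its two-byte UTF-8 form.

    In this range Windows 1252 and ISO-8859-1 agree, and the byte value
    equals the Unicode code point.
    """
    if not 0xA0 <= byte <= 0xFF:
        raise ValueError("only bytes 0xA0..0xFF are handled here")
    lead = 0xC0 | (byte >> 6)     # 0xC2 for 0xA0..0xBF, 0xC3 for 0xC0..0xFF
    trail = 0x80 | (byte & 0x3F)  # always lands in 0x80..0xBF
    return bytes([lead, trail])


# 0xE9 is "e with acute accent" in both codepages.
print(latin1_high_to_utf8(0xE9).hex())  # c3a9
```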

Note that you don't need any leading BOM for UTF-8, as byte ordering is not
relevant for this byte encoding scheme, whose ordering is well defined and
fixed.
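
To make the byte-order point concrete, a small Python sketch (the sample
character is an arbitrary choice): UTF-16 has two possible serializations of
the same code unit, while UTF-8 has only one.

```python
ch = "\u00e9"  # e with acute accent

utf16_le = ch.encode("utf-16-le")  # low byte first
utf16_be = ch.encode("utf-16-be")  # high byte first
utf8 = ch.encode("utf-8")          # a single, fixed byte sequence

print(utf16_le.hex(), utf16_be.hex(), utf8.hex())  # e900 00e9 c3a9
```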

The only codepoint in Unicode that generates a null byte in UTF-8 is the
null codepoint (U+0000), i.e. the ASCII NUL character.
If you want to store an ASCII NUL in your serialization without emitting a
null byte (so that null Unicode codepoints are preserved by the encoding,
for example in null-terminated strings), you may make an exception by
escaping it (as Java does internally in the JNI interface).

This is NOT allowed in UTF-8 but is a trivial extension, used in Java's
"modified UTF-8": encode a NULL codepoint with the pair of bytes
(0xC0; 0x80). Java's variant otherwise follows the alternate CESU-8 encoding
(which is an encoding "scheme" "similar" to UTF-8, except that it is derived
from the UTF-16 encoding "form" instead of the UTF-32 encoding "form");
CESU-8 itself still encodes the NULL codepoint as a single 0x00 byte.
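
A minimal Python sketch of this NUL escaping, layered on top of a standard
UTF-8 codec. The function names are hypothetical, and the result is
intentionally NOT valid UTF-8:

```python
def encode_nul_escaped(text):
    """Encode as UTF-8, but write U+0000 as the byte pair (0xC0, 0x80)."""
    # In valid UTF-8 a 0x00 byte can only come from U+0000, so a plain
    # byte replacement is safe.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_escaped(data):
    """Reverse of encode_nul_escaped."""
    # 0xC0 never occurs in valid UTF-8, so the pair is unambiguous.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_nul_escaped("a\x00b")
print(encoded.hex())  # 61c08062
```

The encoded bytes contain no null byte, so they can be stored in a
null-terminated buffer and decoded back without losing the NUL codepoint.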

If you don't need a "strict" decoder and can allow fallbacks, you may
decode UTF-8 and CESU-8 to codepoints with exactly the same decoder function.
The difference between UTF-8 and CESU-8 only appears when encoding Unicode
codepoints outside the BMP (i.e. in 0x10000..0x10FFFF): UTF-8 encodes the
numeric codepoint directly, but CESU-8 first converts it to its intermediate
UTF-16 encoding as a surrogate pair (a leading "code unit" in 0xD800..0xDBFF
called the "high surrogate", and a trailing code unit in 0xDC00..0xDFFF
called the "low surrogate"), and then encodes each surrogate individually
using the same algorithm as UTF-8. A lenient decoder therefore only has to
recombine such surrogate pairs after decoding.
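
The difference can be sketched in Python as below. All function names are
hypothetical, the encoders take a single codepoint with no validation, and the
lenient decoder relies on Python's "surrogatepass" error handler, so treat
this as an illustration rather than a production converter:

```python
def encode_utf8_style(cp):
    """Apply the generic UTF-8 bit layout to one value up to 0x10FFFF."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def to_surrogates(cp):
    """Split a codepoint in 0x10000..0x10FFFF into a UTF-16 surrogate pair."""
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


def encode_cesu8(cp):
    """CESU-8: like UTF-8 inside the BMP, surrogate pairs outside it."""
    if cp < 0x10000:
        return encode_utf8_style(cp)
    high, low = to_surrogates(cp)
    return encode_utf8_style(high) + encode_utf8_style(low)


def decode_lenient(data):
    """Decode UTF-8 or CESU-8 bytes, recombining any surrogate pairs."""
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 merges each high/low pair into one codepoint.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


print(encode_utf8_style(0x10400).hex())  # f0909080
print(encode_cesu8(0x10400).hex())       # eda081edb080
```

Both byte sequences decode to the same codepoint with decode_lenient, which is
the "same decoder function" behaviour described above.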


This archive was generated by hypermail 2.1.5 : Thu May 22 2003 - 07:38:04 EDT