RE: CodePage Information

From: Abdij Bhat ([email protected])
Date: Thu May 22 2003 - 06:33:42 EDT

Next message: Winkler, Arnold F: "RE: More of The Unicode Standard, Version 4.0 available online"

Previous message: Abdij Bhat: "RE: CodePage Information"
Maybe in reply to: Abdij Bhat: "CodePage Information"
Next in thread: Rick McGowan: "Re: CodePage Information"

Hi Philippe,
Thanks for the detailed information. I will try out the UTF-16 to UTF-8 and
back mechanism and share my success with you.

Thanks and Regards,
Abdij Bhat
Kshema Technologies
mailto:[email protected]
www.kshema.com
Phone:+91 80 860 3600 (Extension 2102)
Fax: +91 80 860 3372

-----Original Message-----
From: Philippe Verdy [mailto:[email protected]]
Sent: Thursday, May 22, 2003 4:01 PM
To: Abdij Bhat
Cc: [email protected]
Subject: Re: CodePage Information

I forgot to mention in my previous message that UTF-8 is FULLY compatible
with ASCII (i.e. it encodes every ASCII string using single bytes equal to
their ASCII/Unicode codepoints).
This is really the encoding you need to support Unicode in a
language-neutral environment...
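
A quick way to check the ASCII compatibility claim is to compare the two
encodings over the whole 7-bit range. This is only a minimal sketch; the
variable names are illustrative and not from any particular library.

```python
# All 128 ASCII codepoints, as one string.
ascii_text = "".join(chr(cp) for cp in range(0x80))

# Encoding the same text as ASCII and as UTF-8 yields identical bytes.
as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")

same_bytes = as_ascii == as_utf8                 # byte-for-byte identical
one_byte_each = len(as_utf8) == len(ascii_text)  # one byte per character
```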
This encoding is so simple that for Western European languages, using the
Windows ANSI 1252 codepage, which is *mostly* compatible with ISO-8859-1
(with the exception of byte codes 0x80 to 0x9F), the conversion of ANSI
characters in range 0xA0..0xFF to UTF-8 can be done with simple shift
operations that produce a byte pair:
- characters in (0xA0..0xBF) will be converted to pairs in (0xC2;
0xA0..0xBF)
- characters in (0xC0..0xFF) will be converted to pairs in (0xC3;
0x80..0xBF)
Codepoints above U+00FF will use other leading bytes in (0xC4..0xF4)
and one to 3 trailing bytes in (0x80..0xBF).
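
The shift-and-mask conversion for the 0xA0..0xFF range can be sketched as
follows. The function name ansi_high_to_utf8 is made up for this example,
and the sketch deliberately refuses bytes outside that range (0x80..0x9F
needs a real Windows-1252 lookup table instead).

```python
def ansi_high_to_utf8(byte: int) -> bytes:
    """Convert one ANSI 1252 / ISO-8859-1 byte in 0xA0..0xFF to UTF-8."""
    if not 0xA0 <= byte <= 0xFF:
        raise ValueError("this shortcut only covers bytes 0xA0..0xFF")
    lead = 0xC0 | (byte >> 6)     # 0xC2 for 0xA0..0xBF, 0xC3 for 0xC0..0xFF
    trail = 0x80 | (byte & 0x3F)  # low six bits, marked as a trailing byte
    return bytes([lead, trail])


# 0xE9 is "e with acute accent" in both codepages.
e_acute = ansi_high_to_utf8(0xE9)
```

Here e_acute holds the two bytes (0xC3; 0xA9), which is what a regular
UTF-8 encoder produces for U+00E9.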

Note that you don't need any leading BOM for UTF-8, as byte ordering is not
relevant for this byte encoding scheme, whose ordering is well defined and
fixed.

The only codepoint in Unicode that generates a NUL byte (0x00) in UTF-8 is
the null codepoint U+0000, i.e. the ASCII NUL character.
If you want to store an ASCII NUL in your serialization (so that null Unicode
codepoints will be preserved by the encoding), you may make an exception by
escaping it (as Java does internally in the JNI interface).

This is NOT allowed in UTF-8 but is a trivial extension, used in Java's
"modified UTF-8", which otherwise matches the alternate CESU-8 encoding (an
encoding "scheme" "similar" to UTF-8, except that it is derived from the
UTF-16 encoding "form" instead of the UTF-32 encoding "form"): encode a NUL
codepoint with the pair of bytes (0xC0; 0x80).
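
As a rough illustration of this escape, the sketch below wraps a standard
UTF-8 codec. The two function names are hypothetical. The byte replacement
is safe only because valid UTF-8 never contains 0xC0 and contains 0x00 only
for U+0000.

```python
def encode_nul_escaped(text: str) -> bytes:
    """UTF-8 encode, but write U+0000 as the pair (0xC0; 0x80)."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_escaped(data: bytes) -> str:
    """Undo the escape, then decode with a strict UTF-8 decoder."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


sample = "a\x00b"
escaped = encode_nul_escaped(sample)      # no zero byte in the output
restored = decode_nul_escaped(escaped)    # original string, NUL included
```

A strict UTF-8 decoder given the escaped bytes directly would reject them,
which is why the decoding side has to undo the escape first.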

If you don't need a "strict" decoder and can allow fallbacks, you may
decode UTF-8 and CESU-8 to codepoints with exactly the same decoder function.
The difference between UTF-8 and CESU-8 only appears when encoding Unicode
codepoints outside the BMP (i.e. in 0x10000..0x10FFFF): UTF-8 encodes the
numeric codepoint directly, but CESU-8 first uses the intermediate UTF-16
encoding of these characters as surrogate pairs (with a leading "code unit"
in 0xD800..0xDBFF called the "high surrogate", and a trailing code unit in
0xDC00..0xDFFF called the "low surrogate"), and then encodes these
surrogates individually using the same algorithm as UTF-8.
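
To make the difference concrete, here is a small sketch that encodes one
supplementary codepoint the CESU-8 way. The helper names are invented for
this example, and only codepoints in 0x10000..0x10FFFF are handled.

```python
def utf8_three_bytes(unit: int) -> bytes:
    """Encode a 16-bit value in 0x0800..0xFFFF with the 3-byte UTF-8 pattern."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def cesu8_supplementary(cp: int) -> bytes:
    """CESU-8 bytes for one codepoint outside the BMP."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a codepoint in 0x10000..0x10FFFF")
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)    # high surrogate, 0xD800..0xDBFF
    low = 0xDC00 | (offset & 0x3FF)   # low surrogate, 0xDC00..0xDFFF
    # Each surrogate is encoded on its own, giving 6 bytes in total.
    return utf8_three_bytes(high) + utf8_three_bytes(low)


cesu8_bytes = cesu8_supplementary(0x10400)   # two 3-byte sequences
utf8_bytes = chr(0x10400).encode("utf-8")    # one 4-byte sequence
```

For U+10400 the CESU-8 result is (0xED; 0xA0; 0x81; 0xED; 0xB0; 0x80),
while UTF-8 gives (0xF0; 0x90; 0x90; 0x80).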


This archive was generated by hypermail 2.1.5 : Thu May 22 2003 - 07:42:19 EDT