Re: CodePage Information

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 22 2003 - 06:31:22 EDT


    I forgot to mention in my previous message that UTF-8 is FULLY compatible with ASCII (i.e. it encodes every ASCII character as a single byte equal to its ASCII/Unicode codepoint).
    This is really the encoding you need to support Unicode in a language-neutral environment...
    This encoding is so simple that, for Western European languages using the Windows ANSI 1252 codepage (which is *mostly* compatible with ISO-8859-1, with the exception of byte codes 0x80 to 0x9F), converting ANSI characters in the range 0xA0..0xFF to UTF-8 takes only simple shift operations to produce a byte pair:
    - characters in (0xA0..0xBF) will be converted to pairs in (0xC2; 0xA0..0xBF)
    - characters in (0xC0..0xFF) will be converted to pairs in (0xC3; 0x80..0xBF)
    Codepoints U+0080..U+009F use the pair (0xC2; 0x80..0x9F). All codepoints from U+0100 upward use other leading bytes in (0xC4..0xF4), followed by one to three trailing bytes in (0x80..0xBF).
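
    As an illustration of the shift operations described above, here is a minimal sketch in Python. The function name latin1_to_utf8 is only an illustrative choice. The sketch assumes the input is ISO-8859-1, so bytes 0x80..0x9F are treated as the C1 control codes; for genuine Windows 1252 data those bytes need a lookup table instead.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert ISO-8859-1 bytes to UTF-8 using only shifts and masks.

    Assumption: bytes 0x80..0x9F are ISO-8859-1 C1 controls, not the
    Windows 1252 printable characters that occupy the same range.
    """
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                  # ASCII: copied unchanged
        else:
            out.append(0xC0 | (b >> 6))    # leading byte: 0xC2 or 0xC3
            out.append(0x80 | (b & 0x3F))  # trailing byte: 0x80..0xBF
    return bytes(out)
```

    For example, latin1_to_utf8(b"caf\xe9") gives the same bytes as "café".encode("utf-8").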

    Note that you don't need any leading BOM for UTF-8, as byte ordering is not relevant for this byte encoding scheme, whose ordering is well defined and fixed.

    The only codepoint in Unicode that generates a null byte in UTF-8 is the null codepoint U+0000, i.e. the ASCII NUL character.
    If you need to keep null bytes out of your serialization while still preserving null Unicode codepoints, you may make an exception and escape them (as Java does internally in the JNI interface).

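    This is NOT allowed in UTF-8 (the pair is an overlong form that a strict decoder must reject), but it is a trivial extension: encode a NULL codepoint with the pair of bytes (0xC0; 0x80). Java's "modified UTF-8" does this; it otherwise matches the alternate CESU-8 encoding (an encoding "scheme" "similar" to UTF-8, except that it is derived from the UTF-16 encoding "form" instead of the UTF-32 encoding "form"). CESU-8 itself, as specified, still encodes the null codepoint as the single byte 0x00.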
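
    A minimal sketch of this NUL escaping in Python follows. The function names are hypothetical. It relies on two facts: in valid UTF-8 the byte 0x00 only ever encodes U+0000, and the byte 0xC0 never occurs, so both replacements are unambiguous. It does not reproduce the other features of Java's modified UTF-8.

```python
def encode_nul_escaped(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the pair 0xC0 0x80.

    This is a non-standard extension: the output is NOT valid UTF-8
    whenever the input contains a null codepoint.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xC0\x80")


def decode_nul_escaped(data: bytes) -> str:
    """Reverse encode_nul_escaped: map 0xC0 0x80 back to a null byte,
    then decode the result as ordinary (strict) UTF-8."""
    return data.replace(b"\xC0\x80", b"\x00").decode("utf-8")
```

    The encoded form contains no null byte, so it can be stored in a NUL-terminated buffer, and decode_nul_escaped restores the original string.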

    If you don't need a "strict" decoder and can allow fallbacks, you may decode UTF-8 and CESU-8 to codepoints with exactly the same decoder function. The difference between UTF-8 and CESU-8 only appears when encoding Unicode codepoints outside the BMP (i.e. in 0x10000..0x10FFFF): UTF-8 encodes the numeric codepoint directly, as 4 bytes, whereas CESU-8 first takes the UTF-16 encoding of the character as a surrogate pair (a leading "code unit" in 0xD800..0xDBFF called the "high surrogate", and a trailing code unit in 0xDC00..0xDFFF called the "low surrogate"), and then encodes each surrogate individually with the same algorithm as UTF-8, giving 3 bytes each (6 bytes in total).
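
    To make the difference concrete, here is a rough sketch in Python of a CESU-8 encoder and of one lenient decoder that accepts both forms. The names encode_cesu8 and decode_lenient are hypothetical. The decoder leans on Python's "surrogatepass" error handler, which is just one convenient way to get this behaviour; it still rejects other malformed input such as the (0xC0; 0x80) pair.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode text as CESU-8: BMP codepoints as in UTF-8, supplementary
    codepoints as two 3-byte sequences (one per UTF-16 surrogate).

    Assumption: the input contains no lone surrogates.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")
            continue
        v = cp - 0x10000
        high = 0xD800 | (v >> 10)    # high surrogate, 0xD800..0xDBFF
        low = 0xDC00 | (v & 0x3FF)   # low surrogate, 0xDC00..0xDFFF
        for unit in (high, low):
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8 or CESU-8 input with the same function."""
    # Step 1: accept 3-byte encoded surrogates as well as 4-byte sequences.
    s = data.decode("utf-8", "surrogatepass")
    # Step 2: a round trip through UTF-16 joins each surrogate pair
    # into a single supplementary codepoint.
    return s.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

    For U+10000, UTF-8 gives the 4 bytes F0 90 80 80, while encode_cesu8 gives the 6 bytes ED A0 80 ED B0 80; decode_lenient maps both back to the same codepoint.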



    This archive was generated by hypermail 2.1.5 : Thu May 22 2003 - 07:30:55 EDT