RE: CodePage Information

From: Abdij Bhat (Abdij.Bhat@kshema.com)
Date: Thu May 22 2003 - 06:39:24 EDT


    Hi Philippe,
     One more thing:
     Can you point me to some UTF-8 to UTF-16 converters (decoders/encoders),
    and vice versa?

    Thanks and Regards,
    Abdij Bhat
    Kshema Technologies
    mailto:abdij.bhat@kshema.com
    www.kshema.com
    Phone:+91 80 860 3600 (Extension 2102)
    Fax: +91 80 860 3372

    -----Original Message-----
    From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
    Sent: Thursday, May 22, 2003 4:01 PM
    To: Abdij Bhat
    Cc: unicode@unicode.org
    Subject: Re: CodePage Information

    I forgot to mention in my previous message that UTF-8 is FULLY compatible
    with ASCII (i.e. it encodes all ASCII strings using single bytes equal to
    their ASCII/Unicode codepoint).
    This is really the encoding you need to support Unicode in a
    language-neutral environment...
    This encoding is so simple that, for Western European languages using the
    Windows ANSI 1252 codepage (which is *mostly* compatible with ISO-8859-1,
    with the exception of byte codes 0x80 to 0x9F), the conversion of ANSI
    characters in the range 0xA0..0xFF to UTF-8 can be done with simple shift
    operations that produce a pair of bytes:
    - characters in (0xA0..0xBF) will be converted to pairs in (0xC2;
    0xA0..0xBF)
    - characters in (0xC0..0xFF) will be converted to pairs in (0xC3;
    0x80..0xBF)
    All other Unicode codepoints (above U+00FF) will use other leading bytes in
    (0xC4..0xF4), followed by one to three trailing bytes in (0x80..0xBF).
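
The shift operations described above can be sketched as follows. This is a
minimal illustration in Python, not a complete converter: the function name
latin1_high_byte_to_utf8 is hypothetical, and it only handles the range
0xA0..0xFF, where Windows-1252 and ISO-8859-1 agree.

```python
def latin1_high_byte_to_utf8(b: int) -> bytes:
    """Convert one byte in 0xA0..0xFF to its two-byte UTF-8 sequence."""
    if not 0xA0 <= b <= 0xFF:
        raise ValueError("expected a byte value in 0xA0..0xFF")
    # Top two bits go into the leading byte: 0xC2 for 0xA0..0xBF,
    # 0xC3 for 0xC0..0xFF.
    lead = 0xC0 | (b >> 6)
    # Low six bits go into a trailing byte in 0x80..0xBF.
    trail = 0x80 | (b & 0x3F)
    return bytes([lead, trail])


# Example: 0xE9 (e with acute accent) becomes the pair (0xC3; 0xA9).
pair = latin1_high_byte_to_utf8(0xE9)
```

Bytes in 0x80..0x9F are deliberately rejected here, since that is where the
1252 codepage departs from ISO-8859-1 and a lookup table would be needed.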

    Note that you don't need any leading BOM for UTF-8, as byte ordering is not
    relevant for this byte encoding scheme, whose ordering is well defined and
    fixed.
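
To illustrate why byte order matters for UTF-16 but not for UTF-8, here is a
small sketch using Python's built-in codecs. The variable names are only
illustrative.

```python
import codecs

text = "\u00e9"  # one codepoint, U+00E9

# UTF-16 has two possible byte orders, so a BOM (or an explicit label)
# is needed to tell them apart.
utf16_le = text.encode("utf-16-le")  # low byte first
utf16_be = text.encode("utf-16-be")  # high byte first

# UTF-8 has a single, fixed byte sequence and the encoder emits no BOM.
utf8 = text.encode("utf-8")
has_bom = utf8.startswith(codecs.BOM_UTF8)
```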

    The only codepoint in Unicode that generates a NULL byte in UTF-8 is the
    null codepoint for the NULL ASCII character.
    If you need to store a NUL ASCII character in a serialization that treats a
    zero byte as a terminator (so that null Unicode codepoints will be preserved
    by the encoding), you may make an exception by escaping it (as Java does
    internally in the JNI interface).

    This is NOT allowed in UTF-8, but it is a trivial extension, used in Java's
    "modified UTF-8", which otherwise follows the alternate CESU-8 encoding (an
    encoding "scheme" "similar" to UTF-8, except that it is derived from the
    UTF-16 encoding "form" instead of the UTF-32 encoding "form"): encode a NULL
    codepoint with the pair of bytes (0xC0; 0x80).
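
The (0xC0; 0x80) escape described above can be sketched on top of a standard
UTF-8 codec. The two function names are hypothetical, and this is only one
possible way to do it. It relies on two facts: in valid UTF-8 a zero byte
only ever comes from the NULL codepoint, and the byte 0xC0 never appears at
all, so the substitution is unambiguous in both directions.

```python
def encode_with_escaped_nul(text: str) -> bytes:
    """Encode to UTF-8, then replace each zero byte with (0xC0; 0x80)."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_with_escaped_nul(data: bytes) -> str:
    """Undo the escape, then decode with a strict UTF-8 decoder."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_with_escaped_nul("a\x00b")   # contains no zero byte
decoded = decode_with_escaped_nul(encoded)    # the original string again
```

The result is no longer valid UTF-8, so it should only be exchanged with
software that expects this convention.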

    If you don't want to use a "strict" decoder and are willing to allow
    fallbacks, you may decode UTF-8 and CESU-8 to codepoints with exactly the
    same decoder function. The difference between UTF-8 and CESU-8 only appears
    when encoding Unicode codepoints outside the BMP (i.e. in
    0x10000..0x10FFFF): UTF-8 encodes the numeric codepoint directly, whereas
    CESU-8 uses the intermediate encoding in UTF-16 of these characters as
    surrogate pairs (with a leading "code unit" in 0xD800..0xDBFF called the
    "high surrogate", and a trailing code unit in 0xDC00..0xDFFF called the
    "low surrogate"), and then encodes these surrogates individually using the
    same algorithm as UTF-8.
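
As a rough sketch of the difference described above, the following Python
code encodes supplementary codepoints the CESU-8 way and decodes either form
with one lenient function. The names encode_cesu8 and decode_utf8_or_cesu8
are hypothetical. The decoder leans on Python's "surrogatepass" error handler
and a round trip through UTF-16 to join surrogate pairs, which is just one
convenient way to do it.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode text as CESU-8: BMP codepoints as in UTF-8, others as
    two separately encoded UTF-16 surrogates (3 bytes each)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")
            continue
        v = cp - 0x10000
        high = 0xD800 | (v >> 10)    # high surrogate, 0xD800..0xDBFF
        low = 0xDC00 | (v & 0x3FF)   # low surrogate, 0xDC00..0xDFFF
        for unit in (high, low):
            out += bytes([0xE0 | (unit >> 12),
                          0x80 | ((unit >> 6) & 0x3F),
                          0x80 | (unit & 0x3F)])
    return bytes(out)


def decode_utf8_or_cesu8(data: bytes) -> str:
    """Lenient decoder accepting both UTF-8 and CESU-8 input."""
    # Let encoded surrogates through instead of rejecting them...
    s = data.decode("utf-8", "surrogatepass")
    # ...then join surrogate pairs by passing through UTF-16.
    return s.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


utf8_form = "\U00010000".encode("utf-8")   # 4 bytes
cesu8_form = encode_cesu8("\U00010000")    # 6 bytes
```

Both forms decode to the same codepoint with decode_utf8_or_cesu8, while a
strict UTF-8 decoder would reject the CESU-8 form.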


