Re: CodePage Information

From: Doug Ewell (dewell@adelphia.net)
Date: Thu May 22 2003 - 11:55:19 EDT

    Philippe Verdy <verdy_p at wanadoo dot fr> wrote:

    > Use the very simple UTF-8 encoding scheme (or the even more compact
    > SCSU or BOCU-8 encodings)

    It's called BOCU-1...

    > to serialize Unicode strings to your device,
    > and you won't suffer from the NULL byte problem you have with byte
    > encoding schemes derived from UTF-16 (the encoding you probably use in
    > your GUI)...

    SCSU doesn't guarantee that the resulting byte stream won't contain
    0x00. The 0x00 byte isn't used as a tag in single-byte mode, so runs
    of text in small-alphabet scripts won't contain a NULL. But Abdij
    said his application had to support Chinese too, and Chinese text
    encoded in SCSU would be in "Unicode mode" and thus could contain
    0x00 bytes.

    Later:

    > All other Unicode codepoints will use other leading codes in
    > (0xC4..0xBF) and one to 3 trailing bytes in (0x80..0xBF).

    The range of leading byte values goes up to 0xF4.
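    For reference, here is how the lead byte falls out of the code point
    (a sketch of the ranges defined in RFC 3629; the function name is
    made up for this note):

        def utf8_lead_byte(cp):
            if cp <= 0x7F:
                return cp                     # 1 byte:  0x00..0x7F
            if cp <= 0x7FF:
                return 0xC0 | (cp >> 6)       # 2 bytes: lead 0xC2..0xDF
            if cp <= 0xFFFF:
                return 0xE0 | (cp >> 12)      # 3 bytes: lead 0xE0..0xEF
            if cp <= 0x10FFFF:
                return 0xF0 | (cp >> 18)      # 4 bytes: lead 0xF0..0xF4
            raise ValueError("beyond the Unicode code space")

        print(hex(utf8_lead_byte(0x10FFFF)))  # 0xf4, the highest legal lead byte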

    > Note that you don't need any leading BOM for UTF-8, as byte ordering
    > is not relevant for this byte encoding scheme, whose ordering is well
    > defined and fixed.

    This is only half the story. Some processes find it useful to prefix
    UTF-8 text with a U+FEFF signature, not because of byte ordering, but so
    the text can be easily identified as UTF-8. Recent versions of Windows
    Notepad use the UTF-8 and UTF-16 signatures to distinguish those
    encodings from 8-bit "ANSI" code pages.
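    A sniffing routine along these lines can tell them apart (a sketch;
    the fallback code page is an assumption):

        SIGNATURES = [
            (b"\xef\xbb\xbf", "utf-8"),
            (b"\xff\xfe",     "utf-16-le"),
            (b"\xfe\xff",     "utf-16-be"),
        ]

        def sniff_encoding(data, default="cp1252"):
            # Check for the U+FEFF signature in each encoding form.
            for sig, name in SIGNATURES:
                if data.startswith(sig):
                    return name
            return default   # no signature: assume an 8-bit "ANSI" code page

        print(sniff_encoding(b"\xef\xbb\xbfhello"))   # utf-8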

    It should be noted that many processes do NOT want to see a leading
    U+FEFF on UTF-8 files. In particular, Unix and Linux systems may expect
    specific bytes at the beginning of files (the "#!" that introduces a
    script, for example), and the signature would break that. HTML and XML
    files also should not contain a signature. So it should not be used
    indiscriminately.

    My point is that the use or non-use of U+FEFF in UTF-8 files has nothing
    to do with byte order. This is a frequently misunderstood point.

    > If you want to store a NUL ASCII in your serialization (so that null
    > Unicode codepoints will be preserved by the encoding), you may use an
    > exception, by escaping it (as Java does internally in the JNI
    > interface).
    >
    > This is NOT allowed in UTF-8 but is a trivial extension, used also in
    > the alternate CESU-8 encoding (which is an encoding "scheme" "similar"
    > to UTF-8, except that it is derived from the UTF-16 encoding "form",
    > instead of the UTF-32 encoding "form"): encode a NULL codepoint with
    > the pair of bytes (0xC0; 0x80).

    This is not allowed anywhere except in internal processing, where
    anything goes. Do not recommend this. (Fortunately, the issue seldom
    comes up in the real world because most people don't need to store
    U+0000 in plain text files.)
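    Any conformant UTF-8 decoder rejects the overlong pair, which is easy
    to demonstrate (Python's strict decoder shown here):

        try:
            b"\xc0\x80".decode("utf-8")       # overlong encoding of U+0000
        except UnicodeDecodeError as e:
            print("rejected:", e)             # strict decoders refuse it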

    > If you don't want to use a "strict" decoder and allow fallbacks, you
    > may decode UTF-8 and CESU-8 to codepoints with exactly the same
    > decoder function (the difference between UTF-8 and CESU-8 only appears
    > when encoding Unicode codepoints out of the BMP (i.e. in
    > 0x10000..0x10FFFF): UTF-8 uses the numeric codepoints directly before
    > encoding it, but CESU-8 uses the intermediate encoding in UTF-16 of
    > these characters as surrogate pairs (with a leading "code unit" in
    > 0xD800..0xDBFF called "high surrogate", and a trailing code unit in
    > 0xDC00..0xDFFF called "low surrogate"), and then CESU-8 encodes these
    > surrogates individually using the same algorithm as UTF-8.

    Handling of supplementary characters in UTF-8 and CESU-8 is mutually
    exclusive. What is legal in UTF-8 is expressly illegal in CESU-8, and
    vice versa. You cannot use the same code to serve both purposes, unless
    you include a flag to indicate whether the encoder and decoder are in
    "UTF-8 mode" or "CESU-8 mode."

    And no, you can't get around this by "ignoring surrogates," as some
    people still believe. Supplementary characters are full members of the
    Unicode code space.
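    The difference is easy to see with one supplementary character
    (cesu8_encode below is a helper written just for this illustration):

        def cesu8_encode(text):
            out = bytearray()
            for cp in map(ord, text):
                if cp > 0xFFFF:
                    # CESU-8: split into a UTF-16 surrogate pair first,
                    # then encode each surrogate as a 3-byte sequence.
                    cp -= 0x10000
                    for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                        out += bytes([0xE0 | (unit >> 12),
                                      0x80 | ((unit >> 6) & 0x3F),
                                      0x80 | (unit & 0x3F)])
                else:
                    out += chr(cp).encode("utf-8")
            return bytes(out)

        ch = "\U00010000"
        print(ch.encode("utf-8").hex(" "))   # f0 90 80 80 (4 bytes, legal UTF-8)
        print(cesu8_encode(ch).hex(" "))     # ed a0 80 ed b0 80 (6 bytes,
                                             #   illegal in UTF-8, required in CESU-8)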

    Still later:

    > For higher compression of Unicode strings, but that does not preserve
    > the ASCII encoding, look into SCSU and BOCU specifications in UTS.
    > They are more complex and so far implemented in few software packages.

    SCSU absolutely does "preserve the ASCII encoding," except for some
    infrequently used control characters. Please check the facts before
    making statements like this.
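    A sketch of the pass-through set from UTS #6: in the initial
    single-byte mode, NUL, tab, LF, CR and everything in 0x20..0x7F stand
    for themselves, while the remaining C0 controls are SCSU tag bytes.

        PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D} | set(range(0x20, 0x80))

        def is_scsu_transparent(s):
            # True if s is byte-for-byte identical to its SCSU encoding.
            return all(ord(c) in PASS_THROUGH for c in s)

        print(is_scsu_transparent("Hello, world!\r\n"))   # True
        print(is_scsu_transparent("bell\x07"))            # False: 0x07 is an SCSU tag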

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


