Re: CodePage Information

From: Kenneth Whistler
Date: Thu May 22 2003 - 21:13:35 EDT


    > From: "Kenneth Whistler" <>
    > > So Doug is correct. 0xC0 0x80 is not a permissible representation
    > > of U+0000 in UTF-8, and it is bad advice to recommend to people
    > > that they should use it.

    Philippe Verdy retorted:

    > This is not what I said or meant.

    What you said and (I presume) meant was:

    "... encode a NULL codepoint with the pair of bytes (0xC0; 0x80)."

    I am not claiming that you are claiming that that is legal in
    UTF-8. You stated it was a "trivial extension" of UTF-8.

    I am claiming that that is bad advice.

    > The main reason why the 0x00 byte causes problems is because it
    > is most often used as a string terminator, unlike what ASCII or
    > Unicode defines for the NULL character.

    No. The reason why the 0x00 byte causes problems is because
    people who have not sufficiently familiarized themselves with
    the structure of the standard assume that they can treat
    byte-serialized UTF-16 with standard C string API's (and
    similar protocols). And when they discover that a UTF-16
    byte serialization uses all byte values, including 0x00, they
    tend to assert that that is a problem with Unicode instead
    of being a problem with their choice of encoding schemes.

    As you originally indicated, choice of UTF-8 as an encoding scheme
    deals with this problem. It deals with it because U+0000
    is represented as 0x00, and 0x00 never appears in a valid
    UTF-8 byte serialization meaning anything other than U+0000.
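    That property can be sketched in C (the function name and its
    reduced scope are mine, not from any standard library): a strict
    UTF-8 well-formedness check accepts the single byte 0x00 as
    U+0000, but rejects 0xC0 as a lead byte outright, because any
    two-byte sequence it begins would decode below U+0080:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Returns the number of bytes consumed by one well-formed UTF-8
     * sequence starting at s, or 0 if the sequence is ill-formed.
     * Lead bytes 0xC0 and 0xC1 are rejected unconditionally: any
     * two-byte sequence they start is an overlong encoding, so
     * <0xC0 0x80> can never be a valid spelling of U+0000. */
    size_t utf8_sequence_len(const uint8_t *s, size_t n) {
        if (n == 0) return 0;
        uint8_t b = s[0];
        if (b < 0x80) return 1;             /* U+0000..U+007F, including NUL */
        if (b < 0xC2) return 0;             /* stray continuation or overlong lead */
        if (b < 0xE0) {                     /* two-byte sequence: U+0080..U+07FF */
            if (n < 2 || (s[1] & 0xC0) != 0x80) return 0;
            return 2;
        }
        /* (three- and four-byte cases elided for brevity) */
        return 0;
    }
    ```

    So a validating reader sees 0x00 only where the text really
    contains U+0000, which is exactly what makes UTF-8 safe for
    byte-oriented protocols.
    
    
    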

    > In this case, one cannot encode it because the device or protocol
    > does not support sending a separate length specifier and needs
    > the 0x00 to terminate the string, and thus a NULL character in
    > a Unicode string could not be encoded even if it's needed.

    Or an *ASCII* string! You are missing the main point that
    this has nothing to do with Unicode.

    If I use only 8-bit ASCII data with C string API's, then
    I am similarly precluded from embedding the ASCII NUL (0x00)
    character in a string, because of the string termination
    convention used in C. In that case I don't have some
    option to "escape" the NUL to 0xC0 0x80 to use it embedded
    in the string.

    Well, 8-bit ASCII data is also valid UTF-8 data, and it
    should behave no differently. If I have a NUL character,
    it should be represented as 0x00, just as it is in ASCII.

    If I have such character data in an array, with a NUL
    character in it, well, I obviously can't just point that
    array at strlen() and get the right results. But that
    is no different for ASCII than for UTF-8.
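    A minimal C illustration (the function is mine, for demonstration
    only): strlen() reports only the prefix before the first embedded
    NUL, and it does so identically whether the bytes are called
    ASCII or UTF-8.

    ```c
    #include <stddef.h>
    #include <string.h>

    /* With a NUL embedded in the content, strlen() sees only the
     * prefix. This is purely a C string-termination issue; the byte
     * values are valid ASCII and valid UTF-8 alike. */
    size_t reported_length(void) {
        static const char data[] = {'a', 'b', '\0', 'c', 'd'};  /* 5 content bytes */
        return strlen(data);  /* stops at the first 0x00, yielding 2 */
    }
    ```

    The real length (5) has to travel alongside the buffer; no
    encoding trick changes that.
    
    
    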

    > This is the case where an escaping mechanism, using other
    > unused parts of UTF-8 can make sense,

    It never makes sense to "use other unused parts of UTF-8".
    The UTC has gone to rather extreme lengths to keep ad hoc
    "trivial extensions" of UTF-8 from being promulgated, so
    as to preserve the interoperability of all UTF-8 data.

    > and I don't think that Sun made an error when using such
    > escaping mechanism to allow sending strings containing a
    > significant NULL character through JNI (and at the time
    > when Sun used it for Java, it was a valid and compliant
    > UTF-8 encoding for that character,

    This is an erroneous claim. <0xC0 0x80> has *never* been a
    "valid and compliant UTF-8 encoding" for U+0000.
    ISO/IEC 10646-1:1993/AMD.2: 1994 (E), which added UTF-8 to
    10646, clearly maps U+0000 to UTF-8 octet 0x00, and
    disallows <0xC0 0x80> as the UTF-8 mapping of any UCS
    code position. <0xC0 0x80> would be a "malformed sequence"
    by P.7 in that Amendment. Even the predecessor of UTF-8,
    published in Unicode 1.1 as "FSS-UTF", clearly stated:

      "When there are multiple ways to encode a value, for
       example U+0000, only the shortest encoding is legal."
          UTR #4, The Unicode Standard, Version 1.1, p. 28 (1993)

    It was wishful thinking, on the part of some implementers,
    that it would be o.k. to use non-shortest forms of UTF-8
    to represent characters. And people took shortcuts in
    programming their UTF-8 decoders, because it is easier to
    just let the algorithm bit-shift <0xC0 0x80> to U+0000 than
    to range check and raise an exception for the illegal
    sequences. The wording in Unicode 2.0 and Unicode 3.0
    unfortunately encouraged such short-cuts, but because of
    the trouble that has caused, the UTC clamped down even on those
    laxities and stated that it means what it says about non-shortest
    forms being disallowed.
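    The shortcut and its remedy can be shown side by side in C (a
    sketch, with a function name of my own): the bit-shift arithmetic
    happily maps <0xC0 0x80> to 0, and only the added range check
    makes the decoder conformant.

    ```c
    #include <stdint.h>

    /* Decode one two-byte UTF-8 sequence. The naive bit-shift alone
     * would map <0xC0 0x80> to U+0000; a conformant decoder must add
     * the shortest-form range check, because two-byte sequences may
     * only encode U+0080..U+07FF. Returns the code point, or -1 for
     * an ill-formed sequence. */
    int32_t decode_two_byte(uint8_t b1, uint8_t b2) {
        if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80) return -1;
        int32_t cp = ((int32_t)(b1 & 0x1F) << 6) | (b2 & 0x3F);
        if (cp < 0x80) return -1;  /* overlong: rejects <0xC0 0x80> et al. */
        return cp;
    }
    ```

    The lazy decoders omitted exactly that last comparison, which is
    why the non-shortest forms slipped through for so long.
    
    
    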

    So I return to my original point: it is bad advice to encourage
    people to use <0xC0 0x80> to represent U+0000 as an extension
    of UTF-8.

    > and I see no good reason why Sun would change this without
    > breaking the ascending compatibility of JNI, which is a
    > *published* interface since long, but not an internal encoding
    > used only within compiled/serialized classes).

    Well, existing JNI interfaces won't change. But data encoded
    with non-shortest forms is *not* UTF-8, and people need to
    understand if they are using such data, it can lead to
    interoperability problems.

    > I never said that such (0xC0; 0x80) sequence is now a valid
    > UTF-8 encoding (yes now it's prohibited).

    It was *always* prohibited. It was merely *tolerated* when people
    did what was prohibited anyway. Now people are spanked instead
    of being given a wink and a nod.

    > I just say that this is an upper-level encoding on top of
    > UTF-8 needed for the very common case where the 0x00 byte is
    > interpreted as a string terminator and is not part of the
    > string content, and there's no other way to specify a total
    > encoded length to integrate that null byte as a significant
    > character.

    And I just say that this is a C string termination issue, and
    has nothing to do with UTF-8. The exact same problem for NUL
    applies to *every* other character encoding in any kind of
    widespread use, including all the "ASCII"-based ones (and DBCS)
    and the EBCDIC code pages. You don't go to 8859-1 or Code Page 437
    or MacRoman or Code Page 037 or GB 2312 and create "upper-level
    encodings" with escape mechanisms for 0x00 just so you could
    put NUL's into the strings for them for use with C runtime
    libraries. If you need to embed NUL characters in character
    arrays and treat them as strings (for *any* of these encodings),
    you modify and extend your string libraries so that they
    don't depend on null-termination of the strings.
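    A minimal sketch of such an extension, with a type and names
    entirely of my own invention: carry the length explicitly, and
    0x00 becomes just another content byte, in ASCII, UTF-8, or any
    legacy code page.

    ```c
    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    /* A counted string: the explicit length replaces the NUL
     * terminator, so U+0000 may appear anywhere in the content. */
    typedef struct {
        size_t len;
        unsigned char *bytes;
    } CountedString;

    /* Copy len bytes, NULs included, into a counted string.
     * (Allocation-failure handling elided for brevity.) */
    CountedString cstr_make(const void *data, size_t len) {
        CountedString s;
        s.len = len;
        s.bytes = malloc(len);
        memcpy(s.bytes, data, len);
        return s;
    }
    ```

    Once the libraries take (pointer, length) pairs, the question of
    escaping NUL never arises for any encoding.
    
    
    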

    > It may be the only way to represent Unicode strings that need
    > to include NULL characters with a huge set of C libraries that
    > depend on the fact that 0x00 is NOT part of the encoded string
    > and is ALWAYS a string terminator.

    The implication is totally erroneous, because this is nothing
    Unicode-specific, but applies to *every* character encoding in
    any significant use. You imply that the "only way" to handle
    this problem for Unicode strings is by a "trivial extension"
    to a Unicode-specific encoding form, when the problem is the
    same for every character encoding, and nobody advocates hacking
    up all the other encodings to fix the same problem for them.

    > But for now such derived encoding has no new formal name:
    > the old definition of UTF-8 was enough,

    ?? It did not allow it. Why would it have been given a formal
    name? It had an *informal* name: non-shortest form UTF-8,
    and was designated not to be a legal form.

    > but the new restriction of UTF-8 forgot to assign a name to
    > this case

    See above. It was not an oversight.

    > (only CUSE-8 was considered has meriting a technical report
    > and a new name but this addresses a distinct problem or legacy
    > usage). I think that both UTF-8 or CUSE-8 should have a variant
    > accepting this escaping mechanism for the NULL character as the
    > only way to represent it safely (UTF-8-NULL? CUSE-8-NULL ?)

    And I think this is a *terrible* idea, which will be roundly
    rejected. Let me state it one last time: it is bad advice to
    recommend that people use <0xC0 0x80> to represent U+0000
    (as any kind of extension to UTF-8).


    This archive was generated by hypermail 2.1.5 : Thu May 22 2003 - 22:09:16 EDT