Re: CodePage Information

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 22 2003 - 19:02:54 EDT

  • Next message: Kenneth Whistler: "Re: Is it true that Unicode is insufficient for Oriental languages?"

    From: "Kenneth Whistler" <kenw@sybase.com>
    > So Doug is correct. 0xC0 0x80 is not a permissible representation
    > of U+0000 in UTF-8, and it is bad advice to recommend to people
    > that they should use it.

    This is not what I said or meant. The main reason why the 0x00 byte causes problems is that it is most often used as a string terminator, contrary to what ASCII and Unicode define for the NULL character. In that case one cannot encode it: the device or protocol does not support sending a separate length specifier and needs the 0x00 byte to terminate the string, so a NULL character in a Unicode string cannot be encoded even when it is needed.
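    To illustrate the problem (this sketch is mine, not part of the original discussion): in standard UTF-8, U+0000 encodes as the single byte 0x00, which is exactly the byte that NUL-terminated C APIs treat as end-of-string, so the rest of the string is silently lost:

    ```python
    # In standard UTF-8, U+0000 encodes as the single byte 0x00 --
    # the same byte C libraries treat as a string terminator.
    s = "abc\u0000def"
    encoded = s.encode("utf-8")
    assert encoded == b"abc\x00def"

    # A NUL-terminated consumer (e.g. anything built on strlen/strcpy)
    # sees only the bytes before the first 0x00:
    c_view = encoded.split(b"\x00", 1)[0]
    assert c_view == b"abc"  # the trailing "def" is lost
    ```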

    This is the case where an escaping mechanism using otherwise unused parts of UTF-8 can make sense, and I don't think Sun made an error by using such a mechanism to allow strings containing a significant NULL character to be passed through JNI. At the time Sun adopted it for Java, it was a valid and conformant UTF-8 encoding for that character, and I see no good reason why Sun would change it now and break the backward compatibility of JNI, which has long been a *published* interface, not merely an internal encoding used only within compiled/serialized classes.

    I never said that the (0xC0, 0x80) sequence is still a valid UTF-8 encoding (yes, it is now prohibited). I am only saying that it is a higher-level encoding on top of UTF-8, needed for the very common case where the 0x00 byte is interpreted as a string terminator rather than as part of the string content, and where there is no other way to specify a total encoded length that would let the null byte stand as a significant character.
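    The escaping described above can be sketched roughly as follows (a minimal illustration with names of my own choosing, not an authoritative implementation of Sun's encoder): each U+0000 is replaced by the two-byte overlong sequence 0xC0 0x80, so the encoded string never contains a 0x00 byte and can pass safely through NUL-terminated APIs:

    ```python
    def encode_escaped_utf8(s: str) -> bytes:
        """Encode as UTF-8, escaping U+0000 as the overlong pair 0xC0 0x80."""
        out = bytearray()
        for ch in s:
            if ch == "\u0000":
                out += b"\xc0\x80"         # overlong escape for NUL
            else:
                out += ch.encode("utf-8")  # everything else is plain UTF-8
        return bytes(out)

    def decode_escaped_utf8(b: bytes) -> str:
        """Undo the escape, then decode as standard UTF-8."""
        # Safe because 0xC0 never occurs in valid UTF-8 output above,
        # so the pair 0xC0 0x80 can only be the NUL escape.
        return b.replace(b"\xc0\x80", b"\x00").decode("utf-8")

    data = encode_escaped_utf8("a\u0000b")
    assert data == b"a\xc0\x80b"
    assert b"\x00" not in data             # safe for NUL-terminated APIs
    assert decode_escaped_utf8(data) == "a\u0000b"
    ```

    The round trip preserves the embedded NULL character while keeping 0x00 free to act as the terminator, which is exactly the property the upper-level encoding is meant to provide.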

    It may be the only way to represent Unicode strings containing NULL characters with the huge set of C libraries that depend on 0x00 NOT being part of the encoded string and ALWAYS acting as a string terminator.

    But for now this derived encoding has no formal name: the old definition of UTF-8 was sufficient, but the new restriction of UTF-8 neglected to assign a name to this case (only CESU-8 was considered as meriting a technical report and a new name, but it addresses a distinct problem of legacy usage). I think that both UTF-8 and CESU-8 should have a variant accepting this escaping mechanism for the NULL character as the only way to represent it safely (UTF-8-NULL? CESU-8-NULL?)



    This archive was generated by hypermail 2.1.5 : Thu May 22 2003 - 19:57:49 EDT