Re: Unicode conformant character encodings and us-ascii

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 15 2003 - 12:08:19 EDT


    From: <Yael.Aharon@nokia.com>
    > For the character encoding *forms*, the answer is no.
    >
    > In UTF-8, which uses 8-bit code units, 0x00..0x7F are always used
    > only for U+0000..U+007F, respectively. But for UTF-16, which uses
    > 16-bit code units, and UTF-32, which uses 32-bit code units, the
    > individual byte values are meaningless, and you could encounter
    > an 0x00..0x7F byte value anywhere in the middle of a code unit,
    > and it would have nothing to do with ASCII values.
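
    The point quoted above can be checked with a small sketch. The helper name ascii_range_bytes and the sample character U+4E41 are choices made here for illustration only; the codec names are those of the Python standard library.

```python
def ascii_range_bytes(text, codec):
    """Return the bytes of `text`, encoded with `codec`, that fall in 0x00..0x7F."""
    return bytes(b for b in text.encode(codec) if b < 0x80)

ideograph = "\u4e41"  # a CJK ideograph, not an ASCII character

# UTF-8: every byte of a non-ASCII code point is >= 0x80.
print(ascii_range_bytes(ideograph, "utf-8"))

# UTF-16 (big-endian serialization): the single 16-bit code unit 0x4E41
# is written as the bytes 0x4E 0x41, which look like ASCII "NA".
print(ascii_range_bytes(ideograph, "utf-16-be"))
```

    The two bytes found in the UTF-16 case are only fragments of one code unit and have nothing to do with the letters N and A.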

    Don't forget the other encodings of Unicode: UTF-7, BOCU and SCSU also use code unit values in the ASCII range. The underlying code points are still the same, but the code units definitely are not.
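
    As a rough illustration with UTF-7 only (Python's standard library ships a "utf-7" codec; BOCU and SCSU are not covered by this sketch), the same string yields ASCII-range bytes that do not all stand for ASCII characters:

```python
text = "caf\u00e9"  # "café": three ASCII letters plus U+00E9

utf7 = text.encode("utf-7")
utf8 = text.encode("utf-8")

print(utf7)  # every byte is below 0x80, including those carrying U+00E9
print(utf8)  # U+00E9 is carried by bytes >= 0x80

# Same code points on both sides, different code units.
print(utf7.decode("utf-7") == utf8.decode("utf-8"))
```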

    Unicode only defines code points, not their serialization into code units, nor technical aspects such as byte order (which matters for UTF-16 and UTF-32). Those forms are also used to encode subsets or supersets of Unicode, such as the old UCS-2, which is just a restriction of Unicode to the BMP and does not define a specific serialization.
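
    A minimal sketch of the byte-order issue, using the codec names of the Python standard library: the same 16-bit code unit can be serialized in two byte orders, and a byte order mark lets a decoder tell them apart.

```python
import codecs

text = "A"  # U+0041, one 16-bit code unit: 0x0041

be = text.encode("utf-16-be")  # high byte first
le = text.encode("utf-16-le")  # low byte first
print(be, le)

# Without out-of-band knowledge the two byte strings are ambiguous.
# Prefixing a byte order mark lets the generic "utf-16" decoder choose.
print((codecs.BOM_UTF16_BE + be).decode("utf-16"))
print((codecs.BOM_UTF16_LE + le).decode("utf-16"))
```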

    One could argue that all *precisely defined* legacy character encodings that work on subsets of Unicode (this includes the new GB2312 encoding) are Unicode conformant, since each acts as an encoding form for the Unicode strings it can represent. However, they must be considered distinct encodings and character sets, because they cannot represent all Unicode strings exactly (including non-normalized forms).
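
    The limitation can be sketched as follows. The helper can_encode is a name invented here; the "latin-1" and "gb2312" codecs are those of the Python standard library, and the sample strings are arbitrary.

```python
def can_encode(text, codec):
    """Return True if every code point of `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Precomposed U+00E9 exists in Latin-1 ...
print(can_encode("\u00e9", "latin-1"))
# ... but the canonically equivalent decomposed form (e + U+0301) does not.
print(can_encode("e\u0301", "latin-1"))

# GB2312 covers this ideograph but has no euro sign.
print(can_encode("\u4e2d", "gb2312"))
print(can_encode("\u20ac", "gb2312"))
```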

    However, ISO 2022 is conformant with Unicode and can be seen as an alternative to the general-purpose Unicode encoding forms, because it can switch among many encodings, including the UTF-* encoding forms. The difference is that a full implementation is extremely complex: it is based on a repertoire of encodings not defined by Unicode, and it requires a specific parser for each supported subset and sub-encoding.
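
    The switching mechanism can be glimpsed with ISO-2022-JP, which is only one small profile of ISO 2022 (it does not switch to any UTF). This sketch assumes the "iso2022_jp" codec of the Python standard library; the sample string is arbitrary.

```python
ESC = b"\x1b"

text = "a\u3042"  # ASCII "a" followed by HIRAGANA LETTER A
encoded = text.encode("iso2022_jp")
print(encoded)

# An escape sequence designates JIS X 0208 before the kana, and another
# one switches back to ASCII at the end of the string.
print(ESC + b"$B" in encoded)
print(encoded.endswith(ESC + b"(B"))

# The stream stays 7-bit throughout; a decoder has to track the current
# state to interpret each byte.
print(all(b < 0x80 for b in encoded))
```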


