RE: Unicode conformant character encodings and us-ascii

Date: Wed May 14 2003 - 20:26:22 EDT

    I see now why you thought the question was odd. I actually meant to ask about the various ISO (e.g. the 8859 variants) and Windows character encodings.

    > Does anyone know if all character encodings that conform to the Unicode spec

    There is only *one* character encoding that conforms to the "Unicode spec",
    namely, the Unicode character encoding.

    > reserve 0x00 - 0x7F to us-ascii characters?

    But from this, I infer what you are trying to get at is whether UTF-8,
    UTF-16, UTF-32 (each of which is an encoding *form* of the Unicode
    character encoding) all reserve those values as ASCII characters.

    For the character *encoding*, the answer is yes: U+0000..U+007F are
    exactly identical to the characters of ASCII.

    For the character encoding *forms*, the answer is no.

    In UTF-8, which uses 8-bit code units, 0x00..0x7F are always used
    only for U+0000..U+007F, respectively. But for UTF-16, which uses
    16-bit code units, and UTF-32, which uses 32-bit code units, the
    individual byte values are meaningless, and you could encounter
    a 0x00..0x7F byte value anywhere in the middle of a code unit,
    and it would have nothing to do with ASCII values.
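
    To make the difference concrete, here is a minimal sketch in Python
    (my choice of language, and U+4E41 is just an example code point I
    picked; the helper name ascii_range_bytes is hypothetical). It
    serializes one non-ASCII character in each encoding form, big-endian,
    and lists the bytes that fall in 0x00..0x7F.

```python
# U+4E41 is a CJK ideograph, chosen only because its 16-bit code unit
# 0x4E41 happens to split into the bytes 0x4E and 0x41 ("N" and "A").
ch = "\u4e41"

utf8 = ch.encode("utf-8")       # three bytes, all >= 0x80
utf16 = ch.encode("utf-16-be")  # one 16-bit code unit, big-endian
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit, big-endian


def ascii_range_bytes(data):
    """Return the byte values in data that fall in 0x00..0x7F."""
    return [b for b in data if b <= 0x7F]


for name, data in (("UTF-8", utf8), ("UTF-16BE", utf16), ("UTF-32BE", utf32)):
    print(name, data.hex(" "), [hex(b) for b in ascii_range_bytes(data)])

# For U+0000..U+007F, UTF-8 produces exactly the ASCII byte.
assert all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80))
```

    In UTF-8 no byte of this character lands in 0x00..0x7F, while the
    UTF-16 and UTF-32 serializations contain 0x4E and 0x41 (and, for
    UTF-32, two 0x00 bytes) that have nothing to do with ASCII "N", "A",
    or NUL.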

    > If there is a spec that requires this behavior, which spec is it?

    The Unicode Standard. ;-)


    and, in particular, Section 3.9, Encoding Forms.


    > Or can anyone give me an example of a conformant character
    > encoding that does not reserve these bytes to us-ascii?
    > thanks
