Re: Unicode 4.0 BETA available for review

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Feb 25 2003 - 18:36:40 EST

    Frank Tang asked:

    > so the UTF-8 sequence which represent U+FFFE U+FFFF and U+{1-11}FFF{E,F}
    > are consider legal in Unicode 4.0

    Yes. Such sequences are also legal in Unicode 3.0, 3.1, and 3.2.

    The Unicode Standard, Version 3.0 specified, on p. 46:

    "To ensure that round-trip transcoding is possible, a UTF
    mapping *must also* map invalid Unicode scalar values to
    unique code value sequences. These invalid scalar values
    include FFFE<sub>16</sub>, FFFF<sub>16</sub>, and unpaired
    surrogates."

    The Unicode Standard, Version 3.1 disallowed non-shortest
    UTF-8 sequences, which it defined to be illegal. It disallowed
    the *generation* of irregular UTF-8 sequences (which involve
    the mapping of surrogate code points). Unicode 3.1 also
    defined the term "noncharacter", which includes U+FFFE,
    U+FFFF, the last two code points on each of the other planes,
    and U+FDD0..U+FDEF, and all of *those* values were perfectly
    valid in UTF-8, as shown by Table 3.1B, "Legal UTF-8 Byte
    Sequences."
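
    As a rough illustration of that definition (not text from the
    standard), the noncharacter set can be sketched in Python. The
    function name is_noncharacter is made up for this example.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a noncharacter code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF: a contiguous run of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 32 + 2 * 17 planes = 66
```

    Note that nothing in this test involves the surrogate range
    U+D800..U+DFFF; noncharacters and surrogates are disjoint sets.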

    The Unicode Standard, Version 3.2, changed the term "illegal"
    to "ill-formed", and disallowed all ill-formed UTF-8
    sequences, including the CESU-8-style irregular sequences.
    However, once again, noncharacters are perfectly valid in
    Table 3.1B, Legal UTF-8 Byte Sequences.
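
    For anyone who wants to see the distinction in practice, here is
    a small sketch using Python 3's built-in strict UTF-8 decoder. The
    assumption is that this decoder follows the current well-formedness
    rules for these cases; the helper name decodes_strictly is
    invented for the example.

```python
def decodes_strictly(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "non-shortest form of U+0000": b"\xc0\x80",
    "non-shortest form of U+002F": b"\xc0\xaf",
    "surrogate pair as two 3-byte sequences": b"\xed\xa0\x80\xed\xb0\x80",
    "noncharacter U+FFFF": b"\xef\xbf\xbf",
}

for label, data in samples.items():
    verdict = "accepted" if decodes_strictly(data) else "rejected"
    print(f"{label}: {verdict}")
```

    Only the last sample, the noncharacter, should be accepted.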

    The relevant text from Unicode 4.0, Chapter 3, is:

    "D28 Unicode scalar value: any Unicode code point except
    high-surrogate and low-surrogate code points."

    "D36 UTF-8 encoding form: the Unicode encoding form which
    assigns each Unicode scalar value to an unsigned byte sequence
    of one to four bytes in length, as specified in Table 3-5.
     * Any UTF-8 byte sequence that does not match the patterns
       listed in Table 3-6 is ill-formed."
       
    And "Table 3-5" is basically equivalent to Table 3-1 of
    Unicode 3.0 (see p. 47), while "Table 3-6" is equivalent
    to Table 3.1B "Legal UTF-8 Byte Sequences", published
    in Unicode 3.2.
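
    The pattern matching that the second table implies can be
    sketched as a small validator. The byte ranges below are my
    transcription of the well-formed UTF-8 patterns, so check them
    against the published table before relying on them; the names
    _PATTERNS and is_well_formed_utf8 are hypothetical.

```python
# Each entry: (lead byte range, second byte range or None,
#              number of further trailing bytes, each in 80..BF)
_PATTERNS = [
    ((0x00, 0x7F), None, 0),          # U+0000..U+007F
    ((0xC2, 0xDF), (0x80, 0xBF), 0),  # U+0080..U+07FF
    ((0xE0, 0xE0), (0xA0, 0xBF), 1),  # U+0800..U+0FFF
    ((0xE1, 0xEC), (0x80, 0xBF), 1),  # U+1000..U+CFFF
    ((0xED, 0xED), (0x80, 0x9F), 1),  # U+D000..U+D7FF (no surrogates)
    ((0xEE, 0xEF), (0x80, 0xBF), 1),  # U+E000..U+FFFF
    ((0xF0, 0xF0), (0x90, 0xBF), 2),  # U+10000..U+3FFFF
    ((0xF1, 0xF3), (0x80, 0xBF), 2),  # U+40000..U+FFFFF
    ((0xF4, 0xF4), (0x80, 0x8F), 2),  # U+100000..U+10FFFF
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every sequence in data matches one of the patterns."""
    i, n = 0, len(data)
    while i < n:
        for lead, second, trailing in _PATTERNS:
            if lead[0] <= data[i] <= lead[1]:
                break
        else:
            return False  # 80..C1 and F5..FF never start a sequence
        i += 1
        if second is None:
            continue  # single-byte sequence
        if i >= n or not second[0] <= data[i] <= second[1]:
            return False
        i += 1
        for _ in range(trailing):
            if i >= n or not 0x80 <= data[i] <= 0xBF:
                return False
            i += 1
    return True


print(is_well_formed_utf8(b"\xef\xbf\xbe"))  # U+FFFE, a noncharacter
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # would be U+D800, a surrogate
```

    The row for lead byte ED is the one that excludes surrogates:
    its second byte stops at 9F, so ED A0..BF never matches. No row
    singles out the noncharacters.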

    If you read through those definitions from Unicode 4.0 carefully,
    you will see that UTF-8 representing a noncharacter is perfectly
    valid, but UTF-8 representing an unpaired surrogate code point
    is ill-formed (and therefore disallowed).
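
    The same conclusion can be checked from the encoding side. This
    is only a sketch that leans on Python 3's strict UTF-8 encoder,
    which is assumed here to refuse surrogate code points; the helper
    name utf8_or_error is made up.

```python
def utf8_or_error(cp: int) -> str:
    """Return the UTF-8 bytes for cp as hex, or a note if encoding fails."""
    try:
        encoded = chr(cp).encode("utf-8")
    except UnicodeEncodeError as exc:
        return f"error: {exc.reason}"
    return " ".join(f"{b:02X}" for b in encoded)


# Four noncharacters followed by the first and last surrogate code points.
for cp in (0xFFFE, 0xFFFF, 0x1FFFE, 0x10FFFF, 0xD800, 0xDFFF):
    print(f"U+{cp:04X}: {utf8_or_error(cp)}")
```

    The four noncharacters come out as ordinary three- and four-byte
    sequences (for example, EF BF BE for U+FFFE), while the two
    surrogate code points produce an error.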

    Through all of these tightenings of the wording regarding
    UTF-8, it has continuously been true (for Unicode 3.0, 3.1,
    3.2, and 4.0) that UTF-8 for noncharacter code points is valid.

    ISO/IEC 10646-1:2000 had a flaw in it, in that Annex D
    contained language in a note indicating that the UTF-8
    for 0000FFFE and 0000FFFF was not defined (while allowing
    0001FFFE, 0001FFFF, etc.). That flaw was corrected in
    Amendment 1 to ISO/IEC 10646-1:2000, so at this point, the
    definition in the Unicode Standard and the definition in
    10646 are perfectly aligned.

    So let me repeat the summary, for those who have gotten
    this far:

       UTF-8 for noncharacters is *valid*.
       
       UTF-8 for surrogate code points is *ill-formed*. (Unicode-ese)
             for RC-elements is *undefined*. (10646-ese)
             
    --Ken


