Re: Unicode 4.0 BETA available for review

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Feb 27 2003 - 15:42:43 EST


    Frank Tang asked:

    > >> This discussion has been centered around UTF-8. But I hope the
    > >>corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0:
    > >>
    > >>. for UTF-32: occurrences of 'surrogates' are ill-formed.
    > >>
    > >>
    > >>
    > How about a UTF-32 sequence in which the 4 bytes represent a value
    > greater than U+10FFFF? Are they considered ill-formed? Should they be?

    Yes, they are ill-formed.

    Since all the encoding forms are based on the Unicode scalar values,
    and since the Unicode scalar values are *defined* to be the
    range 0x0000..0xD7FF, 0xE000..0x10FFFF, any attempt to represent
    a code point higher than U+10FFFF in *any* encoding form is
    ill-formed.
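
    As a rough illustration of that definition, the range check can be
    sketched in a few lines of Python. The function name is
    hypothetical and does not come from the standard; only the two
    ranges are taken from the definition above.

```python
def is_scalar_value(cp: int) -> bool:
    """Return True if cp is a Unicode scalar value.

    Hypothetical helper for illustration: scalar values are
    0x0000..0xD7FF and 0xE000..0x10FFFF, which excludes the
    surrogate range and everything above U+10FFFF.
    """
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF
```

    Under this check, 0xD800 (a surrogate) and 0x110000 (beyond the
    last code point) are both rejected, whichever encoding form they
    were read from.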

    This will be called out explicitly in the Unicode 4.0 text, in
    case anyone still has the question:

    " * Any UTF-32 code unit greater than 0010FFFF₁₆ is
        ill-formed."
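
    For anyone who wants to apply that rule to raw data, here is a
    minimal Python sketch that scans a UTF-32 byte string and reports
    the ill-formed code units. The function name and the choice to
    return indices are assumptions made for this example, not anything
    specified by the standard. It also flags surrogates, per the rule
    quoted earlier in this thread.

```python
def find_ill_formed_utf32(data: bytes, byteorder: str = "big") -> list:
    """Return the indices of ill-formed 32-bit code units in data.

    Hypothetical helper: a code unit is treated as ill-formed if it is
    greater than 0x10FFFF or falls in the surrogate range
    0xD800..0xDFFF. byteorder is "big" or "little"; no BOM handling
    is attempted here.
    """
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    bad = []
    for offset in range(0, len(data), 4):
        unit = int.from_bytes(data[offset:offset + 4], byteorder)
        if unit > 0x10FFFF or 0xD800 <= unit <= 0xDFFF:
            bad.append(offset // 4)  # index in code units, not bytes
    return bad
```

    A conforming decoder would reject such input rather than pass it
    through; how it reports the error is up to the implementation.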
        
    I can keep answering these questions, but I can also assure
    everyone that the UTC worked *very* hard this time around to
    make the character encoding model much clearer in the Unicode 4.0
    text, and to anticipate all these edge cases.

    --Ken


