Re: Unicode 4.0 BETA available for review

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Feb 26 2003 - 19:23:25 EST


    Frank Tang continued:

    > >If you read through those definitions from Unicode 4.0 carefully,
    > >you will see that UTF-8 representing a noncharacter is perfectly
    > >valid, but UTF-8 representing an unpaired surrogate code point
    > >is ill-formed (and therefore disallowed).
    > >
    > I see a hole here. How about UTF-8 representing a pair of surrogate
    > code points with two 3-octet sequences instead of one 4-octet UTF-8
    > sequence? It should be ill-formed since it is non-shortest form also,
    > right? But we really need to watch out for the language used there so we
    > won't create new problems. I DO NOT want people to think one 3-octet
    > UTF-8 surrogate low or high is ill-formed but one 3-octet UTF-8
    > surrogate high followed by one 3-octet UTF-8 surrogate low is legal.

    This is old news.

    Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
    sequences. There were two types:

       a. 0xC0 0x80 for U+0000 (instead of 0x00)
       b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80 0x80)
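    As a sketch (mine, not part of the original mail), here is how both
    irregular sequences fare against a modern strict UTF-8 decoder;
    Python's built-in codec is used as one example of a conformant
    implementation:

```python
# The two "irregular" Unicode 3.0 sequences (a) and (b) from above.
IRREGULAR = {
    "(a) non-shortest form of U+0000": b"\xC0\x80",
    "(b) surrogate pair for U+10000 (CESU-8 style)": b"\xED\xA0\x80\xED\xB0\x80",
}

def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if `data` is well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

for label, seq in IRREGULAR.items():
    status = "well-formed" if is_well_formed_utf8(seq) else "ill-formed"
    print(label, "->", status)

# The only valid UTF-8 form of U+10000 is the 4-byte sequence:
assert "\U00010000".encode("utf-8") == b"\xF0\x90\x80\x80"
```

    Both sequences are rejected in strict mode, as the rest of this mail
    argues they must be.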
       
    Type (b), encoding two surrogate code points as if they were
    characters, instead of encoding the code point of the character
    itself (using the 4-byte form of UTF-8), is what has come to
    be documented as "CESU-8", but it has never been allowed for
    UTF-8. Cf. Unicode 2.0, p. A-8:

       "When converting Unicode values to UTF-8, always use the shortest
        form that can represent those values. ..."
        
    Such language was carried forward into Unicode 3.0, p. 47,
    strengthened to make the point:

       "When converting a Unicode scalar value to UTF-8, the shortest
        form that can represent those values shall be used. ..."
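    One consequence of the shortest-form rule, sketched here as a small
    helper of my own (not from the standard): each scalar value has exactly
    one permissible UTF-8 length, determined by its magnitude.

```python
def utf8_length(scalar: int) -> int:
    """Number of bytes in the unique shortest UTF-8 form of a scalar value."""
    if scalar < 0x80:
        return 1
    if scalar < 0x800:
        return 2
    if scalar < 0x10000:
        return 3
    return 4

assert utf8_length(0x0000) == 1    # so the 2-byte 0xC0 0x80 is disallowed
assert utf8_length(0x10000) == 4   # so any 6-byte surrogate pairing is out
```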
        
    The problem in Unicode 3.0 was that it allowed a loophole for
    *interpretation* of both kinds of non-shortest forms, on the
    assumption that interpretation of non-shortest forms would be
    harmless. That was criticized as a security hole, and was
    addressed in Unicode 3.1 (and tweaked further in Unicode 3.2).

    Unicode 3.2 stated, in C12:

       "Conformant processes cannot interpret ill-formed code
        unit sequences..."
        
    And that is what (a) and (b) above are, namely ill-formed code
    unit sequences.

    The Unicode 4.0 text further strengthens Conformance Clause
    C12, to make this crystal clear:

       "C12 When a process generates a code unit sequence which
        purports to be in a Unicode character encoding form, it shall
        not emit ill-formed code unit sequences.
        
       "C12a When a process interprets a code unit sequence which
        purports to be in a Unicode character encoding form, it
        shall treat ill-formed code unit sequences as an error
        condition, and shall not interpret such sequences as
        characters."
        
    And just in case anyone still has any trouble reading the
    painfully detailed specification of the UTF-8
    encoding form, an explicit note is included there:

       "* Because surrogate code points are not Unicode scalar
          values, any UTF-8 byte sequence that would otherwise
          map to code points D800..DFFF is ill-formed."
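    To illustrate the note (again my sketch, with Python's codec standing
    in for a conformant implementation): surrogate code points are refused
    in both directions, while noncharacters, which *are* scalar values,
    encode without complaint -- exactly the distinction drawn at the top
    of this thread.

```python
def can_encode(cp: int) -> bool:
    """True if the code point is a scalar value the UTF-8 codec accepts."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

assert not can_encode(0xD800)    # high surrogate: not a scalar value
assert not can_encode(0xDFFF)    # low surrogate: not a scalar value
assert can_encode(0xFDD0)        # noncharacter: valid in UTF-8
assert can_encode(0x10FFFF)      # noncharacter, and the largest code point

# Decoding the would-be 3-byte form of U+D800 fails for the same reason.
try:
    b"\xED\xA0\x80".decode("utf-8")
except UnicodeDecodeError:
    print("ED A0 80 is ill-formed, as the note says")
```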
          
    So I don't think there is any hole here. If anyone still
    thinks that they can use these 3-octet/3-octet encodings
    of supplementary characters and call it UTF-8, then they
    are either engaging in wishful thinking or are not reading
    the standard carefully enough.

    --Ken



    This archive was generated by hypermail 2.1.5 : Wed Feb 26 2003 - 20:04:34 EST