Re: RFC, 5-6 octets sequence in UTF8, non short form in UTF8

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Feb 19 2003 - 11:44:59 EST


    Yung-Fong Tang <ftang at netscape dot com> wrote:

    > I read the RFC 2279 again (
    > http://www.cis.ohio-state.edu/cs/Services/rfc/rfc-text/rfc2279.txt )
    > 1. I cannot find any text in it mentioned about. non short form is
    > invalid UTF8, and

    First, we've already established that a revision to RFC 2279 is in the
    works.

    That said, the existing RFC 2279 says the following:

    "Encoding from UCS-4 to UTF-8 proceeds as follows:

    "1) Determine the number of octets required from the character value
        and the first column of the table above. It is important to note
        that the rows of the table are mutually exclusive, i.e. there is
        only one valid way to encode a given UCS-4 character."

    The phrase "only one valid way" makes it very clear, at least to me,
    that non-shortest forms are invalid. And in the "Security
    Considerations" section, overlong sequences are referred to as "illegal
    UTF-8 sequences." This has not changed in the draft replacement,
    probably because it is already sufficient.
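
    To make the "only one valid way" rule concrete, here is a minimal
    sketch in Python of a decoder that enforces it. The names
    octets_required and decode_one are hypothetical, made up for this
    illustration, and the length table is the RFC 2279 one, including the
    5- and 6-octet rows. It checks only the shortest-form rule, not the
    surrogate range.

```python
def octets_required(value: int) -> int:
    """Sequence length that the RFC 2279 table assigns to a UCS-4 value."""
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for length, limit in enumerate(limits, start=1):
        if 0 <= value <= limit:
            return length
    raise ValueError("value is outside the UCS-4 range")


def decode_one(octets: bytes) -> int:
    """Decode exactly one sequence, rejecting non-shortest forms."""
    if not octets:
        raise ValueError("empty input")
    lead = octets[0]
    # The lead octet announces the sequence length and carries the top bits.
    if lead < 0x80:
        length, value = 1, lead
    elif 0xC0 <= lead <= 0xDF:
        length, value = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, value = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        length, value = 4, lead & 0x07
    elif 0xF8 <= lead <= 0xFB:
        length, value = 5, lead & 0x03
    elif 0xFC <= lead <= 0xFD:
        length, value = 6, lead & 0x01
    else:
        raise ValueError("invalid lead octet")
    if len(octets) != length:
        raise ValueError("wrong number of octets")
    # Each continuation octet must look like 10xxxxxx and adds six bits.
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("invalid continuation octet")
        value = (value << 6) | (octet & 0x3F)
    # The rows of the table are mutually exclusive: a value that fits a
    # shorter row must not arrive in a longer sequence.
    if octets_required(value) != length:
        raise ValueError("non-shortest form")
    return value
```

    With this sketch, decode_one(b"\x2f") returns 0x2F, while the overlong
    spelling of the same character, decode_one(b"\xc0\xaf"), raises
    ValueError.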

    > 3. It mentioned about how to encode surrogate pair to UTF-8. But it
    > does not say the UTF8 sequence mapping directly to Surrogate High and
    > Surrogate Low are illegal

    Again, from RFC 2279:

    "UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
    into pairs of UCS-2 values from a reserved range. UTF-16 impacts
    UTF-8 in that UCS-2 values from the reserved range must be treated
    specially in the UTF-8 transformation."

    and again:

    "The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
    obtained from the above, in principle, by simply extending each
    UCS-2 character with two zero-valued octets. However, pairs of
    UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
    parlance), being actually UCS-4 characters transformed through
    UTF-16, need special treatment: the UTF-16 transformation must be
    undone, yielding a UCS-4 character that is then transformed as
    above."

    It's pretty hard to read these paragraphs and come away with the
    impression that it's OK to map directly between UTF-8 and UTF-16 code
    units. Such a mapping would seem necessary only to someone who ignored
    the existence of UTF-16 and these passages in RFC 2279, and who treated
    every 16-bit code unit as a character (as some database vendors
    evidently did). The only "shortcoming" in the RFC is that it doesn't
    use the word "illegal" to describe such a mapping.

    The draft replacement adds the following, which should remove all doubt:

    "The definition of UTF-8 prohibits encoding character numbers between
    U+D800 and U+DFFF, which are reserved for use with the UTF-16
    encoding form (as surrogate pairs) and do not directly represent
    characters. When encoding in UTF-8 from UTF-16 data, it is necessary
    to first decode the UTF-16 data to obtain character numbers, which
    are then encoded in UTF-8 as described above."
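
    The two-step procedure in that paragraph can be sketched as follows.
    This is only an illustration in Python: utf16_to_utf8 is a hypothetical
    name, the input is assumed to be a list of 16-bit code units, and
    Python's built-in encoder stands in for the "encoded in UTF-8 as
    described above" step.

```python
def utf16_to_utf8(units):
    """Transcode a sequence of 16-bit UTF-16 code units to UTF-8 octets."""
    out = bytearray()
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            # Undo the UTF-16 transformation to recover the character number.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            code_point = unit
            i += 1
        # Only now is the character number encoded in UTF-8.
        out += chr(code_point).encode("utf-8")
    return bytes(out)
```

    For example, utf16_to_utf8([0xD801, 0xDC00]) yields the four octets
    F0 90 90 80 for U+10400, not two three-octet sequences.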

    Side note: I'm a little disappointed that the draft replacement goes on
    to include a description of CESU-8, which is basically a perversion of
    UTF-8 for processes that are ignorant of UTF-16, and which the RFC later
    (and correctly) refers to as "a naive implementation." CESU-8 is best
    kept in a dark closet and used internally only by processes that have no
    choice, and not publicized any more than necessary.
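
    For contrast, here is a rough Python sketch of what such a naive
    process does. The helper encode_units_naively is a hypothetical name
    for this illustration, and it assumes that each 16-bit code unit is
    simply treated as a character in its own right.

```python
def encode_units_naively(units):
    """Encode every 16-bit code unit as its own sequence (CESU-8 style)."""
    out = bytearray()
    for unit in units:
        if unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out += bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
        else:
            # Surrogate code units fall through here like any other value.
            out += bytes([0xE0 | (unit >> 12),
                          0x80 | ((unit >> 6) & 0x3F),
                          0x80 | (unit & 0x3F)])
    return bytes(out)


naive = encode_units_naively([0xD801, 0xDC00])  # U+10400 as a surrogate pair
proper = "\U00010400".encode("utf-8")

print(naive.hex())   # eda081edb080
print(proper.hex())  # f0909080
```

    Python's strict UTF-8 decoder refuses the six-octet form:
    naive.decode("utf-8") raises UnicodeDecodeError.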

    -Doug Ewell
     Fullerton, California


