Ill-formed sequences (was: Re: UTF-16 inside UTF-8)

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Nov 05 2003 - 01:22:48 EST

    While we are on the subject of ill-formed sequences, I was disappointed
    to read the following in the Adobe PDF Reference, Fourth Edition, which
    describes PDF version 1.5 and was published only ten weeks ago:

    > Note: PDF does not prescribe what UTF-8 sequence to choose for
    > representing any given piece of externally specified text as a name
    > object. In some cases, there are multiple UTF-8 sequences that could
    > represent the same logical text. Name objects defined by different
    > sequences of bytes constitute distinct name objects in PDF, even
    > though the UTF-8 sequences might have identical external
    > interpretations.

    I assume that by “multiple UTF-8 sequences that could represent the same
    logical text,” Adobe is referring to non-shortest UTF-8 sequences such
    as <C0 80> and not to Unicode canonical equivalences or something else.
    No similar warning about “multiple sequences” is given in the sections
    that deal with UTF-16.
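
    As an illustration of the point about <C0 80>, here is a minimal Python
    sketch showing that a strict decoder rejects non-shortest forms rather
    than reading them as U+0000. The constant and function names are
    hypothetical and chosen only for this example; the only library call
    used is the built-in bytes.decode with its default strict error
    handling.

```python
# Sketch: a strict UTF-8 decoder treats non-shortest forms as ill-formed.
# All names below are illustrative assumptions, not taken from any spec.

SHORTEST_NUL = bytes([0x00])                # the only well-formed UTF-8 for U+0000
OVERLONG_NUL_2 = bytes([0xC0, 0x80])        # 2-byte non-shortest form of U+0000
OVERLONG_NUL_3 = bytes([0xE0, 0x80, 0x80])  # 3-byte non-shortest form of U+0000


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if Python's strict decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for label, seq in [
        ("00", SHORTEST_NUL),
        ("C0 80", OVERLONG_NUL_2),
        ("E0 80 80", OVERLONG_NUL_3),
    ]:
        print(label, is_well_formed_utf8(seq))
```

    The three byte strings are distinct as raw bytes, which is presumably
    what the PDF note has in mind, but only the first is well-formed UTF-8.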

    If so, the note only serves to perpetuate the myth that non-shortest
    UTF-8 sequences are permitted in Unicode. One can cite the “tightening”
    of the definition of UTF-8 that occurred with Unicode 3.1 and 3.2 as a
    policy change, but the fact is that encoders have *never* been allowed
    to generate non-shortest sequences.

    Earlier conformance requirements that allowed decoders to interpret
    non-shortest forms were intended only to save a few CPU cycles for
    mid-’90s processors, not to give encoders free rein to generate what we now
    think of as ill-formed UTF-8 text. And in fact, the likelihood is that
    very little such text exists in the real world.

    Even the original “FSS-UTF” definition by Ken Thompson, which was
    written ten YEARS ago, made this clear:

    > When there are multiple ways to encode a value, for example UCS 0,
    > only the shortest encoding is legal.
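
    The "only the shortest encoding is legal" rule can be sketched in a few
    lines of Python. This is an illustration under stated assumptions, not
    anyone's reference implementation: it follows the current one-to-four
    byte definition of UTF-8 (the original FSS-UTF allowed longer
    sequences), and the helper encode_with_length is hypothetical, written
    only so that an overlong form can be built for comparison.

```python
# Sketch of the shortest-form rule. Function names are hypothetical.

def shortest_utf8_length(cp: int) -> int:
    """Number of bytes in the one legal UTF-8 encoding of a scalar value."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


def encode_with_length(cp: int, n: int) -> bytes:
    """Pack cp into the n-byte UTF-8 bit layout, even if that is overlong.

    Illustration only: a conforming encoder must never do this for any n
    other than shortest_utf8_length(cp).
    """
    if n == 1:
        if cp > 0x7F:
            raise ValueError("value does not fit in one byte")
        return bytes([cp])
    lead_marker = {2: 0xC0, 3: 0xE0, 4: 0xF0}[n]
    payload_bits = {2: 11, 3: 16, 4: 21}[n]
    if cp < 0 or cp >= (1 << payload_bits):
        raise ValueError("value does not fit in the requested length")
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    out.append(lead_marker | cp)        # lead byte carries the remaining bits
    return bytes(reversed(out))


def is_shortest_form(cp: int, encoded: bytes) -> bool:
    """True if the encoding uses exactly the minimal number of bytes."""
    return len(encoded) == shortest_utf8_length(cp)


if __name__ == "__main__":
    overlong = encode_with_length(0, 2)            # builds <C0 80>
    print(overlong.hex(), is_shortest_form(0, overlong))
    legal = encode_with_length(0x20AC, shortest_utf8_length(0x20AC))
    print(legal.hex(), is_shortest_form(0x20AC, legal))
```

    An encoder that always calls shortest_utf8_length before packing bits
    cannot produce <C0 80> in the first place.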

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


