From: Doug Ewell (dewell@adelphia.net)
Date: Wed Nov 05 2003 - 01:22:48 EST
While we are on the subject of ill-formed sequences, I was disappointed
to read the following in the Adobe PDF Reference, Fourth Edition, which
describes PDF version 1.5 and was published only ten weeks ago:
> Note: PDF does not prescribe what UTF-8 sequence to choose for
> representing any given piece of externally specified text as a name
> object. In some cases, there are multiple UTF-8 sequences that could
> represent the same logical text. Name objects defined by different
> sequences of bytes constitute distinct name objects in PDF, even
> though the UTF-8 sequences might have identical external
> interpretations.
I assume that by “multiple UTF-8 sequences that could represent the same
logical text,” Adobe is referring to non-shortest UTF-8 sequences such
as <C0 80> and not to Unicode canonical equivalences or something else.
No similar warning about “multiple sequences” is given in the sections
that deal with UTF-16.
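
To make the <C0 80> case concrete, here is a minimal Python sketch. It assumes Python 3's built-in strict "utf-8" codec as the conformant decoder. The helper names is_well_formed_utf8 and lenient_two_byte_value are hypothetical, and the latter only imitates the bit arithmetic an old-style lenient decoder might have applied.

```python
# <C0 80> is a two-byte, non-shortest encoding of U+0000.
overlong_nul = bytes([0xC0, 0x80])
shortest_nul = "\u0000".encode("utf-8")  # the single byte 0x00


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def lenient_two_byte_value(data: bytes) -> int:
    """Hypothetical lenient decode of a two-byte sequence.

    Extracts the payload bits without checking that the result
    actually needed two bytes (the shortest-form check).
    """
    lead, trail = data
    return ((lead & 0x1F) << 6) | (trail & 0x3F)
```

Under these assumptions, is_well_formed_utf8(overlong_nul) is False while lenient_two_byte_value(overlong_nul) yields 0, the same value the one-byte form represents. The two byte strings still compare unequal, which is the situation the PDF note describes for name objects.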
If so, the note only serves to perpetuate the myth that non-shortest
UTF-8 sequences are permitted in Unicode. One can cite the “tightening”
of the definition of UTF-8 that occurred with Unicode 3.1 and 3.2 as a
policy change, but the fact is that encoders have *never* been allowed
to generate non-shortest sequences.
Earlier conformance requirements that allowed decoders to interpret
non-shortest forms were intended only to save a few CPU cycles for
mid-’90s processors, not to give encoders free rein to generate what we now
think of as ill-formed UTF-8 text. And in fact, the likelihood is that
very little such text exists in the real world.
Even the original “FSS-UTF” definition by Ken Thompson, which was
written ten YEARS ago, made this clear:
> When there are multiple ways to encode a value, for example UCS 0,
> only the shortest encoding is legal.
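
The shortest-encoding rule can be sketched as an encoder that picks the byte length purely from the value's range. This is an illustrative Python sketch, not Thompson's code. It assumes the modern UTF-8 range (U+0000 to U+10FFFF, surrogates excluded) rather than the wider range of the original FSS-UTF, and the name encode_shortest is hypothetical.

```python
def encode_shortest(cp: int) -> bytes:
    """Encode one code point using the shortest possible UTF-8 form.

    Assumes the modern range U+0000..U+10FFFF, excluding surrogates.
    """
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:          # fits in 7 bits: one byte
        return bytes([cp])
    if cp < 0x800:         # fits in 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # fits in 16 bits: three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),          # otherwise: four bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Because each branch is taken only when no shorter form could hold the value, an encoder built this way has no path that emits <C0 80> for UCS 0.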
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/