Ill-formed sequences (was: Re: UTF-16 inside UTF-8)

From: Doug Ewell ([email protected])
Date: Wed Nov 05 2003 - 01:22:48 EST

Next message: Jungshik Shin: "Re: UTF-16 inside UTF-8"

Previous message: Doug Ewell: "Re: UTF8 and COntrol Characters"
In reply to: Jill Ramonsky: "UTF-16 inside UTF-8"
Next in thread: Addison Phillips [wM]: "RE: Ill-formed sequences (was: Re: UTF-16 inside UTF-8)"
Reply: Addison Phillips [wM]: "RE: Ill-formed sequences (was: Re: UTF-16 inside UTF-8)"

While we are on the subject of ill-formed sequences, I was disappointed
to read the following in the Adobe PDF Reference, Fourth Edition, which
describes PDF version 1.5 and was published only ten weeks ago:

> Note: PDF does not prescribe what UTF-8 sequence to choose for
> representing any given piece of externally specified text as a name
> object. In some cases, there are multiple UTF-8 sequences that could
> represent the same logical text. Name objects defined by different
> sequences of bytes constitute distinct name objects in PDF, even
> though the UTF-8 sequences might have identical external
> interpretations.

I assume that by “multiple UTF-8 sequences that could represent the same
logical text,” Adobe is referring to non-shortest UTF-8 sequences such
as <C0 80> and not to Unicode canonical equivalences or something else.
No similar warning about “multiple sequences” is given in the sections
that deal with UTF-16.
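
To make the <C0 80> case concrete, here is a minimal Python sketch. It
assumes a strict, current UTF-8 decoder (Python's built-in codec); the
helper name is_well_formed_utf8 is made up for illustration and does not
come from the PDF Reference or the Unicode Standard. It shows that the
two byte sequences are distinct as raw bytes, which is all a PDF name
object comparison would see, and that only one of them is accepted as
UTF-8.

```python
# The only well-formed UTF-8 encoding of U+0000 is the single byte 00.
shortest = "\u0000".encode("utf-8")

# A non-shortest ("overlong") two-byte form that a lenient decoder
# would also map to U+0000.
overlong = bytes([0xC0, 0x80])


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Byte-for-byte the two sequences differ, so they would be distinct
# name objects, but only the first one is UTF-8 at all.
print(shortest != overlong)
print(is_well_formed_utf8(shortest))
print(is_well_formed_utf8(overlong))
```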

If so, the note only serves to perpetuate the myth that non-shortest
UTF-8 sequences are permitted in Unicode. One can cite the “tightening”
of the definition of UTF-8 that occurred with Unicode 3.1 and 3.2 as a
policy change, but the fact is that encoders have *never* been allowed
to generate non-shortest sequences.

Earlier conformance requirements that allowed decoders to interpret
non-shortest forms were intended only to save a few CPU cycles for
mid-’90s processors, not to give encoders free rein to generate what we
now think of as ill-formed UTF-8 text. And in fact, the likelihood is
that very little such text exists in the real world.

Even the original “FSS-UTF” definition by Ken Thompson, which was
written ten YEARS ago, made this clear:

> When there are multiple ways to encode a value, for example UCS 0,
> only the shortest encoding is legal.
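
The shortest-encoding rule can be sketched as a check on top of a
deliberately lenient decoder, roughly the kind the older conformance
wording tolerated. This is an illustrative sketch, not code from FSS-UTF
or the Unicode Standard: the function names are hypothetical, it handles
one sequence of at most four bytes, and it ignores other well-formedness
rules such as surrogates and the U+10FFFF limit.

```python
def lenient_decode(seq: bytes) -> int:
    """Decode one UTF-8-style sequence WITHOUT a shortest-form check."""
    lead = seq[0]
    if lead < 0x80:
        length, value = 1, lead
    elif lead & 0xE0 == 0xC0:
        length, value = 2, lead & 0x1F
    elif lead & 0xF0 == 0xE0:
        length, value = 3, lead & 0x0F
    elif lead & 0xF8 == 0xF0:
        length, value = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("sequence length does not match lead byte")
    for trail in seq[1:]:
        if trail & 0xC0 != 0x80:
            raise ValueError("invalid trail byte")
        value = (value << 6) | (trail & 0x3F)
    return value


def shortest_length(code_point: int) -> int:
    """Number of bytes the shortest UTF-8 form of code_point needs."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def is_shortest_form(seq: bytes) -> bool:
    """True only if seq is the shortest encoding of the value it decodes to."""
    return len(seq) == shortest_length(lenient_decode(seq))
```

Under these assumptions, both bytes([0x00]) and bytes([0xC0, 0x80])
decode leniently to 0, but only the first passes is_shortest_form.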

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
