Re: Encoding of non-characters

From: Doug Ewell (dewell@compuserve.com)
Date: Sat Jul 29 2000 - 13:40:30 EDT


Mark Davis <markdavis@ispchannel.com> wrote:

> Here is the issue. Because of the prevalence of UTF-16, and to
> preserve the round-tripping of UTFs to and from UTF-16 (even UTF-16
> containing mal-formed text containing non-characters and/or unpaired
> surrogates), a UTF must always roundtrip all codepoints between 0 and
> 10FFFF, inclusive.

Wait, now I'm lost. It was precisely *because* of UTF-16 that I thought
it was OK not to round-trip U+D800 through U+DFFF. After all, failing to
round-trip them is a characteristic of UTF-16 itself. For example, UTF-16
cannot round-trip the following illegal sequence of four UCS-2
(pre-UTF-16) code points:

    U+DC00 U+D800 U+DC00 U+D800

UTF-16 would regard this as the unpaired low surrogate U+DC00, followed
by the perfectly legal U+10000, followed by the unpaired high surrogate
U+D800. If I really intended to have four unpaired surrogates, I can't
use UTF-16 to represent them.
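
The pairing behaviour described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the helper
names encode_utf16_units and decode_utf16_units are hypothetical, not taken
from any library, and the decoder is a lenient one that passes unpaired
surrogates through unchanged rather than rejecting them.

```python
def encode_utf16_units(code_points):
    """Turn code points into 16-bit units (hypothetical helper, lenient)."""
    units = []
    for cp in code_points:
        if cp >= 0x10000:
            # Supplementary code point: split into a high/low surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
        else:
            # BMP value, including any lone surrogate, is copied as-is.
            units.append(cp)
    return units


def decode_utf16_units(units):
    """Turn 16-bit units back into code points (hypothetical helper, lenient)."""
    code_points = []
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # High surrogate followed by low surrogate: one supplementary code point.
            code_points.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            # Anything else, including an unpaired surrogate, passes through.
            code_points.append(u)
            i += 1
    return code_points


intended = [0xDC00, 0xD800, 0xDC00, 0xD800]   # four "unpaired" surrogates
decoded = decode_utf16_units(encode_utf16_units(intended))
print([f"U+{cp:04X}" for cp in decoded])
```

Under those assumptions the middle two values are read as a surrogate pair,
so three code points come back instead of the four that went in.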

> It is of course permissible for a UTF converter to offer an option to
> detect and throw an error on any mal-formed text.

Then is it a conformance requirement to round-trip malformed text
(including illegal Unicode code points), or isn't it?

-Doug Ewell
 Fullerton, California


