Re: Encoding of non-characters

From: Doug Ewell (dewell@compuserve.com)
Date: Tue Aug 01 2000 - 00:51:41 EDT


Mark Davis <markdavis@ispchannel.com> or perhaps <mark@macchiato.com>
wrote:

> Let me repeat your example. A UTF-16BE converter takes the codepoints
> (aka scalar values)
>
> code points: U+DC00 U+D800 U+DC00 U+D800 U+FFFF

Not exactly the same as my example, which didn't have the trailing
U+FFFF. But it doesn't matter, since U+D800 followed by either EOF or
U+FFFF is an unpaired high surrogate either way, so let's just go with
it.

> converts them to bytes:
>
> UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF

Which presumes that it's even supposed to do that. I thought UTF-16
decoders and other Unicode-compliant processes were supposed to treat
unpaired surrogates as if they, well, didn't mean jack. I personally
replace them with U+FFFD. Is that wrong?
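
For concreteness, here is roughly the decoding rule I apply, written out
in Java-ish code like the pseudocode quoted further down. This is only a
sketch, and the function name is my own invention, not anything from a
standard library:

// Sketch only: decode UTF-16BE bytes to code points, substituting
// U+FFFD for every code unit that turns out to be an unpaired surrogate.
static int[] decodeUTF16BE(byte[] b) {
    int[] out = new int[b.length / 2];
    int n = 0;
    for (int i = 0; i + 1 < b.length; i += 2) {
        int unit = ((b[i] & 0xFF) << 8) | (b[i + 1] & 0xFF);
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 3 < b.length) {
            int next = ((b[i + 2] & 0xFF) << 8) | (b[i + 3] & 0xFF);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // well-formed pair: combine into a supplementary code point
                out[n++] = 0x10000 + ((unit - 0xD800) << 10)
                                   + (next - 0xDC00);
                i += 2;
                continue;
            }
        }
        // unpaired high or low surrogate: substitute U+FFFD
        out[n++] = (unit >= 0xD800 && unit <= 0xDFFF) ? 0xFFFD : unit;
    }
    return java.util.Arrays.copyOf(out, n);
}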

> code points: U+DC00 U+10000 U+D800 U+FFFF
>
> It does not round-trip the precise codepoints, although it does
> round-trip the meaning of the mal-formed text (a mal-formed pair of
> surrogate codepoints having no other possible interpretation).

See, but what it's really doing is taking two "wrongs" (unpaired
surrogates that happen to appear next to each other) and turning them
into one "right" (a Plane 1 character). Remember I started with the
four non-characters U+D800 U+DC00 U+D800 U+DC00, and we just converted
two of them to a real character, U+10000.
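
The arithmetic makes that collision unavoidable. Here is a sketch of
code-point-to-UTF-16 encoding (the surrogate-pair formula is the
standard one; the function name is mine):

// Sketch only.  Any code point below U+10000 -- including the surrogate
// code points D800-DFFF -- maps to a single code unit, and that unit is
// indistinguishable from half of a pair.
static char[] encodeUTF16(int cp) {
    if (cp < 0x10000)
        return new char[] { (char) cp };
    int v = cp - 0x10000;
    return new char[] { (char) (0xD800 + (v >> 10)),     // lead (high)
                        (char) (0xDC00 + (v & 0x3FF)) };  // trail (low)
}

// encodeUTF16(0xD800) followed by encodeUTF16(0xDC00) yields exactly the
// same two code units as encodeUTF16(0x10000), so a decoder cannot tell
// the two "wrongs" apart from the one "right".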

> Let's step back to the original goal, which is to preserve absolute
> data fidelity when the source is UTF-16 (no matter what the contents,
> even mal-formed).

True only if UTF-16 is the ultimate, canonical format of Unicode, and
other formats such as UTF-8 and UTF-32 are subsidiary to it somehow.

> When you go from mal-formed UTF-16 to codepoints and back, you get
> identical text.
>
> UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF
> to code points: U+DC00 U+10000 U+D800 U+FFFF
> to UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF
>
> The same is true even when going through other UTFs
>
> UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF
> to code points: U+DC00 U+10000 U+D800 U+FFFF
> to UTF-8: ED B0 80 F0 90 80 80 ED A0 80 EF BF BF
> to code points: U+DC00 U+10000 U+D800 U+FFFF
> to UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF

But it doesn't work the other way, i.e. from UTF-8 to UTF-16 and back
to UTF-8:

UTF-8: ED B0 80 ED A0 80 ED B0 80 ED A0 80 EF BF BF
to code points: U+DC00 U+D800 U+DC00 U+D800 U+FFFF
to UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF
to code points: U+DC00 U+10000 U+D800 U+FFFF
to UTF-8: ED B0 80 F0 90 80 80 ED A0 80 EF BF BF

Here UTF-16 is responsible for the data corruption, and UTF-8 is
complicit by allowing the encoding of *both* U+D800 U+DC00 *and*
U+10000, which UTF-16 does not permit.
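
The "complicit" part is easy to see in a UTF-8 encoder that, like the
converter in Mark's walkthrough, will serialize any code point it is
handed. This is only a sketch (the name is mine), but the bit layout is
the ordinary UTF-8 one:

// Sketch only: encode one code point, U+0000..U+10FFFF, as UTF-8 bytes,
// with no check for surrogates.
static byte[] encodeUTF8(int cp) {
    if (cp < 0x80)
        return new byte[] { (byte) cp };
    if (cp < 0x800)
        return new byte[] { (byte) (0xC0 | (cp >> 6)),
                            (byte) (0x80 | (cp & 0x3F)) };
    if (cp < 0x10000)   // note: D800-DFFF fall through here
        return new byte[] { (byte) (0xE0 | (cp >> 12)),
                            (byte) (0x80 | ((cp >> 6) & 0x3F)),
                            (byte) (0x80 | (cp & 0x3F)) };
    return new byte[] { (byte) (0xF0 | (cp >> 18)),
                        (byte) (0x80 | ((cp >> 12) & 0x3F)),
                        (byte) (0x80 | ((cp >> 6) & 0x3F)),
                        (byte) (0x80 | (cp & 0x3F)) };
}

// encodeUTF8(0xD800) -> ED A0 80, encodeUTF8(0xDC00) -> ED B0 80, and
// encodeUTF8(0x10000) -> F0 90 80 80, so <U+D800, U+DC00> and <U+10000>
// get different byte sequences in UTF-8 -- UTF-16 gives them the same one.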

The same thing happens if you convert from UTF-32 to UTF-16 and back
to UTF-32.

> The wording on page 46 should be improved to make this all clear.

I still think UTF-16 does not even conform to the wording on page 46.

> As to your point about conformance, this *is* a bit tricky. However,
> I can always have extra machinery on top of any Unicode-conformant
> process. I could, for example, have a function that (optionally)
> caught and threw an exception any time a piece of text contained the
> letters "macchiato", but otherwise converted it to UTF-8. (Say because
> I was particularly interested in that variety of espresso drink.) That
> function looks like the following pseudocode:
>
> byte[] cvtUTF8WithOption(char[] text, boolean check) {
>     if (check && contains(text, "macchiato")) throw exception;
>     return cvtUTF8(text);
> }
>
> Similarly, I could have a collation routine that checked for any
> character that was not in a particular range (my supported subset),
> and threw an exception to let the programmer know that those
> characters were not supported.
>
> I don't believe that having such functions makes my program non-
> conformant.

Well, this part is all my fault for missing the earlier distinction
between the conformance requirements of a UTF and those of a specific
converter for that UTF. I think we all agree that a UTF that threw an
exception on encountering the string "macchiato" would be very non-
conformant. Does this exonerate my UTF converters which throw errors
in response to unpaired surrogates?
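
In the spirit of Mark's pseudocode, mine look something like the sketch
below. The name and the choice of exception are my own; only the
surrogate check matters:

// Sketch only: refuse to convert text containing an unpaired surrogate,
// then hand the (now well-formed) text to an ordinary UTF-8 encoder.
static byte[] cvtUTF8Strict(char[] text) {
    for (int i = 0; i < text.length; i++) {
        char c = text[i];
        boolean high = (c >= 0xD800 && c <= 0xDBFF);
        boolean low  = (c >= 0xDC00 && c <= 0xDFFF);
        if (high && i + 1 < text.length
                && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            i++;  // a well-formed pair: skip its low half
        } else if (high || low) {
            throw new IllegalArgumentException("unpaired surrogate");
        }
    }
    // safe to use a stock encoder now that no unpaired surrogates remain
    return new String(text)
            .getBytes(java.nio.charset.StandardCharsets.UTF_8);
}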

In any case, I now have a better understanding of definition D29: any
UTF *except for UTF-16* must be able to encode any possible combination
of Unicode scalar values from U+0000 to U+10FFFF inclusive. I am not
sure I concur that other UTFs must be held to a standard that UTF-16
does not support, but that's the way it is. See, I did learn something
new about Unicode after all.

-Doug Ewell
 Fullerton, California


