Re: Encoding of non-characters

From: Mark Davis (markdavis@ispchannel.com)
Date: Sat Jul 29 2000 - 14:38:19 EDT


You have a good point.

Let me repeat your example. A UTF-16BE converter takes the codepoints (aka
scalar values)

code points: U+DC00 U+D800 U+DC00 U+D800 U+FFFF

converts them to bytes:

UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF

When it converts back, it gets

code points: U+DC00 U+10000 U+D800 U+FFFF

It does not round-trip the precise codepoints, although it does round-trip
the meaning of the mal-formed text (a mal-formed pair of surrogate
codepoints having no other possible interpretation).

Let's step back to the original goal, which is to preserve absolute data
fidelity when the source is UTF-16 (no matter what the contents, even
mal-formed). When you go from mal-formed UTF-16 to codepoints and back, you
get identical text.

UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF
to code points: U+DC00 U+10000 U+D800 U+FFFF
to UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF
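
The two directions above can be sketched in Java. This is only a minimal illustration, not any particular library's converter: the class name Utf16RoundTrip and its methods are hypothetical, and the decoder is assumed to pass unpaired surrogates through unchanged rather than reject them.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class Utf16RoundTrip {
    // Encode code points as UTF-16BE bytes. Surrogate code points below
    // U+10000 are written through as single 16-bit units, unchanged.
    static byte[] encode(int[] codePoints) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int cp : codePoints) {
            if (cp >= 0x10000) {
                int v = cp - 0x10000;
                put(out, 0xD800 | (v >> 10));   // high surrogate
                put(out, 0xDC00 | (v & 0x3FF)); // low surrogate
            } else {
                put(out, cp);
            }
        }
        return out.toByteArray();
    }

    // Write one 16-bit unit, most significant byte first.
    static void put(ByteArrayOutputStream out, int unit) {
        out.write(unit >> 8);
        out.write(unit & 0xFF);
    }

    // Decode UTF-16BE bytes. A high surrogate immediately followed by a low
    // surrogate becomes one supplementary code point; any other surrogate
    // is passed through as an unpaired code point.
    static int[] decode(byte[] bytes) {
        int[] cps = new int[bytes.length / 2];
        int n = 0;
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            int unit = ((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF);
            if (unit >= 0xD800 && unit <= 0xDBFF && i + 3 < bytes.length) {
                int next = ((bytes[i + 2] & 0xFF) << 8) | (bytes[i + 3] & 0xFF);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    unit = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                    i += 2; // the low surrogate has been consumed as well
                }
            }
            cps[n++] = unit;
        }
        return Arrays.copyOf(cps, n);
    }
}
```

Starting from code points, encode followed by decode changes the sequence (five code points become four). Starting from bytes, decode followed by encode reproduces the bytes exactly.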

The same is true even when going through other UTFs:

UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF
to code points: U+DC00 U+10000 U+D800 U+FFFF
to UTF-8: ED B0 80 F0 90 80 80 ED A0 80 EF BF BF
to code points: U+DC00 U+10000 U+D800 U+FFFF
to UTF-16BE: DC 00 D8 00 DC 00 D8 00 FF FF
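
The UTF-8 step in this chain can be sketched the same way. Again this is an illustration under stated assumptions: LenientUtf8 is a hypothetical name, the encoder deliberately writes surrogate code points as ordinary three-byte sequences (which a strict converter might reject), and the decoder does no validation and expects input produced by the encoder.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class LenientUtf8 {
    // Encode every code point in 0..10FFFF, surrogates included, using the
    // ordinary UTF-8 bit layout for its range.
    static byte[] encode(int[] codePoints) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int cp : codePoints) {
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                // U+D800..U+DFFF land here and are not treated specially.
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Decode bytes produced by encode(). No validation is performed.
    static int[] decode(byte[] b) {
        int[] cps = new int[b.length];
        int n = 0;
        int i = 0;
        while (i < b.length) {
            int lead = b[i] & 0xFF;
            int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            // Keep only the payload bits of the lead byte.
            int cp = (len == 1) ? lead : lead & (0xFF >> (len + 1));
            for (int k = 1; k < len; k++) {
                cp = (cp << 6) | (b[i + k] & 0x3F);
            }
            cps[n++] = cp;
            i += len;
        }
        return Arrays.copyOf(cps, n);
    }
}
```

Because no code point in the range is special-cased, the code points recovered from the UTF-8 bytes are the same ones that went in, so the final conversion back to UTF-16BE yields the original bytes.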

The wording on page 46 should be improved to make this all clear.

As to your point about conformance, this *is* a bit tricky. However, I can
always have extra machinery on top of any Unicode-conformant process. I
could, for example, have a function that (optionally) threw an exception
any time a piece of text contained the letters "macchiato", but otherwise
converted it to UTF-8. (Say because I was particularly interested in that
variety of espresso drink.) That function looks like the following
pseudocode:

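byte[] cvtUTF8WithOption(char[] text, boolean check) {
    // Optional extra check layered on top of the conformant conversion.
    if (check && new String(text).contains("macchiato")) {
        throw new IllegalArgumentException("text contains \"macchiato\"");
    }
    return cvtUTF8(text); // the underlying conformant converter
}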

Similarly, I could have a collation routine that checked for any character
that was not in a particular range (my supported subset), and threw an
exception to let the programmer know that those characters were not
supported.
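
A checked collation routine of that kind might look like the sketch below. Everything specific in it is an assumption made for illustration: the class name SubsetCheck, the supported range U+0000..U+017F, and the use of java.text.Collator with the root locale for the comparison itself.

```java
import java.text.Collator;
import java.util.Locale;

class SubsetCheck {
    // Hypothetical supported subset, chosen only for this example.
    static final int LOW = 0x0000;
    static final int HIGH = 0x017F;

    // Throw if any code point of the text falls outside the supported range.
    static void requireSupported(String text) {
        text.codePoints().forEach(cp -> {
            if (cp < LOW || cp > HIGH) {
                throw new IllegalArgumentException(
                        String.format("unsupported character U+%04X", cp));
            }
        });
    }

    // Compare two strings, but only after both have passed the subset check.
    static int compareChecked(String a, String b) {
        requireSupported(a);
        requireSupported(b);
        return Collator.getInstance(Locale.ROOT).compare(a, b);
    }
}
```

The check sits in front of the comparison and does not alter how supported text is collated.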

I don't believe that having such functions makes my program non-conformant.

Mark

Doug Ewell wrote:

> Mark Davis <markdavis@ispchannel.com> wrote:
>
> > Here is the issue. Because of the prevalence of UTF-16, and to
> > preserve the round-tripping of UTFs to and from UTF-16 (even UTF-16
> > containing mal-formed text containing non-characters and/or unpaired
> > surrogates), a UTF must always roundtrip all codepoints between 0 and
> > 10FFFF, inclusive.
>
> Wait, now I'm lost. It was precisely *because* of UTF-16 that I thought
> it was OK not to round-trip U+D800 through U+DFFF. After all, this is
> a characteristic of UTF-16 itself. For example, it cannot round-trip
> the following illegal sequence of four UCS-2 (pre-UTF-16) code points:
>
> U+DC00 U+D800 U+DC00 U+D800
>
> UTF-16 would regard this as the unpaired low surrogate U+DC00, followed
> by the perfectly legal U+10000, followed by the unpaired high surrogate
> U+D800. If I really intended to have four unpaired surrogates, I can't
> use UTF-16 to represent them.
>
> > It is of course permissible for a UTF converter to offer an option to
> > detect and throw an error on any mal-formed text.
>
> Then is it a conformance requirement to round-trip malformed text
> (including illegal Unicode code points), or isn't it?
>
> -Doug Ewell
> Fullerton, California


