Mark Davis <mark at macchiato dot com> wrote:
> The UTC has decided to make scalar value mean unambiguously the
> code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points. While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
They are not legal in UTF-16 unless you believe that the two code points
(0xD800, 0xDC00) are fundamentally equivalent to the single code point
0x10000 -- that is, unless you believe Unicode *is* UTF-16.
UTF-16 does not allow the representation of an unpaired surrogate 0xD800
followed by another, coincidental unpaired surrogate 0xDC00. (It maps
the two to U+10000.) Among the standard UTFs, only UTF-32 allows the
two to be treated as unpaired surrogates. In fact, before UTF-8 was
"tightened up" in 3.2, the only UTF that DID NOT permit these two
coincidental unpaired surrogates was UTF-16.
    UTF-8:  D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
    UTF-32: D800 DC00 <==> 0000D800 0000DC00
  - but -
    UTF-16: D800 DC00  ==> D800 DC00 ==> 10000
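
For anyone who wants to check the three cases by hand, here is a minimal Python sketch. It assumes Python 3.4 or later and relies on the codec error handler 'surrogatepass', which is Python's own opt-in for letting surrogate code points through; it stands in here for the pre-3.2 UTF-8 behavior and is not something the Unicode Standard defines. The names pair, combine, and code_points are invented for this illustration.

```python
# Two code points that happen to be a high surrogate and a low surrogate.
pair = "\ud800\udc00"


def combine(high, low):
    """Surrogate-pair arithmetic: map (high, low) to a supplementary code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def code_points(s):
    """Return the code points of a string as a list of integers."""
    return [ord(c) for c in s]


# Assumption: 'surrogatepass' approximates a "loose" encoder that does not
# reject surrogate code points. Big-endian forms are used to avoid a BOM.
utf8 = pair.encode("utf-8", "surrogatepass")
utf32 = pair.encode("utf-32-be", "surrogatepass")
utf16 = pair.encode("utf-16-be", "surrogatepass")

# Decode each byte sequence again with the same loose handler.
back8 = code_points(utf8.decode("utf-8", "surrogatepass"))
back32 = code_points(utf32.decode("utf-32-be", "surrogatepass"))
back16 = code_points(utf16.decode("utf-16-be", "surrogatepass"))

# A strict UTF-8 encoder (the post-3.2 rule) refuses the surrogates outright.
try:
    pair.encode("utf-8")
    strict_utf8_accepts = True
except UnicodeEncodeError:
    strict_utf8_accepts = False

print("UTF-8 :", utf8.hex(), [hex(cp) for cp in back8])
print("UTF-32:", utf32.hex(), [hex(cp) for cp in back32])
print("UTF-16:", utf16.hex(), [hex(cp) for cp in back16])
print("strict UTF-8 accepts the pair:", strict_utf8_accepts)
print("combine(0xD800, 0xDC00) =", hex(combine(0xD800, 0xDC00)))
```

Only the UTF-16 round trip changes the sequence: two code points go in and one comes out, which is the asymmetry described above.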
> Ken is pushing for this change; I believe it would be a very bad idea.
> (I think the reasons have already appeared on this list, so I am not
> trying to reopen the discussion; just state the current situation.)
I don't recall seeing the reasons conclusively discussed on this list;
I'd be happy to hear them again. I've been complaining about the
paragraph after D29 for two years now.