Re: Abstract character?

From: Mark Davis
Date: Wed Jul 24 2002 - 11:19:42 EDT

I disagree with Ken, but don't have time now to write a lengthy
reply. I'll try to get to that soon.

◄ “Eppur si muove” ►

----- Original Message -----
From: "Doug Ewell"
Cc: "Kenneth Whistler"
Sent: Tuesday, July 23, 2002 19:44
Subject: Re: Abstract character?

> Kenneth Whistler <kenw at sybase dot com> wrote:
> >> UTF-16 does not allow the representation of an unpaired surrogate
> >> 0xD800 followed by another, coincidental unpaired surrogate
> >> 0xDC00. (It maps the two to U+10000.) Among the standard UTFs, only
> >> UTF-32 allows the two to be treated as unpaired surrogates.
> >
> > Actually, not that, either.
> >
> >> In fact, before UTF-8 was
> >> "tightened up" in 3.2, the only UTF that DID NOT permit these two
> >> coincidental unpaired surrogates was UTF-16.
> >>
> >> UTF-8: D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
> >> UTF-32: D800 DC00 <==> 0000D800 0000DC00
> >
> > This is ill-formed in UTF-32, and thereby, illegal.
>
> I'm glad to hear that unpaired surrogates are now also illegal in
> UTF-32, and presumably also in UTF-16. However, I did do my homework
> before writing yesterday's post, and that wasn't the impression I got,
> so I sense another opportunity to tighten up the definitions before
> Unicode 4.0 is released.
>
> In UAX #28, "Unicode 3.2," the section on "Elimination of Irregular
> Sequences" starts out talking about "transformation formats such as
> UTF-8." However, the rest of the section deals exclusively with
> UTF-8; UTF-16 and UTF-32 are not mentioned.
>
> UAX #19, "UTF-32" (written by Mark) is listed in the header block as
> having been updated to Unicode 3.2, but it does not state anywhere that
> unpaired surrogates are illegal. In particular, the following passages
> from UAX #19 led me to believe that all code points, from 0x0000 to
> 0x10FFFF inclusive, are legal in UTF-32:
>
> "UTF-32 is restricted in values to the range 0..10FFFF₁₆,
> which precisely matches the range of characters defined in the Unicode
> Standard (and other standards such as XML), and those representable by
> UTF-8 and UTF-16."
>
> "(b) An illegal UTF-32 code unit sequence is any byte sequence that
> would correspond to a numeric value outside of the range 0 to
> 10FFFF₁₆."
>
> "(c) An irregular UTF-32 code unit sequence is an eight-byte sequence,
> where the first four bytes correspond to a high surrogate, and the next
> four bytes correspond to a low surrogate. As a consequence of C12,
> irregular UTF-32 sequences shall not be generated by a conformant
> process."
>
> I suggest that the Unicode 4.0 text specifically state, in explicit
> terms, which code points are and are not valid in UTF-8, UTF-16, and
> UTF-32. And if it is true that the surrogate code points 0xD800 through
> 0xDFFF are illegal in UTF-32, then I suggest that UAX #19 be revised to
> state this unambiguously.
>
> -Doug Ewell
> Fullerton, California
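
The byte values traded in the quoted thread can be reproduced with a short
Python sketch. It is only an illustration under stated assumptions: the
function names are hypothetical, three_byte_pattern deliberately skips the
surrogate check so the pre-3.2 byte sequence can be shown, and
is_scalar_value encodes the position Ken describes (surrogate code points
are not valid in any UTF) rather than wording taken from UAX #19 or UAX #28.

```python
HIGH = range(0xD800, 0xDC00)  # high (leading) surrogate code points
LOW = range(0xDC00, 0xE000)   # low (trailing) surrogate code points


def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the supplementary code point it encodes."""
    if high not in HIGH or low not in LOW:
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def three_byte_pattern(cp):
    """Apply the UTF-8 three-byte bit layout WITHOUT rejecting surrogates.

    Illustration only: for a surrogate input the result is the sequence the
    thread calls "no longer legal", not well-formed UTF-8.
    """
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("outside the three-byte range")
    return bytes([
        0xE0 | (cp >> 12),          # 1110xxxx
        0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx
        0x80 | (cp & 0x3F),         # 10xxxxxx
    ])


def is_scalar_value(cp):
    """True for code points in 0..10FFFF that are not surrogates (assumed rule)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def python_utf8_accepts(data):
    """Report whether Python 3's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(hex(combine_surrogates(0xD800, 0xDC00)))
    print((three_byte_pattern(0xD800) + three_byte_pattern(0xDC00)).hex(" "))
    print(python_utf8_accepts(bytes.fromhex("EDA080EDB080")))
```

Here combine_surrogates(0xD800, 0xDC00) gives U+10000, which is the mapping
Doug describes for UTF-16, and the six bytes ED A0 80 ED B0 80 are the
sequence that a strict decoder now rejects.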
