Re: Abstract character?

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jul 23 2002 - 18:10:23 EDT


Following up on several responses on this thread.

Mark Davis said:

> A small correction to Ken's message:
>
> > The Unicode scalar value
> > definitionally excludes D800..DFFF, which are only code unit
> > values used in UTF-16, and which are not code points associated
> > with any well-formed UTF code unit sequences.
>
> The UTC has decided to make scalar value mean unambiguously the
> code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points.

Correct.
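
To make the two ranges concrete, here is a minimal sketch in Python. The
function names are made up for illustration; they are not taken from the
standard or from any library.

```python
# Hypothetical helper names, for illustration only.

def is_code_point(cp: int) -> bool:
    """Any integer in the Unicode codespace 0..10FFFF."""
    return 0 <= cp <= 0x10FFFF

def is_surrogate_code_point(cp: int) -> bool:
    """The range D800..DFFF."""
    return 0xD800 <= cp <= 0xDFFF

def is_scalar_value(cp: int) -> bool:
    """0000..D7FF and E000..10FFFF: every code point except the surrogates."""
    return is_code_point(cp) and not is_surrogate_code_point(cp)

# Count of scalar values: the whole codespace minus 0x800 surrogates.
scalar_count = sum(1 for cp in range(0x110000) if is_scalar_value(cp))
```

Under these definitions, scalar_count comes out as 0x110000 - 0x800.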

> While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
> UTF-16.

Not to pick nits here... oh well, o.k., I'll pick nits.

I stated that "D800..DFFF ... are not code points associated
with any well-formed UTF code unit sequence". I believe, as stated,
that that is correct. An isolated surrogate in UTF-16 is *not*
a well-formed UTF code unit sequence. Even by the disputed text
of Unicode 3.0, an isolated surrogate code unit in UTF-16 would
be an "irregular code value sequence".

And with the updated relevant text in Unicode 3.2, I think
there is even less wiggle-room. The last vestige of "irregular
code unit sequence" vanished in Unicode 3.2 when the loophole for
UTF-8 was closed. The Unicode 3.2 standard now reads:

"Terminology to distinguish ill-formed, illegal, and irregular
code unit sequences is no longer needed. There are no irregular
code unit sequences, and thus all ill-formed code unit sequences
are illegal. It is illegal to emit or interpret any ill-formed
code unit sequence. Unicode 4.0 will revise the terminology
and conformance clauses in light of this."
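
As one data point (an illustration of one implementation, not a statement
about the standard), the built-in codecs in Python 3.4 and later refuse to
emit an isolated surrogate in any of the three UTFs. A minimal sketch,
where the helper name `can_encode` is hypothetical:

```python
# A Python str can hold a lone surrogate code point, but the UTF codecs
# (Python 3.4 or later assumed) will not encode it.
lone = "\ud800"

def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` encodes without error in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

results = {enc: can_encode(lone, enc)
           for enc in ("utf-8", "utf-16-be", "utf-32-be")}
```

All three entries in `results` are False for the lone surrogate, while an
ordinary string such as "A" encodes in each of them.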

>
> Ken is pushing for this change; I believe it would be a very bad idea.

I believe it is a worse idea to carry forward the claim that
(isolated) surrogate code points cannot be represented in
UTF-8 (as is definitely the case for Unicode 3.2) while they
can be represented in UTF-16.

> (I think the reasons have already appeared on this list, so I am not
> trying to reopen the discussion; just state the current situation.)

Doug Ewell followed up:

> UTF-16 does not allow the representation of an unpaired surrogate 0xD800
> followed by another, coincidental unpaired surrogate 0xDC00. (It maps
> the two to U+10000.) Among the standard UTFs, only UTF-32 allows the
> two to be treated as unpaired surrogates.

Actually, UTF-32 does not allow that, either.

> In fact, before UTF-8 was
> "tightened up" in 3.2, the only UTF that DID NOT permit these two
> coincidental unpaired surrogates was UTF-16.
>
> UTF-8: D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
> UTF-32: D800 DC00 <==> 0000D800 0000DC00

This is ill-formed in UTF-32, and thereby, illegal.

> - but -
> UTF-16: D800 DC00 ==> D800 DC00 ==> 10000
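
The three byte sequences in Doug's example can be checked against a decoder
that follows the tightened rules. The sketch below uses Python's built-in
codecs (Python 3.4 or later assumed) purely as an example implementation;
`try_decode` is a hypothetical helper.

```python
def try_decode(hex_bytes: str, encoding: str):
    """Decode a hex string; return None if the bytes are rejected."""
    try:
        return bytes.fromhex(hex_bytes).decode(encoding)
    except UnicodeDecodeError:
        return None

# The same "D800 DC00" pair, expressed in each encoding form.
utf8_result = try_decode("ED A0 80 ED B0 80", "utf-8")       # rejected
utf32_result = try_decode("0000D800 0000DC00", "utf-32-be")  # rejected
utf16_result = try_decode("D800 DC00", "utf-16-be")          # one character
```

Here the UTF-8 and UTF-32 inputs are rejected, and the UTF-16 input decodes
to the single supplementary character U+10000, not to two surrogates.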

David Hopwood responded:

> I think it would be a mistake for the standard to refer to "surrogate
> code points".

I think this was already definitively decided by the UTC.

> The term "code point" is used for other CCS's where there
> may also be gaps in the code space; in that case, the gaps are not
> considered valid code points.

I am sympathetic with this point of view, but it isn't easy to draw
such a line in practice. Look at the various Asian DBCS sets -- they
often had ranges of byte values that were considered invalid as
parts of encoded characters, and if you mapped them out to an integral
space, you would end up with ranges of integers that were invalid as
code points. But when push came to shove, various of these encodings
just appropriated some of these ranges to extend themselves, and
filled them with more characters. What was an invalid code point
became a valid (and assigned) code point.

> When 0xD800..0xDFFF are used in UTF-16,
> they are used as code units, not code points. As Unicode code points,
> 0xD800..0xDFFF are (or at least should be) invalid in the same sense
> that 0x110000 is.

As Unicode code points they are invalid in a different sense than
0x110000 is, actually. 0x110000 could, by the integral transforms
involved, be represented by UTF-8 or by UTF-32, but not by UTF-16.
0xD800 could, in principle, be represented by UTF-16, if you
allowed the range, but is ruled to be ill-formed in all three
UTF's, to avoid the kinds of irregular sequences that the UTC was
just at pains to eliminate.
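
The difference can be seen by applying the raw bit transforms with the range
checks deliberately left out. This is only a sketch of the arithmetic, with
hypothetical function names; none of these outputs for 0x110000 is
well-formed in any UTF.

```python
def raw_utf8_4byte(cp: int) -> bytes:
    """Apply the 4-byte UTF-8 bit layout with no range check."""
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

def raw_utf32(cp: int) -> bytes:
    """Store the integer as one big-endian 32-bit code unit, no range check."""
    return cp.to_bytes(4, "big")

def raw_utf16_pair(cp: int):
    """Apply the surrogate-pair arithmetic with no range check."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF

# 0x110000 fits the UTF-8 and UTF-32 bit layouts...
beyond_utf8 = raw_utf8_4byte(0x110000)    # F4 90 80 80
beyond_utf32 = raw_utf32(0x110000)        # 00 11 00 00
# ...but the UTF-16 arithmetic overflows the high-surrogate range.
beyond_high, beyond_low = raw_utf16_pair(0x110000)
# For comparison, the last real code point stays in range.
last_high, last_low = raw_utf16_pair(0x10FFFF)
```

For 0x110000 the computed "high" unit is 0xDC00, which is a low surrogate,
so there is no surrogate pair that could stand for it. For 0x10FFFF the
pair is DBFF DFFF, as expected.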

>
> I.e. IMHO "Unicode scalar value" and "Unicode code point" should be
> synonyms, with the set of valid values 0..0xD7FF, 0xE000..0x10FFFF.

I think the distinction in ranges is a useful one, since it allows
for a bijective definition of the UTF's, based on the Unicode scalar
value, but it also gives a meaning to the complete integral range
for the code points, as demanded by some of the implementers.

> "code point" should be defined as an integer corresponding to an
> encoded character in any CCS, not just Unicode.

This doesn't really work, since it doesn't account for the
unassigned (reserved) code points, nor the noncharacters.
The Unicode architecture for its codespace is more complex than
any other CCS, precisely because the encoding is more complex:
only Unicode has three bijective encoding forms, and only Unicode
has noncharacters. These need to be taken into account.

> The integers 0xD800..0xDFFF are legal *as code units* in UTF-16. IMHO
> allowing them as code points (i.e. allowing any process to conformantly
> generate unpaired surrogates) is a really bad idea.

This I agree with.

> The set of code
> point sequences that are validly representable in each UTF should be
> identical (which ensures that mappings between UTFs are bijective and
> always succeed iff the input is valid in the source UTF).

I also think this is of paramount importance.
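
A small round-trip check illustrates the property. The sketch below again
leans on Python's codecs as an example implementation, and the sample list
and helper names are arbitrary choices for illustration.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes in one encoding form to another via Python's codecs."""
    return data.decode(src).encode(dst)

def round_trips(cp: int) -> bool:
    """UTF-8 -> UTF-16 -> UTF-32 -> UTF-8 returns the original bytes."""
    u8 = chr(cp).encode("utf-8")
    u16 = transcode(u8, "utf-8", "utf-16-be")
    u32 = transcode(u16, "utf-16-be", "utf-32-be")
    return transcode(u32, "utf-32-be", "utf-8") == u8

# Scalar values at the edges of each range, plus a few in between.
samples = [0x0000, 0x0041, 0x07FF, 0xD7FF, 0xE000, 0xFFFD, 0x10000, 0x10FFFF]
all_ok = all(round_trips(cp) for cp in samples)
```

Every sample scalar value survives the trip unchanged. A surrogate code
point never gets that far, since the first encode step already fails.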

> I.e. U+D800..DFFF, like U+110000, should be undesignated and
> unrepresentable.

However, you can't go quite this far. As Markus pointed out, code points
themselves may have properties -- even code points which cannot, in
principle, be assigned to characters. And there are already existing
APIs which handle these code points. Their function is clearly
*designated* by the standard, normatively; that, however, is different
from saying that an abstract character could ever be assigned to them.
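
Python's `unicodedata` module is one example of such an API: it reports a
General_Category for surrogate code points and for noncharacters even though
no abstract character can be assigned to them. A minimal sketch:

```python
import unicodedata

# General_Category lookups for code points that can never hold a character.
surrogate_category = unicodedata.category(chr(0xD800))      # a surrogate
noncharacter_category = unicodedata.category(chr(0xFDD0))   # a noncharacter
letter_category = unicodedata.category("A")                 # for comparison
```

The surrogate reports "Cs" and the noncharacter reports "Cn", next to "Lu"
for an ordinary uppercase letter.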

--Ken


