Re: terminology

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 07 2002 - 16:48:18 EDT


Doug Ewell asked:

> Kenneth Whistler <kenw@sybase.com> wrote:
>
> > "Unicode scalar value" was also up for lengthy discussion.
>
> What could there possibly be about "Unicode scalar value" to discuss at
> length? I think that is one of the clearest and least ambiguous
> concepts in the whole Unicode glossary.

Unfortunately, that is not the case. There is a serious disagreement
about the term.

The glossary states:

Unicode Scalar Value: A number N from 0 to 10FFFF₁₆ defined by
application of the algorithm in Definition D28. Also known as a code point.

But as a number of people have pointed out, there are some logical
inconsistencies when you try to work out the details of D28 and the
transformations defined in Section 3.8 (and in 10646). In particular,
the algorithm in D28 defines N = U for nonsurrogate values, and
N = ( H - 0xD800 ) * 0x400 + ( L - 0xDC00 ) + 0x10000 for surrogate
pairs <H, L>. But from this you cannot get numeric values in the
range 0xD800..0xDFFF -- which I believe is as it should be.
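
To make the gap concrete, here is a minimal sketch of the D28 mapping as it is described above, written in Python. The function name `scalar_value` and the choice to raise `ValueError` for an isolated surrogate are assumptions made for illustration; D28 as quoted simply does not assign such a value any N.

```python
def scalar_value(units):
    """Map one nonsurrogate value U, or one surrogate pair <H, L>, to N.

    Hypothetical helper: follows the two cases quoted from D28 above.
    """
    if len(units) == 1:
        u = units[0]
        if 0xD800 <= u <= 0xDFFF:
            # Assumption: an isolated surrogate is rejected, since the
            # quoted algorithm only covers nonsurrogates and pairs.
            raise ValueError("isolated surrogate: no N defined")
        return u  # N = U for nonsurrogate values
    if len(units) != 2:
        raise ValueError("expected one value or one <H, L> pair")
    h, l = units
    if not (0xD800 <= h <= 0xDBFF and 0xDC00 <= l <= 0xDFFF):
        raise ValueError("not a well-formed <H, L> surrogate pair")
    # N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000
    return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000


# The smallest and largest pairs bound everything a pair can produce.
lowest = scalar_value([0xD800, 0xDC00])   # 0x10000
highest = scalar_value([0xDBFF, 0xDFFF])  # 0x10FFFF
```

Since every pair maps to a value at or above 0x10000, and the single-value case excludes surrogates by construction, no input yields an N in 0xD800..0xDFFF.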

So the bone of contention is whether Unicode Scalar Value should be
defined as equivalent to "code point", as in the current glossary,
or should be defined as equivalent to "nonsurrogate code point", which
is more consistent with the character encoding model and the definition
of the UTF-16 encoding form. The latter, by the way, is the consensus
that was reached at the UTC meeting last week.
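
The two candidate definitions can be contrasted as simple predicates. This is only a sketch in Python; the names `is_code_point` and `is_scalar_value` are hypothetical and do not come from the standard.

```python
SURROGATES = range(0xD800, 0xE000)  # 0xD800..0xDFFF inclusive


def is_code_point(n):
    """Any integer in the Unicode codespace, 0..0x10FFFF."""
    return 0 <= n <= 0x10FFFF


def is_scalar_value(n):
    """Scalar value read as "nonsurrogate code point"."""
    return is_code_point(n) and n not in SURROGATES
```

Under the glossary wording the two predicates would be identical; under the "nonsurrogate code point" reading they differ on exactly the 0x800 surrogate values.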

--Ken


