RE: Chapter on character sets

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jun 15 2000 - 22:33:50 EDT


Mike Brown continued this discussion with the comment:

> It would be infinitely easier to explain and understand if the standards
> assigned characters directly to scalar values and then provided the encoding
> forms and schemes as the means of codifying the values into
> computer-friendly code value sequences. It seems rather convoluted for the
> primary assignments to be made to UTF-16 code value sequences, and
> mentioning scalar values almost as an afterthought. "Oh yeah, you can derive
> scalar values from these code value sequences. *snort*"

This appearance is the unfortunate result of the Unicode Standard having
had to evolve over the last decade from a simple 16-bit character encoding
standard into one built around UTF-16. The textual changes that this
entailed have had to be worked out progressively in a very large book text,
and that text has a rather substantial "inertia" to it.

The explanation of UTF-16, UTF-8, and the Unicode scalar values was not
very well worked out in Unicode 2.0. It got a major overhaul in Unicode 3.0,
but for someone coming at the standard fresh, the relation between the UTF-16
encoding form and the Unicode scalar values looks turned on its head in
the standard. Well, in a certain sense it is -- we (the editors) acknowledge
that. This is because it was deemed to be the lesser of two evils to
try to minimize the structural changes to the important Conformance
chapter of the standard (normative and containing language very difficult
to change in major ways without introducing inconsistencies and engendering
large-scale complaints from the implementing members of the Consortium).

However, as we roll out Unicode 3.1, containing many new characters for
Plane 1 and Plane 2 (and those pesky tag characters on Plane 14), you
will see that the defining data files *will* be referring to characters
by their scalar values, rather than by the surrogate pairs required for
representing them in the UTF-16 encoding form. Also, when the next
printed version of the standard rolls around for Unicode 4.0, characters
will be discussed and referenced by their scalar values, rather than
by surrogate pairs. This is just obviously easier for everyone -- whether
using the Unicode Standard on its own merits or when matching it against
the ISO/IEC 10646 edition, whenever it is republished with all the new
additions on other planes.
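
To make the relationship concrete, here is a minimal Python sketch of the
arithmetic that ties a supplementary-plane scalar value to its UTF-16
surrogate pair. The function names are hypothetical, chosen only for this
illustration; the constants are the standard surrogate ranges
(D800..DBFF for high surrogates, DC00..DFFF for low surrogates).

```python
from typing import Tuple


def to_surrogate_pair(scalar: int) -> Tuple[int, int]:
    """Return the UTF-16 (high, low) surrogate pair for a scalar value.

    Hypothetical helper for illustration; only scalar values on the
    supplementary planes (U+10000..U+10FFFF) need a surrogate pair.
    """
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("scalar value is not on a supplementary plane")
    offset = scalar - 0x10000           # 20-bit offset into Planes 1..16
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recover the scalar value from a well-formed surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+E0001 is a tag character on Plane 14.
high, low = to_surrogate_pair(0xE0001)
print("U+E0001 -> %04X %04X" % (high, low))
```

Referring to the character as U+E0001 names it once, independent of any
encoding form; the pair of code values above is simply how UTF-16
happens to represent it.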

You can see this trend already on the list when people are discussing
characters under ballot for 10646-2. They are referred to by their
scalar values, and not by surrogate pairs, except when something about
the UTF-16 encoding form is what is at issue.

--Ken


