Re: plane business

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Oct 02 2001 - 12:57:08 EDT


At 10:42 PM 10/1/01 -0700, Bernard Miller wrote:

>--- Asmus Freytag <asmusf@ix.netcom.com> wrote:
> > There are 66 non-characters as of Unicode 3.1, there
> > were 34 non-characters
> > before.
>
>I understand now. The non characters in 16 higher
>planes were defined first, then the ones in the Arabic
>presentation forms block. In this case it is as I
>suspected, just a documentation problem. The book says
>"None of these surrogate pairs has been ASSIGNED in
>this version of the standard" (emphasis mine).

There are three kinds of things that can be stated about
a code point (code point, not character):
- allocation
- designation
- assignment

Allocation refers to whether the code point is part of
the standard. Allocation changed once in the life of
Unicode, to include the range 0x10000-0x10FFFF.

Designation refers to the status as character,
non-character, surrogate, private use character, etc.
Designation changed twice in Unicode, once to
designate the surrogates, and once to designate
32 code points on the BMP as non-characters.

Assignment refers to assigning a character to a
code point. New assignments are made all the time,
as new characters are added to the standard.
In the early history of Unicode, assignments changed
twice, once to reflect the merger with 10646, and
once to add the Korean Hangul. Future assignment
changes are restricted to adding new assignments.

Because people easily confuse code points and characters,
few people make the distinction between allocation,
designation, and assignment. New text being
drafted for Unicode 4.0 will clarify these terms.
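
To illustrate how allocation and designation can be decided from the
code point value alone, here is a minimal Python sketch. The names
(is_allocated, designation, NONCHARACTERS) are invented for this
example and do not come from the standard. Telling an assigned
character from a reserved code point needs the character database,
so the sketch lumps those two together.

```python
# Sketch only: names are hypothetical, ranges are the well-known ones.

def is_allocated(cp: int) -> bool:
    """Allocation: is the code point inside the Unicode code space?"""
    return 0x0000 <= cp <= 0x10FFFF

def designation(cp: int) -> str:
    """Designation: the status of an allocated code point."""
    if not is_allocated(cp):
        raise ValueError("not an allocated code point: %#x" % cp)
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # The last two code points of each plane, plus 32 in the BMP.
    if (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF:
        return "non-character"
    if (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "private use"
    # Assignment cannot be decided from the value alone.
    return "character or reserved"

# 17 planes x 2 code points, plus the 32 BMP code points.
PLANE_END = [plane * 0x10000 + low
             for plane in range(17)
             for low in (0xFFFE, 0xFFFF)]
BMP_BLOCK = list(range(0xFDD0, 0xFDF0))
NONCHARACTERS = PLANE_END + BMP_BLOCK
```

With these definitions, len(PLANE_END) is 34 and len(BMP_BLOCK) is 32,
which gives the 66 non-characters mentioned for Unicode 3.1.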

>It
>would merely be misleading to not mention 32 non
>characters in the section called "non characters" and
>to state that there are no characters in the higher
>planes as of Unicode 3.0; but I think we have a bona
>fide incorrect statement to say that no surrogate pair
>has been ASSIGNED when in fact 32 surrogate pairs were
>assigned the status of non characters.

As you can see from the above, they were "designated"
and not "assigned".

> > The reason to put the additional (defined in 3.1)
> > non-characters into the BMP is to allow them to
> > have single codes for UTF-16 implementation -
> > something that doesn't
> > work so well if they are on the higher planes.
>
>I don't understand this, the "arabic" non characters
>are supposed to REPRESENT the "hidden" non characters?

No, implementors in the UTC simply demonstrated a need
to have 32 non-character code points - code points that
they would be free to use internally because they would
never be a legal part of any interchanged data.

For UTF-16 implementations, using the 32 supplementary
non-characters would have forced them to use surrogate
pairs, which is awkward for the kinds of use intended
for internal-use code points. That's why 32 code points
in the BMP were re-designated from 'reserved' to
'non-character'.
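
The difference can be seen by working out the UTF-16 code units by
hand. The following is a small Python sketch, with a helper name
(utf16_code_units) chosen just for this example. It uses the
standard surrogate-pair arithmetic and is not taken from any
particular implementation.

```python
def utf16_code_units(cp: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("cannot encode %#x in UTF-16" % cp)
    if cp < 0x10000:
        # BMP code point: a single 16-bit code unit.
        return [cp]
    # Supplementary code point: split the 20-bit offset into
    # a high surrogate and a low surrogate.
    offset = cp - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

bmp_units = utf16_code_units(0xFDD0)              # one code unit
supplementary_units = utf16_code_units(0x1FFFE)   # two code units
```

A BMP non-character such as U+FDD0 fits in one 16-bit unit, so it
can serve as a sentinel in a UTF-16 buffer, while U+1FFFE needs the
pair D83F DFFE.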

A./


