Re: plane business

From: John Cowan ([email protected])
Date: Mon Oct 01 2001 - 20:54:06 EDT


Bernard Miller scripsit:

> I'm afraid I have a little bit of a beef about the
> Unicode documentation here, forgive me if this has
> already been brought up. How come UAX #27 says that
> Unicode 3.0 had 34 non characters, 32 of which are in
> supplementary planes? First of all, there are no
> characters defined in supplementary planes in Unicode
> 3.0.

Correct. However, the codepoints FFFE and FFFF in
*every* plane have been non-characters since Unicode
2.0 or even earlier. They were mentioned in ISO 10646
if not in Unicode itself.

> How many planes are defined in Unicode 3.1? UAX #27
> seems to indicate that it depends on what
> transformation format is used ("A process shall
> interpret the Unicode code units in accordance with
> the Unicode Transformation Format used."). UTF-8 seems
> to only define 17 planes but UTF-32 seems to have 128
> groups of 256 planes.

There are only 17 planes, period. Code units in UTF-32
greater than 0x10FFFF are not valid codepoints.
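
As a rough illustration of that limit, here is a minimal Python sketch. The
names MAX_CODE_POINT and is_valid_code_point are hypothetical, chosen for this
example only; they do not come from any Unicode library.

```python
# Hypothetical names, for illustration only.
MAX_CODE_POINT = 0x10FFFF  # last position of plane 16 (1114111 in decimal)
PLANE_SIZE = 0x10000       # 65536 positions per plane


def is_valid_code_point(value: int) -> bool:
    """True if value lies in the Unicode codespace 0..0x10FFFF.

    This checks only the range. It does not exclude surrogates or
    non-characters, which are still code points.
    """
    return 0 <= value <= MAX_CODE_POINT


# The codespace divides evenly into planes of 65536 positions each.
print((MAX_CODE_POINT + 1) // PLANE_SIZE)
print(is_valid_code_point(0x10FFFF), is_valid_code_point(0x110000))
```

A 32-bit code unit can hold values far beyond this range, but under the rule
above anything past 0x10FFFF simply fails the check.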

> UAX #27 says that Unicode 3.1
> defines 3 new supplementary planes... including plane
> 14. I have difficulty with that statement.. does that
> mean that there are only 3 new planes, or that there
> are (at least) 14 new planes, but only 3 of which have
> plane names and characters in them? At least 17 planes
> must be defined in order to define the 32 non
> characters in 16 supplementary planes, that's what
> common sense would say anyway.

Unicode 3.1 defined characters in three of the
existing 16 supplementary planes. The planes themselves
have been here since 2.0.

> This whole "plane" business suffers from a lack of
> documentation. UAX #27 talks about planes as if it's
> ancient history but the Unicode 3.0 book does not
> mention planes once (it's not in the index anyway). I
> would like the Unicode documentation to explain
> exactly what a plane is without requiring the 10646
> documentation which is only available for a fee. In
> fact, according to UAX #27 the planes are defined in
> terms of what WILL be in 10646-2.

A plane is a run of 65536 consecutive Unicode scalar values
(in the terminology of Unicode 2.0) that begins on a boundary
divisible by 65536.
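
Under that definition, the plane number is just the code position divided by
65536. A minimal Python sketch follows; plane_of and plane_range are
hypothetical helper names invented for this example.

```python
# Hypothetical helpers, for illustration only.
PLANE_SIZE = 0x10000  # 65536 code positions per plane


def plane_of(code_point: int) -> int:
    """Return the number of the plane containing code_point."""
    return code_point >> 16  # same as code_point // 65536


def plane_range(plane: int) -> tuple[int, int]:
    """Return the first and last code positions of the given plane."""
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1


print(plane_of(0x0041))                        # a position in plane 0
print(plane_of(0xE0001))                       # a position in plane 14
print([hex(x) for x in plane_range(1)])        # bounds of plane 1
```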

> I'm trying to get a grasp on exactly how many planes
> are defined in Unicode in part because it seems to
> affect the number of non characters that are defined.
> I also want to know the maximum number of characters
> that Unicode can encode. So far I reckon there are
> 1114112 (assuming 17 planes) minus 2048 (half
> surrogates) minus 2 (special non characters) minus 32
> ("hidden" non characters) minus 32 (non characters due
> to some arbitrary association between 16 higher planes
> code values and the special non characters code
> values) = 1111998 code positions available for
> characters.

Your reasoning is sound.
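
The arithmetic can be checked with a few lines of Python. This is only a
sketch of the calculation quoted above; the variable names are made up for
this example.

```python
# Recompute the count from the quoted message, assuming 17 planes.
PLANES = 17
total_positions = PLANES * 0x10000           # 1114112
surrogates = 0xDFFF - 0xD800 + 1             # 2048 surrogate code values
bmp_fffe_ffff = 2                            # FFFE and FFFF in plane 0
fdd0_to_fdef = 0xFDEF - 0xFDD0 + 1           # 32 "hidden" non-characters
supplementary_fffe_ffff = 2 * (PLANES - 1)   # xFFFE/xFFFF in planes 1..16

available = (total_positions - surrogates - bmp_fffe_ffff
             - fdd0_to_fdef - supplementary_fffe_ffff)
print(available)
```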

> What's with this 1114111 number I've seen
> on this list?

I have no clue.

> BTW, it doesn't make sense for every code position
> ending in FFFF or FFFE to be a non character.

It doesn't make much sense, but it is the rule anyway.

> Why isn't the same rule applied to the "hidden" non
> characters, so that every code value ending in FDD0 to
> FDEF is also a non character? Is it to contribute to
> their "hidden" nature?

No. There is simply no reason to reserve them on the other planes.
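
Put together, the two rules (FFFE and FFFF in every plane, FDD0 to FDEF in
plane 0 only) can be sketched in Python as below. The function name
is_noncharacter is hypothetical and the 17-plane limit is assumed.

```python
# Hypothetical helper, for illustration only; assumes 17 planes.
MAX_CODE_POINT = 0x10FFFF


def is_noncharacter(code_point: int) -> bool:
    """True if code_point is a non-character under the rules above."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        return False
    # FDD0..FDEF are reserved in plane 0 only.
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # The last two positions of every plane: low 16 bits are FFFE or FFFF.
    return (code_point & 0xFFFE) == 0xFFFE


# Count every non-character in the codespace.
print(sum(is_noncharacter(cp) for cp in range(MAX_CODE_POINT + 1)))
print(is_noncharacter(0x1FFFE), is_noncharacter(0x1FDD0))
```

The count is the 34 plus 32 discussed earlier in the thread.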

-- 
John Cowan           http://www.ccil.org/~cowan              [email protected]
Please leave your values        |       Check your assumptions.  In fact,
   at the front desk.           |          check your assumptions at the door.
     --sign in Paris hotel      |            --Miles Vorkosigan


