Re: plane business

From: John Cowan (cowan@mercury.ccil.org)
Date: Mon Oct 01 2001 - 20:54:06 EDT


Bernard Miller scripsit:

> I’m afraid I have a little bit of a beef about the
> Unicode documentation here, forgive me if this has
> already been brought up. How come UAX #27 says that
> Unicode 3.0 had 34 non characters, 32 of which are in
> supplementary planes? First of all, there are no
> characters defined in supplementary planes in Unicode
> 3.0.

Correct. However, the codepoints FFFE and FFFF in
*every* plane have been non-characters since Unicode
2.0 or even earlier. They were mentioned in ISO 10646
if not in Unicode itself.

> How many planes are defined in Unicode 3.1? UAX #27
> seems to indicate that it depends on what
> transformation format is used (“A process shall
> interpret the Unicode code units in accordance with
> the Unicode Transformation Format used.”). UTF-8 seems
> to only define 17 planes but UTF-32 seems to have 128
> groups of 256 planes.

There are only 17 planes, period. Code units in UTF-32
greater than 0x10FFFF are not valid codepoints.
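
A minimal sketch of that range check, in Python. The names
PLANE_SIZE, PLANE_COUNT, MAX_CODE_POINT and is_valid_code_point are
made up for illustration and do not come from the standard. The sketch
only tests the upper bound; it does not deal with surrogates.

```python
# Illustrative sketch only: all names here are hypothetical.
PLANE_SIZE = 0x10000                            # 65536 code points per plane
PLANE_COUNT = 17                                # plane 0 plus 16 supplementary planes
MAX_CODE_POINT = PLANE_COUNT * PLANE_SIZE - 1   # 0x10FFFF


def is_valid_code_point(unit):
    """Return True if a 32-bit code unit lies inside the 17 planes."""
    return 0 <= unit <= MAX_CODE_POINT


print(hex(MAX_CODE_POINT))             # 0x10ffff
print(is_valid_code_point(0x10FFFF))   # True
print(is_valid_code_point(0x110000))   # False
```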

> UAX #27 says that Unicode 3.1
> defines 3 new supplementary planes... including plane
> 14. I have difficulty with that statement: does that
> mean that there are only 3 new planes, or that there
> are (at least) 14 new planes, but only 3 of which have
> plane names and characters in them? At least 17 planes
> must be defined in order to define the 32 non
> characters in 16 supplementary planes, that’s what
> common sense would say anyway.

Unicode 3.1 defined characters in three of the
existing 16 supplementary planes. The planes themselves
have been here since 2.0.

> This whole “plane” business suffers from a lack of
> documentation. UAX #27 talks about planes as if it’s
> ancient history but the Unicode 3.0 book does not
> mention planes once (it’s not in the index anyway). I
> would like the Unicode documentation to explain
> exactly what a plane is without requiring the 10646
> documentation which is only available for a fee. In
> fact, according to UAX #27 the planes are defined in
> terms of what WILL be in 10646-2.

A plane is a sequence of 65536 consecutive Unicode scalar values,
in the terminology of Unicode 2.0, starting on a boundary that is
divisible by 65536.
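
Under that definition, the plane number is just the value divided by
65536. A small Python sketch; plane_of and plane_range are
hypothetical helper names, not anything defined by Unicode:

```python
# Illustrative sketch only: helper names are hypothetical.
PLANE_SIZE = 0x10000  # 65536


def plane_of(code_point):
    """Plane number of a value: the bits above the low 16."""
    return code_point >> 16


def plane_range(plane):
    """First and last value of a plane, as a (first, last) pair."""
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1


print(plane_of(0x0041))     # 0
print(plane_of(0xE0001))    # 14
print([hex(v) for v in plane_range(14)])   # ['0xe0000', '0xeffff']
```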

> I’m trying to get a grasp on exactly how many planes
> are defined in Unicode in part because it seems to
> affect the number of non characters that are defined.
> I also want to know the maximum number of characters
> that Unicode can encode. So far I reckon there are
> 1114112 (assuming 17 planes) minus 2048 (half
> surrogates) minus 2 (special non characters) minus 32
> (“hidden” non characters) minus 32 (non characters due
> to some arbitrary association between 16 higher planes
> code values and the special non characters code
> values) = 1111998 code positions available for
> characters.

Your reasoning is sound.
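
The subtraction above can be checked mechanically. A short Python
sketch, with variable names chosen here purely for illustration:

```python
# Illustrative sketch only: variable names are hypothetical.
total = 17 * 65536                    # 1114112 code positions in 17 planes
surrogates = 0xDFFF - 0xD800 + 1      # 2048 surrogate code values
bmp_fffe_ffff = 2                     # FFFE and FFFF in plane 0
fdd0_fdef = 0xFDEF - 0xFDD0 + 1       # 32 values FDD0..FDEF in plane 0
supplementary = 16 * 2                # xxFFFE and xxFFFF in planes 1..16

available = total - surrogates - bmp_fffe_ffff - fdd0_fdef - supplementary
print(available)                      # 1111998
```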

> What’s with this 1114111 number I’ve seen
> on this list?

I have no clue.

> BTW, it doesn’t make sense for every code position
> ending in FFFF or FFFE to be a non character.

It doesn't make much sense, but it is the rule anyway.

> Why isn’t the same rule applied to the “hidden” non
> characters, so that every code value ending in FDD0 to
> FDEF is also a non character? Is it to contribute to
> their “hidden” nature?

No. There is simply no reason to reserve them on the other planes.
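
Put together, the rule as described in this thread can be sketched in
Python as below. The function name is_noncharacter is an assumption
made for this example; the ranges are the ones discussed above.

```python
# Illustrative sketch only: the function name is hypothetical.
def is_noncharacter(cp):
    """True for FDD0..FDEF in plane 0, and for xxFFFE/xxFFFF in every plane."""
    if not 0 <= cp <= 0x10FFFF:
        return False                  # outside the 17 planes entirely
    if 0xFDD0 <= cp <= 0xFDEF:
        return True                   # reserved on plane 0 only
    return (cp & 0xFFFE) == 0xFFFE    # low 16 bits are FFFE or FFFF


print(is_noncharacter(0xFDD0))    # True
print(is_noncharacter(0x1FDD0))   # False: not reserved on other planes
print(is_noncharacter(0x2FFFE))   # True
print(sum(is_noncharacter(cp) for cp in range(0x110000)))   # 66
```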

-- 
John Cowan           http://www.ccil.org/~cowan              cowan@ccil.org
Please leave your values        |       Check your assumptions.  In fact,
   at the front desk.           |          check your assumptions at the door.
     --sign in Paris hotel      |            --Miles Vorkosigan


