Re: Names of planes, and request for sneak preview

From: Mark Davis (markdavis@ispchannel.com)
Date: Tue Jul 11 2000 - 11:27:43 EDT

Next message: Alan Wood: "RE: How-To handle i18n when you don't know charset?"
Previous message: Roozbeh Pournader: "RE: Not all Arabics are created equal..."
Maybe in reply to: Doug Ewell: "Names of planes, and request for sneak preview"
Next in thread: John H. Jenkins: "Re: Names of planes, and request for sneak preview"

We haven't used the notion of Planes and Groups. These actually derived, as far
as I can remember from the early days in L2, from later-discarded mechanisms
that would let you swap planes into the BMP. Thus it was important to distinguish
these levels. Planes and Groups are themselves not particularly useful in
Unicode, which has a flat coding space from 0 to 10FFFF. We basically just use
them now in communicating with our 10646 brethren.

However, there are certain units or thresholds that are useful to distinguish
in Unicode. The most important threshold is the one between FFFF and 10000:
important for UTF-16 implementations (and to a minor degree, UTF-8
implementations). So there are terms for codepoints above and below that. I've
heard the following used:

BMP characters: those with codepoints < 10000 (borrowing BMP from 10646)
aka UCS-2 characters
aka non-surrogate characters

non-BMP characters: those with codepoints > FFFF
aka non-UCS-2 characters
aka surrogate characters

Note: D800 - DFFF are *not* surrogate characters. They are surrogate
codepoints, two of which (in UTF-16) represent a surrogate character. The
disadvantage of the term "surrogate character" is precisely this possible
confusion; you don't have the same problem if you say "non-BMP character". It
would be nice to have another positive, non-acronymic term for those characters
above FFFF, but none has yet arisen.
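
As an illustration of the distinction above (an editorial sketch, not part of
the original message), the following Python shows the BMP threshold and how a
non-BMP codepoint maps to two surrogate codepoints in UTF-16. The function
names are hypothetical, chosen only for this example; the arithmetic is the
standard UTF-16 mapping.

```python
def is_bmp(cp: int) -> bool:
    """Return True if cp is below 0x10000, i.e. a BMP codepoint."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode coding space 0..10FFFF")
    return cp < 0x10000


def is_surrogate_codepoint(cp: int) -> bool:
    """Return True for D800..DFFF, which are codepoints, not characters."""
    return 0xD800 <= cp <= 0xDFFF


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Map a non-BMP codepoint to its (high, low) UTF-16 surrogates."""
    if is_bmp(cp):
        raise ValueError("BMP codepoints are not encoded with surrogates")
    offset = cp - 0x10000              # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low
```

Under these assumptions, to_surrogate_pair(0x10000) gives (0xD800, 0xDC00) and
to_surrogate_pair(0x10FFFF) gives (0xDBFF, 0xDFFF), while is_bmp(0xFFFF) is
True and is_bmp(0x10000) is False.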

There are other useful boundaries:

Column - 16 values with all but the last 4 binary digits the same: e.g.
2060-206F

Window - 128 values with all but the last 7 binary digits the same, e.g.
2000-207F, or 2080-20FF. Used in SCSU; for compression, blocks of 128 are
useful. In the UTC, we try not to span window boundaries unnecessarily when
allocating characters (for historical reasons, we used to violate this, cf
Hebrew or the Kanas).

aka half-row (in 10646, a Row is 256 values with all but the last 8 binary
digits the same).

Surrogate Block - 1024 values with all but the last 10 binary digits the same,
e.g. E0000-E03FF. In UTF-16, these have the same high (leading) surrogate code
value. In the UTC, we try not to span surrogate block boundaries unnecessarily
when allocating characters.
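
The boundaries listed above all come down to dropping some number of low-order
bits. The following Python is an editorial sketch, not part of the original
message; the constant and function names are hypothetical. It computes the
unit containing a codepoint and checks that codepoints in one surrogate block
share a high surrogate.

```python
COLUMN_BITS = 4             # 16 values
WINDOW_BITS = 7             # 128 values (half-row)
ROW_BITS = 8                # 256 values (10646 Row)
SURROGATE_BLOCK_BITS = 10   # 1024 values


def unit_range(cp: int, bits: int) -> tuple[int, int]:
    """Return (first, last) of the aligned unit of 2**bits values holding cp."""
    size = 1 << bits
    first = cp & ~(size - 1)    # clear the low-order bits
    return first, first + size - 1


def same_unit(a: int, b: int, bits: int) -> bool:
    """True if a and b agree on everything except the last `bits` bits."""
    return (a >> bits) == (b >> bits)


def high_surrogate(cp: int) -> int:
    """High (leading) UTF-16 surrogate of a codepoint above FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only codepoints above FFFF use surrogates")
    return 0xD800 + ((cp - 0x10000) >> 10)
```

For instance, unit_range(0x2065, COLUMN_BITS) gives (0x2060, 0x206F) and
unit_range(0x2065, WINDOW_BITS) gives (0x2000, 0x207F). The 1024-value unit
containing E0123 runs from E0000 through E03FF, and every codepoint in it has
the high surrogate DB40; E0400 starts the next unit, with high surrogate DB41.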

Mark

Doug Ewell wrote:

> John Cowan <jcowan@reutershealth.com> wrote:
>
> >> Everybody and his cat should know that BMP stands for Basic Multilingual
> >> Plane, and the Roadmap pages also show that SMP is short for Secondary
> >> Multilingual Plane. What are SIP and GPP?
> >
> > Supplementary Ideographic Plane, General Purpose Plane. Note that these
> > are 10646 names, not Unicode names.
>
> Interesting... I hadn't looked at it that way. I know that the entire
> group/plane/row/cell breakdown is a 10646 thing. Is there a Unicode-
> specific term for the range from U+0000 to U+FFFD, the code points that
> can be represented without surrogates?
>
> -Doug Ewell
> Fullerton, California

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:05 EDT