Re: Why 17 planes? (was: Re: Why 11 planes?) from Philippe Verdy on 2012-11-27 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 27 Nov 2012 13:01:17 +0100

That's a valid computation if the extension was limited to use only
2-surrogate encodings for supplementary planes.

If we could use 3-surrogate encodings, you'd need
3*2^n
to encode
2^(3*n)
new codepoints.

With n=10 (like today), this requires a total of 3072 surrogates, and you
encode 2^30 new codepoints. This is still possible today, even if the BMP
is almost full and won't allow a new range of 1024 surrogates: you can
still use 2 existing surrogates to encode 2048 "hyper-surrogates" in the
special plane 14 (or, for private use, in the private-use planes 15 and 16),
which will combine with the existing low surrogates in the BMP.
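
As a quick check of the arithmetic above, here is a small Python sketch. The
function name surrogate_budget is only illustrative, and it assumes that each
position in a sequence draws from its own range of 2^n surrogates:

```python
def surrogate_budget(k, n):
    """Return (reserved, encoded) for sequences of k surrogates,
    each position drawing from its own range of 2**n code points."""
    reserved = k * 2**n      # k disjoint ranges of 2^n surrogates
    encoded = 2**(k * n)     # every combination is one new codepoint
    return reserved, encoded

print(surrogate_budget(2, 10))  # (2048, 1048576): UTF-16 as it is today
print(surrogate_budget(3, 10))  # (3072, 1073741824): the 3-surrogate variant
```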

This is not complicated to parse in the forward direction, but for the
backward direction, it means that when you see the final low surrogate, you
still need to roll back to the previous position: it can only be a leading
high surrogate of the BMP, **or** (this would be new) another low
surrogate, in which case you must step back once more to reach the leading
high surrogate. This requires a test when starting from a random position,
but at least it remains possible to find the leading high surrogate.
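
The backward scan could look roughly like the following Python sketch. The
helper names are hypothetical, and the code assumes the input is a list of
16-bit code units containing only well-formed <high, low> or <high, low, low>
sequences:

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def sequence_start(units, i):
    """Index of the first code unit of the sequence containing units[i]."""
    if not is_low(units[i]):
        return i                      # BMP unit or leading high surrogate
    if i >= 1 and is_high(units[i - 1]):
        return i - 1                  # units[i] is the second unit
    if i >= 2 and is_low(units[i - 1]) and is_high(units[i - 2]):
        return i - 2                  # units[i] is the third unit
    raise ValueError("isolated low surrogate at index %d" % i)
```

At most two steps back are needed, so resynchronization from a random
position stays bounded.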

One problem with this scheme is that it is not compatible with UTF-16,
because you would find a sequence like:
<HIGH SURROGATE #1 OF THE BMP, LOW SURROGATE #2 OF THE BMP, LOW
SURROGATE #3 OF THE BMP>
which UTF-16 would parse as:
<VALID SUPPLEMENTARY CODEPOINT FROM SURROGATES(#1,#2), LOW SURROGATE #3 OF
THE BMP>

The first code point is valid, but for UTF-16 working in strict mode, the
trailing low surrogate is isolated. It generates an exception (encoding
error).
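
This behaviour can be observed with Python's built-in UTF-16 codec, used here
only as one example of a strict decoder. The three surrogate values are
arbitrary, chosen for illustration:

```python
import struct

# <high #1, low #2, low #3> as little-endian 16-bit code units
units = (0xDBC0, 0xDC00, 0xDC01)
data = struct.pack("<3H", *units)

try:
    data.decode("utf-16-le")            # strict error handling is the default
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False                   # the trailing low surrogate is rejected

# A lenient decoder keeps the first pair and replaces the isolated surrogate.
lenient = data.decode("utf-16-le", errors="replace")
print(strict_ok, [hex(ord(c)) for c in lenient])
```

The first two units decode to U+100000; the third has no leading high
surrogate and is what triggers the error.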

But this exception could be handled by verifying that this isolated low
surrogate follows a codepoint assigned to one of the 2048
"hyper-surrogates" (allocated in the special plane 14, or privately in
planes 15 or 16 in order to encode only private-use codepoints). This would
no longer be valid UTF-16, but something else (say "UTF-X16").
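
A minimal sketch of such a "UTF-X16" decoder in Python follows. Everything
specific in it is an assumption made for illustration: the hyper-surrogates
are placed at U+100000..U+1007FF (private plane 16), and extended values are
numbered from 0x110000 upward, which yields 2048 * 1024 extended values under
this particular allocation:

```python
HYPER_BASE = 0x100000   # hypothetical: first of the 2048 "hyper-surrogates"
HYPER_COUNT = 2048
EXT_BASE = 0x110000     # hypothetical: first value beyond the 17 planes

def decode_utf_x16(units):
    """Decode a list of 16-bit code units into a list of integers."""
    out, i, n = [], 0, len(units)
    while i < n:
        u = units[i]
        i += 1
        if 0xD800 <= u <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            # ordinary UTF-16 surrogate pair
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
            if HYPER_BASE <= cp < HYPER_BASE + HYPER_COUNT:
                # a hyper-surrogate must be completed by one more low surrogate
                if i >= n or not 0xDC00 <= units[i] <= 0xDFFF:
                    raise ValueError("hyper-surrogate without trailing low surrogate")
                cp = EXT_BASE + ((cp - HYPER_BASE) << 10) + (units[i] - 0xDC00)
                i += 1
            out.append(cp)
        elif 0xD800 <= u <= 0xDFFF:
            raise ValueError("isolated surrogate")
        else:
            out.append(u)
    return out
```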

The **current bet** is that such a mechanism will **never** be needed for
encoding standard codepoints, which will all fit in the existing 17 planes
(even if 4 of them are almost full, a 5th one will be filled
significantly for sinograms, and a 6th one is allocated only for special
codepoints but is almost empty); it would only be needed for encoding more
private-use codepoints.

But then, if this need is only for encoding many new private codepoints,
why would we need to encode the final surrogate in the standard range? You
can do the same thing by allocating the final surrogate in the private use
area of the BMP for that usage. Or equivalently by allocating the 3 ranges
of 1024 private-use surrogates directly in the BMP; the PUA area of the BMP
is large enough to encode these needed 3072 private-use surrogates to
support 2^30 new private-use codepoints, and it does not require any
modification to the existing UTF-16.
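
As an illustration, here is a Python sketch of such a private convention. The
three ranges U+E000, U+E400 and U+E800 (1024 code points each) are
hypothetical choices inside the BMP private-use area U+E000..U+F8FF, and the
function names are made up for this example:

```python
# Hypothetical private-use "surrogate" ranges inside the BMP PUA
R1, R2, R3 = 0xE000, 0xE400, 0xE800   # three ranges of 1024 code points each

def encode_private(value):
    """Map a 30-bit private value to three BMP private-use characters."""
    if not 0 <= value < 2**30:
        raise ValueError("value out of range")
    return (chr(R1 + (value >> 20))
            + chr(R2 + ((value >> 10) & 0x3FF))
            + chr(R3 + (value & 0x3FF)))

def decode_private(s):
    """Inverse of encode_private for a 3-character string."""
    a, b, c = (ord(ch) for ch in s)
    if not (R1 <= a < R1 + 1024 and R2 <= b < R2 + 1024 and R3 <= c < R3 + 1024):
        raise ValueError("not a private 3-surrogate sequence")
    return ((a - R1) << 20) | ((b - R2) << 10) | (c - R3)

s = encode_private(0x2468ACE)
encoded = s.encode("utf-16-le")   # succeeds: three ordinary BMP code units
```

Because the three ranges are disjoint, the position of a unit within its
sequence is known from its value alone, in both scan directions.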

In other words, there's no limit on the number of private-use codepoints
you can encode with UTF-16. We do, however, depend on the decision
that the 17 planes will be enough for all standard uses (otherwise a new
UTF like UTF-X16 above may be standardized, with limited compatibility with
UTF-16).

2012/11/27 "Martin J. Dürst" <duerst_at_it.aoyama.ac.jp>

> Well, first, it is 17 planes (or have we switched to using hexadecimal
> numbers on the Unicode list already?).
>
> Second, of course this is in connection with UTF-16. I wasn't involved
> when UTF-16 was created, but it must have become clear that 2^16 (^ denotes
> exponentiation ("to the power of")) codepoints (UCS-2) wasn't going to be
> sufficient. Assuming a surrogate-like extension mechanism, with high
> surrogates and low surrogates separated for easier synchronization, one
> needs
>
> 2 * 2^n
> surrogate-like codepoints to create
>
> 2^(2*n)
> new codepoints.
>
> For doubling the number of codepoints (i.e. a total of 2 planes), one
> would use n=8, and so one needs 512 surrogate-like codepoints. With n=9,
> one gets 4 more planes for a total of 5 planes, and needs 1024
> surrogate-like codepoints. With n=10, one gets 16 more planes (for the
> current total of 17), but needs 2048 surrogate codepoints. With n=11, one
> would get 64 more planes for a total of 65 planes, but would need 4096
> codepoints. And so on.
>
> My guess is that when this was considered, 1,048,576 codepoints was
> thought to be more than enough, and giving up 4096 codepoints in the BMP
> was no longer possible. As an additional benefit, the 17 planes fit nicely
> into 4 bytes in UTF-8.
>
> Regards, Martin.
>
> On 2012/11/26 19:47, Shriramana Sharma wrote:
>
>> I'm sorry if this info is already in the Unicode website or book, but
>> I searched and couldn't find it in a hurry.
>>
>> When extending beyond the BMP and the maximum range of 16-bit
>> codepoints, why was it chosen to go up to 10FFFF and not any more or
>> less? Wouldn't FFFFF have been the next logical stop beyond FFFF, even
>> if FFFFFF (or FFFFFFFF) is considered too big? (I mean, I'm not sure
>> how that extra 64Ki chars [10FFFF minus FFFFF] could be important...)
>>
>> Thanks.
>>
>>
>
Received on Tue Nov 27 2012 - 06:04:54 CST
