Re: Why 17 planes? (was: Re: Why 11 planes?) from Philippe Verdy on 2012-11-27 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 27 Nov 2012 13:20:43 +0100

Note that the **current bet** that the existing 17 planes will be sufficient
holds only if there is no international desire to encode anything beyond
what is in the current focus of Unicode.

Say (for example) that the WIPO absolutely wants to encode corporate logos.
Or ISO or the IETF itself wants to create a set of compact protocol
identifiers (less ambiguous or less limited than existing identifiers that
use basic ASCII strings limited in length), from various registries. Other
trends are also appearing now, such as the desire to encode many personal
characters from authors, or many new pictograms from wide collections and
from many more standards bodies, and these may weaken the validity of the
current bet.

None of these extensions would fit in the 17 planes under the current
encoding policy of the ISO/IEC 10646 and Unicode standards, but these bodies
may start developing their own standard that encodes them in new planes, and
may force ISO to accept the allocation of these "hyperplanes".

For this, a mechanism like UTF-X16 could be used (the PUA-only encoding
would not be suitable), then standardized as a new "extended UTF" (or
UTF-X). An 8-bit version and a new 32-bit version would also have to be
developed, with limited compatibility with existing UTF-8 and UTF-32 (the
initial designs of UTF-8 and UTF-32 made by ISO could be used for them,
except that they would have new names such as UTF-X8 and UTF-X32).

For clarity, even the ISO/IEC 10646 standard should not be used; another
ISO standard (1n646?) should be developed instead, with limited
compatibility with ISO/IEC 10646 (full upward compatibility only, but no
direct support for backward compatibility without an upper-layer mechanism
based on traditional registries registering strings and a defined protocol
language such as an XML schema or an escaping language).

2012/11/27 Philippe Verdy <verdy_p_at_wanadoo.fr>

> That's a valid computation if the extension was limited to use only
> 2-surrogate encodings for supplementary planes.
>
> If we could use 3-surrogate encodings, you'd need
> 3*2^n surrogates
> to encode
> 2^(3*n)
> new codepoints.
>
> With n=10 (like today), this requires a total of 3072 surrogates, and you
> encode 2^30 new codepoints. This is still possible today, even if the BMP
> is almost full and won't allow a new range of 1024 surrogates: you can
> still use 2 existing surrogates to encode 2048 "hyper-surrogates" in the
> special plane 14 (or for private use in the private planes 15 and 16),
> which will combine with the existing low surrogates in the BMP.
>
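The arithmetic above can be checked with a few lines of Python. This is
only a sketch: the function name surrogate_capacity and its parameters k
(units per sequence) and n (bits carried per unit) are invented here for
illustration and do not come from the thread.

```python
def surrogate_capacity(k: int, n: int) -> tuple[int, int]:
    """Return (code units reserved, new code points encodable) when each
    new code point is a sequence of k surrogate-like units and each unit
    is drawn from its own range of 2**n values."""
    reserved = k * 2**n       # k disjoint ranges of 2**n units each
    encodable = 2**(k * n)    # every combination names one new code point
    return reserved, encodable


print(surrogate_capacity(2, 10))  # (2048, 1048576): UTF-16 as it exists today
print(surrogate_capacity(3, 10))  # (3072, 1073741824): the 3-surrogate variant
```

Going from 2 to 3 units per sequence costs 1024 more reserved values but
multiplies the encodable space by 1024.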
> This is not complicated to parse in the forward direction, but in the
> backward direction, when you see the final low surrogate, you still need
> to step back to the previous position: it can only be a leading high
> surrogate of the BMP, **or** (this would be new) another low surrogate, in
> which case you must step back once more to reach the leading high
> surrogate. This requires a test when starting from a random position, but
> at least it remains possible to locate the leading high surrogate.
>
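The backward step described above could look roughly like this in Python.
It is a sketch under stated assumptions: the input is a list of 16-bit code
units that is already well formed, and a sequence is at most <high, low,
low>. The helper names is_high, is_low and sequence_start are made up for
this example.

```python
def is_high(u: int) -> bool:
    return 0xD800 <= u <= 0xDBFF


def is_low(u: int) -> bool:
    return 0xDC00 <= u <= 0xDFFF


def sequence_start(units: list[int], i: int) -> int:
    """Index of the first code unit of the sequence containing units[i].

    Assumes well-formed input where a sequence is a single BMP unit,
    <high, low>, or the hypothetical three-unit <high, low, low>.
    """
    steps = 0
    # From a low surrogate, step back at most twice: the previous unit is
    # either the leading high surrogate or one more low surrogate.
    while i > 0 and is_low(units[i]) and steps < 2:
        i -= 1
        steps += 1
    return i


units = [0x41, 0xDB42, 0xDC00, 0xDC05, 0x42]
print(sequence_start(units, 3))  # 1: the trailing low belongs to the triple
```

In plain UTF-16 the loop would never need more than one step; the second
step is the extra test mentioned above.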
> One problem of this scheme is that it is not compatible with UTF-16
> because you would find a sequence like:
> <HIGH SURROGATE #1 OF THE BMP, LOW SURROGATE #2 OF THE BMP, LOW
> SURROGATE #3 OF THE BMP>
> which UTF-16 would parse as:
> <VALID SUPPLEMENTARY CODEPOINT FROM SURROGATES(#1,#2), LOW SURROGATE #3 OF
> THE BMP>
>
> The first code point is valid, but for UTF-16 working in strict mode, the
> trailing low surrogate is isolated. It generates an exception (encoding
> error).
>
> But this exception could be handled by verifying that this isolated low
> surrogate follows a codepoint assigned to one of the 2048
> "hyper-surrogates" allocated in plane 14 (or privately in planes 15 or 16,
> in order to encode only private-use codepoints). This would no longer be
> valid UTF-16, but something else (say "UTF-X16").
>
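A decoder that applies this check might be sketched as follows. Everything
concrete here is an assumption made for illustration, not something the
thread specifies: the hyper-surrogate block U+E0800..U+E0FFF, the numbering
of extended code points starting at 0x110000, and the name decode_utfx16.

```python
HYPER_BASE = 0xE0800   # hypothetical block of 2048 "hyper-surrogates"
HYPER_COUNT = 2048
EXT_BASE = 0x110000    # assumed: first value past the existing 17 planes


def decode_utfx16(units: list[int]) -> list[int]:
    """Decode 16-bit code units into code point integers."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        i += 1
        if 0xD800 <= u <= 0xDBFF:                      # high surrogate
            if i >= len(units) or not 0xDC00 <= units[i] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
            hyper = cp - HYPER_BASE
            # The new rule: a low surrogate right after a hyper-surrogate
            # extends the sequence instead of being an encoding error.
            if (0 <= hyper < HYPER_COUNT and i < len(units)
                    and 0xDC00 <= units[i] <= 0xDFFF):
                cp = EXT_BASE + (hyper << 10) + (units[i] - 0xDC00)
                i += 1
            out.append(cp)
        elif 0xDC00 <= u <= 0xDFFF:                    # strict-mode error
            raise ValueError("isolated low surrogate")
        else:
            out.append(u)                              # ordinary BMP unit
    return out


print([hex(c) for c in decode_utfx16([0xDB42, 0xDC00, 0xDC05])])  # ['0x110005']
```

In this sketch a hyper-surrogate that is not followed by a low surrogate is
passed through unchanged; a real design would have to decide whether that
case is an error.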
> The **current bet** is that such a mechanism will **never** be needed for
> encoding standard codepoints (which will all fit in the existing 17 planes,
> even if 4 of them are almost full, a 5th one will be filled significantly
> for sinograms, and a 6th one is allocated only for special codepoints but
> is almost empty), but only for encoding more private-use codepoints.
>
> But then, if this need is only for encoding many new private codepoints,
> why would we need to encode the final surrogate in the standard range? You
> can do the same thing by allocating the final surrogate in the private use
> area of the BMP for that usage. Or equivalently by allocating the 3 ranges
> of 1024 private-use surrogates directly in the BMP; the PUA of the BMP
> is large enough to encode these needed 3072 private-use surrogates to
> support 2^30 new private-use codepoints, and it does not require any
> modification to the existing UTF-16.
>
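As an illustration of the PUA-only variant, the sketch below packs a 30-bit
private value into three BMP private-use characters. The three 1024-wide
ranges (starting at U+E000, U+E400 and U+E800) and the function names are
arbitrary choices made for this example; the only fact relied on is that
the BMP private use area, U+E000..U+F8FF, has room for 3072 code points.

```python
RANGE_BASES = (0xE000, 0xE400, 0xE800)  # hypothetical: 3 x 1024 PUA code points


def encode_private(value: int) -> str:
    """Encode a 30-bit private value as three BMP PUA characters."""
    if not 0 <= value < 2**30:
        raise ValueError("value must fit in 30 bits")
    digits = (value >> 20, (value >> 10) & 0x3FF, value & 0x3FF)
    return "".join(chr(base + d) for base, d in zip(RANGE_BASES, digits))


def decode_private(text: str) -> int:
    """Inverse of encode_private."""
    if len(text) != 3:
        raise ValueError("expected exactly three characters")
    value = 0
    for base, ch in zip(RANGE_BASES, text):
        d = ord(ch) - base
        if not 0 <= d < 1024:
            raise ValueError("character outside its private-use range")
        value = (value << 10) | d
    return value


s = encode_private(123456789)
print(decode_private(s))              # 123456789
print(len(s.encode("utf-16-le")))     # 6: three ordinary 16-bit code units
```

Because every unit is an ordinary BMP character, any conformant UTF-16
implementation passes the string through unchanged; only the private
agreement between sender and receiver gives the triple its meaning.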
> In other words, there's no limitation in the number of codepoints for
> private use you can encode with UTF-16. We however depend on the decision
> that the 17 planes will be enough for all standard uses (otherwise a new
> UTF like UTF-X16 above may be standardized, with limited compatibility with
> UTF-16).
>
>
>
> 2012/11/27 "Martin J. Dürst" <duerst_at_it.aoyama.ac.jp>
>
>> Well, first, it is 17 planes (or have we switched to using hexadecimal
>> numbers on the Unicode list already?).
>>
>> Second, of course this is in connection with UTF-16. I wasn't involved
>> when UTF-16 was created, but it must have become clear that 2^16 (^ denotes
>> exponentiation ("to the power of")) codepoints (UCS-2) wasn't going to be
>> sufficient. Assuming a surrogate-like extension mechanism, with high
>> surrogates and low surrogates separated for easier synchronization, one
>> needs
>>
>> 2 * 2^n
>> surrogate-like codepoints to create
>>
>> 2^(2*n)
>> new codepoints.
>>
>> For doubling the number of codepoints (i.e. a total of 2 planes), one
>> would use n=8, and so one needs 512 surrogate-like codepoints. With n=9,
>> one gets 4 more planes for a total of 5 planes, and needs 1024
>> surrogate-like codepoints. With n=10, one gets 16 more planes (for the
>> current total of 17), but needs 2048 surrogate codepoints. With n=11, one
>> would get 64 more planes for a total of 65 planes, but would need 4096
>> codepoints. And so on.
>>
>> My guess is that when this was considered, 1,048,576 codepoints was
>> thought to be more than enough, and giving up 4096 codepoints in the BMP
>> was no longer possible. As an additional benefit, the 17 planes fit nicely
>> into 4 bytes in UTF-8.
>>
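The upper limit 10FFFF follows directly from the n=10 case, and the UTF-8
remark can be checked the same way. The short Python check below is only an
illustration; the constant names are invented here.

```python
BMP_SIZE = 0x10000                # 2**16 code points in plane 0
SUPPLEMENTARY = 1024 * 1024       # 1024 high x 1024 low surrogate combinations
TOTAL = BMP_SIZE + SUPPLEMENTARY  # all code points reachable from UTF-16

max_cp = TOTAL - 1
planes = TOTAL // BMP_SIZE

print(hex(max_cp), planes)                        # 0x10ffff 17
print(len(chr(max_cp).encode("utf-8")))           # 4 bytes in UTF-8
print(len(chr(max_cp).encode("utf-16-le")) // 2)  # 2 code units in UTF-16
print(max_cp < 2**21)                             # True: fits the 21 payload bits of 4-byte UTF-8
```

So the "extra" 64Ki code points between FFFFF and 10FFFF are simply the
BMP itself, which sits below the 2^20 code points added by surrogate pairs.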
>> Regards, Martin.
>>
>> On 2012/11/26 19:47, Shriramana Sharma wrote:
>>
>>> I'm sorry if this info is already in the Unicode website or book, but
>>> I searched and couldn't find it in a hurry.
>>>
>>> When extending beyond the BMP and the maximum range of 16-bit
>>> codepoints, why was it chosen to go up to 10FFFF and not any more or
>>> less? Wouldn't FFFFF have been the next logical stop beyond FFFF, even
>>> if FFFFFF (or FFFFFFFF) is considered too big? (I mean, I'm not sure
>>> how that extra 64Ki chars [10FFFF minus FFFFF] could be important...)
>>>
>>> Thanks.
>>>
>>>
>>
>
Received on Tue Nov 27 2012 - 06:22:24 CST
