Re: Rationale for U+10FFFF?

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Mar 06 2000 - 15:32:53 EST


Harald Alvestrand asked:

> the current trend in UNICODE/ISO 10646 seems to be to limit the number of
> planes to 17 (U+0 to U+10FFFF).
> Can someone tell me the rationale for not deprecating plane 16, and leave us
> with the much more rational U+0 to U+FFFFF?

The rationale ("rational" or not) is based on preserving -- not changing --
the current standard definition of UTF-16.

As it stands currently, the definition of UTF-16 (in both the Unicode Standard
and ISO/IEC 10646-1:2000) makes use of two sets of 1K code points to
access 1K x 1K code points beyond the BMP. Hence the full encoding is:

BMP (64K) + Planes 1-16 (16 x 64K)

And the scalar values range 0 .. 0x10FFFF.
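
To make the mechanism concrete, here is a minimal sketch in C of that
surrogate mapping (the function name is illustrative, not from any
standard API):

    #include <stdint.h>
    #include <assert.h>

    /* Map a scalar value in U+10000..U+10FFFF onto a UTF-16 surrogate
       pair. The 20 bits of (c - 0x10000) split into two 10-bit halves:
       the high half selects one of the 1K high surrogates (D800..DBFF),
       the low half one of the 1K low surrogates (DC00..DFFF). */
    void to_surrogate_pair(uint32_t c, uint16_t *hi, uint16_t *lo)
    {
        assert(c >= 0x10000 && c <= 0x10FFFF);
        c -= 0x10000;
        *hi = (uint16_t)(0xD800 + (c >> 10));
        *lo = (uint16_t)(0xDC00 + (c & 0x3FF));
    }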

The upside of deprecating plane 16 would be to make the standard exactly
20 bits, instead of 20 bits and a fraction. That may appeal to one's
sense of aesthetics.

But the downsides of deprecating plane 16 include:

   A. It would eliminate almost half of the available PUA code space.
      For those worried about the question of "Is there enough?" and
      for those with large user-space implementations, that could
      be significant.

   B. It would put a "hole" in UTF-16 implementations. In return for
      cleaning up the UTF-32 (and scalar value) ranges, you would
      end up having to range-check the upper end of the surrogates.
      Instead of the "rational" allocation of the entire 1K x 1K,
      you would end up with an "irrational" allocation of only
      (1K - 64) x 1K surrogates, with illegal high surrogate values of
      U+DBC0..U+DBFF that could not be used for anything and would
      have to be range-checked to avoid illegal values (see the
      sketch below).
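
To illustrate point B, a hypothetical sketch in C (the rule, not the
function names, is what matters): under the current standard a high
surrogate check is a single mask-and-compare, whereas deprecating
plane 16 would force an extra comparison to exclude the 64 dead values:

    #include <stdint.h>

    /* Current standard: all 1K high surrogates D800..DBFF are valid. */
    int is_high_surrogate(uint16_t u)
    {
        return (u & 0xFC00) == 0xD800;
    }

    /* Hypothetical 15-plane rule: DBC0..DBFF would lead nowhere, so a
       second range check is needed on top of the mask. */
    int is_usable_high_surrogate_15_planes(uint16_t u)
    {
        return (u & 0xFC00) == 0xD800 && u < 0xDBC0;
    }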

> Or even allocating the whole bit, and making it U+0 to U+1FFFFF?

This would be even worse. It offers no advantage whatsoever over the
current situation. The whole point of the proposed changes is to bring
UTF-16, UTF-8, and UTF-32 into *exact* alignment. Leaving the extra bit
is no different for this purpose than leaving 11 extra bits.

>
> To me, this seems on a par with the ISO session layer that mandated
> sequence number ranges of 0 to 99999 (0x1869F - a 17-bit number) -
> something that will cause readers for tens of years to come to shake their
> heads and say "these guys didn't know what they were doing"; checks for
> legality now need a range compare, not just an AND operation.

Arbitrary stupidity in insisting on decimal ranges in an environment
that lives and breathes hexadecimal -- hence ignoring a 16-bit to
17-bit transition where it could have been avoided -- is one thing.
But everybody involved in Unicode is thoroughly versed in hex -- and
we are not talking about a 16-bit/17-bit boundary here -- we are
talking about a 20-bit/21-bit boundary. Those are not the same at all.
The defense of the 16-bit boundary was abandoned in Unicode 2.0. At
this point, the difference between a 20-bit, a 20.1-bit, or a 21-bit
boundary comes down to how we can effectively defend the *current*
standard -- which is 20.1 bits. Changing it to eliminate the
fractional bit would upset far more than it would "fix."

> Similar for
> UTF-8 encoders/decoders; this extra plane will haunt implementations for
> years to come.

I don't think so at all. The 20-bit boundary does not correspond to
anything meaningful in a UTF-8 implementation.

    Scalar      UTF-16          UTF-8
    --------    -------------   -----------
    U-10000     U+D800 U+DC00   F0 90 80 80
    U-FFFFF     U+DBBF U+DFFF   F3 BF BF BF
    U-10FFFF    U+DBFF U+DFFF   F4 8F BF BF
    U-1FFFFF    -------------   F7 BF BF BF

U-1FFFFF is the highest UCS-4 value that can be expressed in 4-byte UTF-8.
But it cannot be expressed in UTF-16 (nor can any of Planes 17..31).
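
The table rows follow directly from the standard 4-byte UTF-8 packing; a
minimal sketch, with no validation, to show where the bytes come from:

    #include <stdint.h>

    /* Pack a scalar value in U-10000..U-1FFFFF into the 4-byte UTF-8
       form shown in the table: 3 bits in the lead byte, then three
       6-bit continuation bytes. No range validation is done here;
       values above U-10FFFF appear only to illustrate the UCS-4 limit. */
    void to_utf8_4byte(uint32_t c, uint8_t out[4])
    {
        out[0] = (uint8_t)(0xF0 | (c >> 18));
        out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((c >> 6)  & 0x3F));
        out[3] = (uint8_t)(0x80 | (c & 0x3F));
    }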

U-FFFFF is the highest UCS-4 value that can be expressed in 20 bits, but
both it and U-10FFFF can be expressed in 4-byte UTF-8. From the point of
view of a UTF-8 encoder/decoder, U-FFFFF and U-10FFFF are equally
arbitrary "top" points. Validating UTF-16-compatible (and thus
Unicode-compliant) data in a UTF-8 encoder/decoder requires a range
check in either case. And changing the top of the range from U-10FFFF to
U-FFFFF would have no advantage whatsoever in these algorithms. On the
contrary, it would have a distinct *disadvantage* -- since it would
mean that currently valid algorithms would have to be recoded to respect
the new, lower range instead of the currently standard one.
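
To see why neither ceiling is cheaper than the other, consider the
validation step a decoder would run on a decoded 4-byte value (a sketch;
the U+FFFFF variant is the hypothetical one):

    #include <stdint.h>

    /* Validate a scalar value decoded from a 4-byte UTF-8 sequence for
       UTF-16-compatible (Unicode-compliant) data. Either ceiling is the
       same range compare; only the constant changes. */
    int is_valid_supplementary(uint32_t c)
    {
        return c >= 0x10000 && c <= 0x10FFFF;  /* hypothetical: 0xFFFFF */
    }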

>
> It may not matter much in real CPU time, but to me, it is an offense to the
> aesthetics of the representation.

I guess it depends on your sense of aesthetics. ;-)

--Ken

>
> Comments?
>
> Harald A
>


