Re: Code pages and Unicode

From: Ken Whistler <kenw_at_sybase.com>
Date: Wed, 24 Aug 2011 17:07:03 -0700

On 8/24/2011 3:51 PM, Richard Wordingham wrote:
>> Well, in that case, the correct action is to work to ensure that code
>> points are not squandered.
> Have there not already been several failures on that front? The BMP is
> littered with concessions to the limitations of rendering systems -
> precomposed characters, Hangul syllables and Arabic presentation forms
> are the most significant.

Those are not concessions to "the limitations of rendering systems" -- they
are concessions to the need to stay compatible with the character encodings
of legacy systems, which had limitations for their rendering systems.

A quibble? I think not.

Note the outcome for Tibetan, for example. A proposal came in some years
ago to encode "all" of the stacks for Tibetan as separate, precomposed
characters -- ostensibly because of the limitations of rendering systems.
That proposal was stopped dead in its tracks in the encoding committees,
both because it would have been a duplicate encoding and normalization
nightmare, and because, well, current rendering systems *can* render
Tibetan just fine, thank you, given the current encoding.

> Hangul syllables being also a political
> compromise

From *1995*, when such a compromise was necessary to keep in place
the still fragile consensus which had driven 10646 and the Unicode Standard
into a still-evolving coexistence.

It is a mistake to extrapolate from that one example to the conclusion
that political decisions will inevitably lead to the encoding of hundreds
of thousands of additional, useless characters.

> does not instil confidence in the lines of defence. I don't
> dispute that there have also been victories. Has Japanese
> disunification been completely killed, or merely scotched?
>
>>> I think, however, that <high><high><rare
>>> BMP code><low> offers a legitimate extension mechanism
>> One could argue about the description as "legitimate". It is clearly
>> not conformant,
> With what? It's obviously not UTF-16 as we know it, but a possibly new
> type of code-unit sequence.

In whichever encoding form you choose to specify, the sequence <high><high>
is non-conformant. Not merely a possibly new type of code unit sequence.

<D800 D800> is non-conformant UTF-16

<0000D800 0000D800> is non-conformant UTF-32

<ED A0 80 ED A0 80> is non-conformant UTF-8
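
To make the UTF-16 case concrete, here is a minimal well-formedness
check (a Python sketch; the function name is illustrative, not from any
particular library). It rejects <D800 D800> precisely because a high
surrogate must be followed immediately by a low surrogate:

    def is_well_formed_utf16(units):
        """Return True if a sequence of 16-bit code units is well-formed UTF-16."""
        i, n = 0, len(units)
        while i < n:
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:  # high surrogate
                # must be immediately followed by a low surrogate
                if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                    i += 2
                else:
                    return False       # <high><high> or unpaired high
            elif 0xDC00 <= u <= 0xDFFF:
                return False           # unpaired low surrogate
            else:
                i += 1
        return True

    # <D800 D800> is rejected; <D800 DC00> (U+10000) is accepted.
    assert not is_well_formed_utf16([0xD800, 0xD800])
    assert is_well_formed_utf16([0xD800, 0xDC00])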

>
>> and would require a decision about an architectural change to the
>> standard.
> Naturally. The standard says only 17 planes. However, apart from
> UTF-16, the change to the *standard* would not be big. (Even so, a lot
> of UTF-8 and UTF-32 code would have to be changed to accommodate the new
> limit.)

Which is why this is never going to happen. (And yes, I said "never". ;-) )

>> I see no chance of that happening for either the Unicode
>> Standard or 10646.
> It will only happen when the need becomes obvious, which may be never,
> or may be 30 years hence. It's even conceivable that UTF-16 will
> drop out of use.

Could happen. It still doesn't matter, because such a proposal also breaks
UTF-8 and UTF-32.
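
The reason is the same in every encoding form: a conformant UTF-32 or
UTF-8 implementation only ever produces and accepts Unicode scalar
values, i.e. code points in U+0000..U+10FFFF excluding the surrogate
range. A rough sketch of that range check (Python again, with an
illustrative name):

    def is_unicode_scalar_value(cp):
        """True iff cp is a Unicode scalar value: in U+0000..U+10FFFF and
        not a surrogate code point (U+D800..U+DFFF)."""
        return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

    # Conformant UTF-32 carries only scalar values, so 0x0000D800 is rejected
    # outright, and nothing above 0x10FFFF can appear at all; UTF-8 decoders
    # apply the same limits when validating byte sequences.
    assert not is_unicode_scalar_value(0xD800)      # surrogate code point
    assert not is_unicode_scalar_value(0x110000)    # beyond Plane 16
    assert is_unicode_scalar_value(0x10FFFF)        # last code point, Plane 16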

>
>> Plane 0: 63,207 / 65,536 = 96.45% full
>
> I only see two planes that are actually full. Which are you counting
> as the full non-PUA plane?

The BMP. 96.45% full is, for all intents and purposes, considered "full"
now.

If you look at the BMP roadmap:

http://www.unicode.org/roadmaps/bmp/

there are only 9 columns left which are not already in assigned blocks.
More characters will gradually be added to existing blocks, of course,
filling in nooks and crannies, but the real action for new encoding has
now turned almost entirely to Plane 1.

--Ken
Received on Wed Aug 24 2011 - 19:10:03 CDT