Re: Beyond 17 planes, was: Java char and Unicode 3.0+

From: Asmus Freytag (
Date: Thu Oct 16 2003 - 12:31:03 CST

At 08:03 AM 10/16/03 -0700, Peter Kirk wrote:
>Or perhaps a way can be found to graciously retire UTF-16 in some distant
>future version of Unicode. That is likely to become viable long before the
>extra planes are needed.

This discussion is a pure numbers game. Since no-one can define a hard
number for a cut-off that's guaranteed to be good 'forever', all we have is
probability. (That's all we have anyway, whether in life or science). So
the question becomes an estimate of probability.

128 characters (ASCII) cover 80% of the characters needed by 5% of the
world's population
256 characters (Latin-1) cover 80% of the characters needed by 15% of the
world's population
40,000 characters (Unicode 1.0) cover 95% of the characters needed by 85%
of the world's population
90,000 characters (Unicode 4.0) cover 98% of the characters needed by 95%
of the world's population

Exercise for the reader:

Where do the other 910,000 characters come from, and who's using them?

If the UTC and WG2 add 1,000 characters per amendment, how many amendments
will it take to fill the remaining space?

[Note: the number of characters accepted so far by UTC for the next
amendment is 684]

Estimate the effect of some number of larger amendments (CJK)?

[Note: account for the possible use of variation selectors to code Han

Given your answers to the previous question, estimate when the BMP will be
completely filled.

[Hint: each WG meeting issues at most one amendment, meetings are at least
six months apart]

Extra credit:
Give a believable estimate for the other 16 planes.
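
For readers who want to check the arithmetic, here is a rough sketch in
Python. It takes the 910,000 figure, the 684 and 1,000 characters per
amendment, and the six-month meeting interval from the message above. The
10,000-per-amendment rate is a made-up stand-in for a "larger" CJK-sized
amendment, and the function names are hypothetical. It assumes a constant
rate and one amendment at every meeting, so the year counts are lower
bounds, not predictions.

```python
import math

# Code-space fact: 17 planes of 65,536 code points each.
PLANES = 17
PLANE_SIZE = 0x10000
CODE_SPACE = PLANES * PLANE_SIZE  # 1,114,112 code points in total

# Round figure quoted in the message for the characters still unaccounted for.
REMAINING = 910_000


def amendments_needed(remaining, per_amendment):
    """Number of amendments needed to add `remaining` characters."""
    return math.ceil(remaining / per_amendment)


def years_needed(amendments, months_between=6):
    """Lower bound in years: at most one amendment per meeting."""
    return amendments * months_between / 12


# 684 and 1,000 come from the message; 10,000 is an assumed larger rate.
for rate in (684, 1_000, 10_000):
    n = amendments_needed(REMAINING, rate)
    print(f"{rate:>6} chars/amendment: {n:>5} amendments, "
          f"at least {years_needed(n):.1f} years")
```

At 1,000 characters per amendment this works out to 910 amendments, or at
least 455 years of meetings held six months apart.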


PS: private answer to Jill: make sure that your characters are always
represented internally by infinite precision integers. That way you are not
arbitrarily limited by 32-bit integral data types. ;-)
