I have been mostly listening here and reading various Unicode/10646 web
sites. I am sorry if the following has already been discussed; I did not check the archives.
There is one little thing that I would like to bring into the discussion:
The range of available code points.
It seems that the original idea was to be able to encode everything in 16b,
which turned out to be too few.
Now there are proposed characters in planes 1, 2, and 14, and planes 15 and
16 are for private use.
Effectively, code points are used and proposed from 0000 0000 to 0010 ffff,
which takes about 20.1 bits, or 21b when stored as a regular integer.
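
As a quick check of that arithmetic, here is a minimal Python sketch. The
variable names are only illustrative and do not come from any standard.

```python
import math

# Code points run from 0 to 0x10FFFF inclusive: 17 planes of 65,536 each.
MAX_CODE_POINT = 0x10FFFF
count = MAX_CODE_POINT + 1

# Fractional number of bits needed to distinguish all code points.
bits_exact = math.log2(count)

# Whole bits needed when a code point is stored as a plain integer.
bits_integer = MAX_CODE_POINT.bit_length()

print(count, round(bits_exact, 2), bits_integer)
```

The fractional figure comes out slightly under 20.1, and the integer width
is 21 bits.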
Full 32b integers are not very popular for storage etc., so surrogate
pairs/UTF-16 are used for the default encoding. In UTF-8, the
four-byte codes are not fully used.
I like the limitation of the intended range to less than 31b.
However, I think it makes sense to limit it further to just use 20b, i.e.,
code points 0000 0000 to 000f ffff, and not use the private use plane 16.
This should make it easier to have simple codes with just scalar integers,
like a stream of 5 bytes per 2 characters, and it yields a nice, round
number of bits for the scalar values.
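
To illustrate the "5 bytes per 2 characters" idea, here is a rough sketch
in Python. The big-endian layout and the function names pack_pair and
unpack_pair are my own assumptions for the example, not an existing or
proposed encoding.

```python
LIMIT_20BIT = 0xFFFFF  # highest code point under the proposed 20-bit range


def pack_pair(a, b):
    """Pack two 20-bit code points into 5 bytes (big-endian, a first)."""
    if not (0 <= a <= LIMIT_20BIT and 0 <= b <= LIMIT_20BIT):
        raise ValueError("code point outside the 20-bit range")
    return ((a << 20) | b).to_bytes(5, "big")


def unpack_pair(data):
    """Recover the two 20-bit code points from a 5-byte group."""
    if len(data) != 5:
        raise ValueError("expected exactly 5 bytes")
    value = int.from_bytes(data, "big")
    return value >> 20, value & LIMIT_20BIT


packed = pack_pair(0x41, 0xFFFFF)
print(packed.hex(), unpack_pair(packed))
```

A code point in plane 16, such as 0x100000, does not fit and is rejected by
pack_pair, which is exactly the restriction suggested above.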
One disadvantage may be that some values in the surrogate pairs/UTF-16
encoding would become illegal (dbc0...dbff in the first 16b word).
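
The dbc0...dbff range can be verified with the usual surrogate pair
arithmetic. This is a small Python sketch; the function name to_surrogates
is only illustrative.

```python
def to_surrogates(cp):
    """Split a supplementary code point into (first, second) 16-bit words."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    first = 0xD800 + (offset >> 10)     # upper 10 bits of the offset
    second = 0xDC00 + (offset & 0x3FF)  # lower 10 bits of the offset
    return first, second


# First and last code points of plane 16, and the last one of plane 15.
for cp in (0x100000, 0x10FFFF, 0xFFFFF):
    first, second = to_surrogates(cp)
    print(f"{cp:06x} -> {first:04x} {second:04x}")
```

Plane 16 maps to first words dbc0 through dbff, while the last code point
of plane 15 still uses dbbf.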
With UTF-8, the upper limit of the code range could be checked by just
looking at the first byte of a character, since exactly half of the
possible 4B lead bytes (f0...f3 out of f0...f7) would be used.
With the current range, the second byte needs to be examined, too.
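
A sketch of the two upper-limit checks in Python follows. The function names
are made up for this example, and the sketch looks only at the upper limit
of the range: rejecting overlong forms (f0 followed by 80...8f) needs the
second byte under either limit.

```python
def four_byte_ok_20bit(lead):
    """Proposed 20-bit limit: lead bytes f0..f3 cover up to 0xFFFFF."""
    return 0xF0 <= lead <= 0xF3


def four_byte_ok_current(lead, second):
    """Current limit of 0x10FFFF: f4 is valid only with second byte 80..8f."""
    if 0xF0 <= lead <= 0xF3:
        return True
    return lead == 0xF4 and 0x80 <= second <= 0x8F


# Encode the boundary code points with Python's own UTF-8 encoder.
last_20bit = chr(0xFFFFF).encode("utf-8")     # lead byte f3
first_plane16 = chr(0x100000).encode("utf-8")  # lead byte f4
print(last_20bit.hex(), first_plane16.hex())
```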
If 6400+64k private use characters are not enough, then one of the 11
unused planes could be used instead of plane 16.
PS: If only the 9bit-byte had become the standard...
Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430