Re: wchar.h, wctype.h question

From: John Cowan
Date: Wed May 05 1999 - 14:11:58 EDT

G. Adam Stanislav wrote:

> I am a bit confused about the planes in ISO-10646. Where on the web can I
> find a description of these planes?

A plane is a chunk of 65536 characters starting at 0, so
Plane 0 is 0 to 0xFFFF, Plane 1 is 0x10000 to 0x1FFFF, and so on.
The planes are allocated as follows:

Plane (hex)  Purpose
0            Basic Multilingual Plane (BMP)
1            Archaic and esoteric writing systems
2            Rare and ad-hoc CJK characters
3-D          Reserved
E            IETF language tagging
F-10         Private use
11-7FFF      Almost certainly never going to be used for anything
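
Since every plane is 0x10000 code positions wide, the plane number is just
the code point shifted right by 16 bits. A minimal sketch in C follows; the
function names are made up for illustration and belong to no standard API,
and the purpose strings simply restate the allocation listed above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names, chosen for this sketch only. */

/* Plane number of a code point: each plane holds 0x10000 (65536) positions. */
uint32_t plane_of(uint32_t cp)
{
    assert(cp <= 0x7FFFFFFFu);  /* planes 0 to 7FFF: a 31-bit code space */
    return cp >> 16;
}

/* Position of the code point inside its plane (0 to 0xFFFF). */
uint32_t offset_in_plane(uint32_t cp)
{
    return cp & 0xFFFFu;
}

/* Purpose of a plane, restating the allocation list above. */
const char *plane_purpose(uint32_t plane)
{
    if (plane == 0x0)  return "Basic Multilingual Plane (BMP)";
    if (plane == 0x1)  return "Archaic and esoteric writing systems";
    if (plane == 0x2)  return "Rare and ad-hoc CJK characters";
    if (plane <= 0xD)  return "Reserved";
    if (plane == 0xE)  return "IETF language tagging";
    if (plane <= 0x10) return "Private use";
    return "Unused";
}
```

For example, plane_of(0x1FFFF) is 1 and offset_in_plane(0x1FFFF) is 0xFFFF.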

Neither Unicode 2.x nor Unicode 3.0 (in preparation) installs any
non-private characters in planes other than 0.

> Are there any algorithms for the implementation of these functions that I
> should be aware of before trying to reinvent the wheel?

Mark Leisher and I have implemented a common API which could easily
be front-ended with the C standard API. Mark's implementation uses a
binary file a la TZ, so it is easy to extend as the Unicode Standard grows
without impacting applications. Mine uses compact (about 6K)
compiled-in tables, built by a Perl script.

Both implementations use X-style licensing, and I believe you should
adopt either one or the other or both, with whatever mods you want.
Considerable implementation-strategy effort has gone into both.

Mark's implementation is at
        with a patch at .../ucdata-1.9.patch1
Mine is at
        with a patch at .../uctype-2.0.1.patch.txt

We built a new API because we wanted to represent all the categories
of Unicode, which are a much richer set than the Posix ones.
See the file THEORY in my implementation.

> Alas, this
> seems an imperfect solution as there is no way of knowing what future
> extensions will be added to ISO-10646 (I have seen quite a number of
> proposals for such extensions on your web site and there, no doubt, will be
> more).

There is no getting away from this problem: Unicode, unlike typical
8-bit coded character sets, is inherently extensible. New characters
will go on being added for years. Mark's implementation assumes
the existence and accessibility of a file; mine is for space-and-speed-tight
situations where you either don't mind compiling or are willing to
live with "unknown character" situations for new characters.

> Is there a better way? Is there a system to this? What I mean is, is there
> some way of knowing that if for example a specific bit in character code is
> set, it is a digit? Or if another bit is set, it is an alphabetic letter?

No. Tables are inevitable, and the question is, how should they be
compacted? My implementation represents a Plane 0 character as a
bit-vector of size 32 (i.e. a long) specifying its Unicode properties.
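
To illustrate the bit-vector idea, here is a minimal sketch in C. The bit
assignments and function names are assumptions invented for this example,
not the ones used in either package; the point is only that a class test
such as "alphanumeric" becomes a single mask operation on the 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments, chosen for this sketch only. */
#define PROP_UPPER (UINT32_C(1) << 0)
#define PROP_LOWER (UINT32_C(1) << 1)
#define PROP_DIGIT (UINT32_C(1) << 2)
#define PROP_SPACE (UINT32_C(1) << 3)
#define PROP_ALPHA (PROP_UPPER | PROP_LOWER)

/* Nonzero if the property vector has at least one of the bits in mask. */
int has_any(uint32_t props, uint32_t mask)
{
    assert(mask != 0);  /* an empty mask is a caller error */
    return (props & mask) != 0;
}

/* A POSIX-style class test expressed as a union of property bits. */
int is_alnum_props(uint32_t props)
{
    return has_any(props, PROP_ALPHA | PROP_DIGIT);
}
```

A front end for the C standard API could, under these assumptions, look up
the vector for a character and pass it to a test like is_alnum_props().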

Then each *distinct* bit-vector (there are fewer than 512 of them)
is stored in a table, so a 9-bit index into the table can
represent the bit-vector. I then generate a run-length-encoding
of successive indexes, using 7 bits for each length, and compute 512
useful offsets (one for every 128-character half-row) into the
run-length-encoding table so that it does not have to be searched very far.
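
A rough sketch of such a lookup in C follows. It is not the actual code
from the package: the names, the 16-bit packing (9-bit index in the high
bits, length minus one in the low 7 bits), and the rule that no run crosses
a half-row boundary are all assumptions made for illustration. The table is
toy data covering only two half-rows, with indexes 1 to 4 standing for
"upper", "lower", "digit" and "space"; the second half-row is a placeholder.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed packing: 9-bit vector index, then 7 bits holding (length - 1). */
#define RUN(index, len) ((uint16_t)(((index) << 7) | ((len) - 1)))
#define RUN_INDEX(r)    ((unsigned)(r) >> 7)
#define RUN_LENGTH(r)   (((unsigned)(r) & 0x7Fu) + 1u)

/* Toy run table; runs never cross a half-row boundary in this sketch. */
static const uint16_t runs[] = {
    /* half-row 0 (0x00 to 0x7F): ASCII, simplified */
    RUN(0, 32), RUN(4, 1), RUN(0, 15), RUN(3, 10), RUN(0, 7),
    RUN(1, 26), RUN(0, 6), RUN(2, 26), RUN(0, 5),
    /* half-row 1 (0x80 to 0xFF): placeholder, not real data */
    RUN(0, 128)
};

/* Position in runs[] of the first run of each 128-character half-row. */
static const uint16_t half_row_start[] = { 0, 9 };
#define N_HALF_ROWS (sizeof half_row_start / sizeof half_row_start[0])

/* Return the bit-vector table index for a character. */
unsigned vector_index(uint32_t cp)
{
    uint32_t half_row = cp >> 7;    /* which 128-character half-row */
    unsigned pos = cp & 0x7Fu;      /* position inside that half-row */
    const uint16_t *run;

    assert(half_row < N_HALF_ROWS); /* the toy table stops at 0xFF */
    run = &runs[half_row_start[half_row]];
    while (pos >= RUN_LENGTH(*run)) {  /* skip runs that end before pos */
        pos -= RUN_LENGTH(*run);
        run++;
    }
    return RUN_INDEX(*run);
}
```

With the offsets, a lookup scans at most the runs of one half-row instead
of walking the run table from the beginning.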

Feel free to contact me privately for further assistance.

John Cowan
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)
