Re: wchar.h, wctype.h question

From: John Cowan (cowan@locke.ccil.org)
Date: Wed May 05 1999 - 14:11:58 EDT


G. Adam Stanislav wrote:

> I am a bit confused about the planes in ISO-10646. Where on the web can I
> find a description of these planes?

A plane is a chunk of 65536 characters starting at 0, so
Plane 0 is 0 to 0xFFFF, Plane 1 is 0x10000 to 0x1FFFF, and so on.
The planes are allocated as follows:

Plane (hex)  Purpose
0            Basic Multilingual Plane (BMP)
1            Archaic and esoteric writing systems
2            Rare and ad-hoc CJK characters
3-D          Reserved
E            IETF language tagging
F-10         Private use
11-7FFF      Almost certainly never going to be used for anything
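
In other words, the plane number of a UCS-4 value is just its high-order
bits: shift right by 16 and you have it.  A trivial illustration (nothing
here is specific to any particular library):

    #include <stdio.h>

    /* The plane of a code point is simply bits 16 and up. */
    static unsigned long plane_of(unsigned long ucs)
    {
        return ucs >> 16;
    }

    int main(void)
    {
        printf("%lX\n", plane_of(0x0041UL));   /* LATIN CAPITAL LETTER A: plane 0    */
        printf("%lX\n", plane_of(0x10000UL));  /* first code point past the BMP: plane 1 */
        printf("%lX\n", plane_of(0xF0000UL));  /* start of the private-use planes: plane F */
        return 0;
    }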

Neither Unicode 2.x nor Unicode 3.0 (now in preparation) assigns any
non-private-use characters to planes other than Plane 0.
 
> Are there any algorithms for the implementation of these functions that I
> should be aware of before trying to reinvent the wheel?

Mark Leisher and I have implemented a common API which could easily
be front-ended with the C standard API. Mark's implementation uses a
binary file a la TZ, so it is easy to extend as the Unicode Standard grows
without impacting applications. Mine uses compact (about 6K)
compiled-in tables, built by a Perl script.
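
To make "front-ended with the C standard API" concrete, here is the sort
of thin wrapper I have in mind; the ucprops() call and the UC_* flag names
are placeholders, not the actual interface of either package:

    #include <wctype.h>     /* wint_t and the standard iswxxx() prototypes */

    /* Placeholder property-query function and flags; substitute whatever
       property lookup the implementation you adopt actually provides.
       Definitions like these would live inside the C library itself. */
    extern unsigned long ucprops(unsigned long ucs);
    #define UC_ALPHA  0x0001UL
    #define UC_DIGIT  0x0002UL
    #define UC_SPACE  0x0004UL

    int iswalpha(wint_t wc) { return (ucprops((unsigned long)wc) & UC_ALPHA) != 0; }
    int iswdigit(wint_t wc) { return (ucprops((unsigned long)wc) & UC_DIGIT) != 0; }
    int iswspace(wint_t wc) { return (ucprops((unsigned long)wc) & UC_SPACE) != 0; }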

Both implementations use X-style licensing, and I believe you should
adopt one or the other (or both), with whatever mods you want.
Considerable implementation-strategy effort has gone into both.

Mark's implementation is at
        ftp://crl.nmsu.edu/CLR/multiling/unicode/unidata-1.9.tar.gz
        with a patch at .../ucdata-1.9.patch1
Mine is at http://www.ccil.org/~cowan/uctype-2.0.tar.gz
        with a patch at .../uctype-2.0.1.patch.txt

We built a new API because we wanted to represent all the categories
of Unicode, which are a much richer set than the Posix ones.
See the file THEORY in my implementation.
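
To give a flavor of the difference (the category names below are the
standard Unicode general categories; the flag encoding is purely
illustrative): Unicode defines roughly thirty general categories, Posix
only about a dozen character classes, so a Posix class is typically a
union of several Unicode categories.

    /* A few of Unicode's general categories, one flag bit each. */
    #define GC_Lu (1UL << 0)   /* Letter, uppercase     */
    #define GC_Ll (1UL << 1)   /* Letter, lowercase     */
    #define GC_Lt (1UL << 2)   /* Letter, titlecase     */
    #define GC_Lm (1UL << 3)   /* Letter, modifier      */
    #define GC_Lo (1UL << 4)   /* Letter, other         */
    #define GC_Nd (1UL << 5)   /* Number, decimal digit */
    #define GC_Zs (1UL << 6)   /* Separator, space      */

    /* A Posix-style "alpha" class then has to OR several of them together. */
    #define POSIX_ALPHA_MASK (GC_Lu | GC_Ll | GC_Lt | GC_Lm | GC_Lo)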

> Alas, this
> seems an imperfect solution as there is no way of knowing what future
> extensions will be added to ISO-10646 (I have seen quite a number of
> proposals for such extensions on your web site and there, no doubt, will be
> more).

There is no getting away from this problem: Unicode, unlike typical
8-bit coded character sets, is inherently extensible. New characters
will go on being added for years. Mark's implementation assumes
the existence and accessibility of a data file at run time; mine is for
space-and-speed-tight situations where you either don't mind recompiling
as the tables change or are willing to live with "unknown character"
results for newly added characters.

> Is there a better way? Is there a system to this? What I mean is, is there
> some way of knowing that if for example a specific bit in character code is
> set, it is a digit? Or if another bit is set, it is an alphabetic letter?

No. Tables are inevitable, and the question is, how should they be
compacted? My implementation represents a Plane 0 character as a
bit-vector of size 32 (i.e. a long) specifying its Unicode properties.

Then each *distinct* bit-vector (there are fewer than 512 of them)
is stored in a table, so a 9-bit index into the table can
represent the bit-vector. I then generate a run-length-encoding
of successive indexes, using 7 bits for each length, and compute 512
useful offsets (one for every 128-character half-row) into the
run-length-encoding table so that it does not have to be searched very
far.
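
For the curious, the lookup then goes roughly like this; the table names,
the 16-bit packing, and the tiny placeholder data are mine for illustration,
not the actual generated tables (the generator splits runs at half-row
boundaries so that each stored offset points at a run beginning exactly on
its half-row):

    /* The distinct 32-bit property vectors (fewer than 512 in practice).
       Placeholder data: a single "no properties" vector. */
    static const unsigned long prop_vectors[] = { 0x00000000UL };

    /* Run-length encoding of successive vector indexes.  Each 16-bit entry
       packs a 9-bit index into prop_vectors (upper bits) with a 7-bit run
       length minus one (lower bits).  Placeholder: one 128-character run. */
    static const unsigned short rle_stream[] = { (0u << 7) | 127u };

    /* One starting offset into rle_stream for each 128-character half-row
       of Plane 0 (512 offsets in all), so no lookup walks the stream far.
       Placeholder: every half-row starts at entry 0. */
    static const unsigned short halfrow_offset[512] = { 0 };

    /* Return the property bit-vector for a Plane 0 code point. */
    unsigned long uc_props(unsigned short c)
    {
        unsigned pos     = c & 0x7F;                 /* position within its half-row */
        unsigned i       = halfrow_offset[c >> 7];   /* first run of that half-row   */
        unsigned covered = 0;

        for (;;) {
            unsigned runlen = (rle_stream[i] & 0x7Fu) + 1;   /* 7-bit length */
            unsigned index  =  rle_stream[i] >> 7;           /* 9-bit index  */
            if (pos < covered + runlen)
                return prop_vectors[index];
            covered += runlen;
            i++;
        }
    }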

Feel free to contact me privately for further assistance.

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@ccil.org
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)


