Re: "uctype.h": a Unicode-based character classification API

From: Mark Leisher (mleisher@crl.nmsu.edu)
Date: Tue Feb 10 1998 - 22:32:09 EST


    John> As I am dissatisfied with the constraints that the POSIX "ctype.h"
    John> classification API puts on characters, and its over-specificity
    John> compared with the Unicode model, I have built and tested an
    John> alternative API known as "uctype.h", and I am now releasing the
    John> source code for Version 2.0. (Version 1.0 was differently conceived
    John> and never made it out the door.)

How fortuitous. I prepared much the same thing for release today. No
competition is intended, really. More choice is always a good thing!

I'm calling my package "Unidata 1.0" and here is a short blurb:

Unidata 1.0 is a small package that implements ctype-like operations for
Unicode characters, including the range U-10000 to U-10FFFF. In addition, it
provides case mapping tables and decompositions.

It parses the original UnicodeData-2.0.14.txt (or later) file and generates
three binary data files: one for ctype info, one for case mapping, and one for
decompositions. Adding characters is a matter of adding a Unicode Character
Database format entry in another file and adding that file to the command line
of the parsing program. In fact, I provide a short sample file which adds
some new properties.
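
To make the parsing step concrete, here is a minimal sketch of reading one Unicode Character Database line in C. It is not the Unidata parser itself: the names ucd_entry, ucd_field, and ucd_parse_line are hypothetical, and the sketch pulls out only the code point, general category, and upper/lower case mappings (fields 0, 2, 12, and 13 of the semicolon-separated format).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record holding the fields a ctype-like table needs. */
typedef struct {
    unsigned long code;   /* field 0: code point in hex */
    char category[3];     /* field 2: general category, e.g. "Lu" */
    unsigned long upper;  /* field 12: uppercase mapping, 0 if none */
    unsigned long lower;  /* field 13: lowercase mapping, 0 if none */
} ucd_entry;

/* Return a pointer to the start of field n (0-based) in a ';'-separated
 * line, or NULL if the line has fewer fields.  strtok() is avoided
 * because it would silently skip the empty fields the format relies on. */
static const char *ucd_field(const char *line, int n)
{
    while (n-- > 0) {
        line = strchr(line, ';');
        if (line == NULL)
            return NULL;
        line++;
    }
    return line;
}

/* Parse one line; returns 1 on success, 0 on malformed input. */
int ucd_parse_line(const char *line, ucd_entry *out)
{
    const char *cat, *up, *lo;

    assert(line != NULL && out != NULL);

    cat = ucd_field(line, 2);
    up = ucd_field(line, 12);
    lo = ucd_field(line, 13);
    if (cat == NULL || up == NULL || lo == NULL)
        return 0;
    /* The general category must be exactly two characters long. */
    if (cat[0] == ';' || cat[0] == '\0' || cat[1] == ';' || cat[1] == '\0')
        return 0;

    out->code = strtoul(line, NULL, 16);
    out->category[0] = cat[0];
    out->category[1] = cat[1];
    out->category[2] = '\0';
    out->upper = strtoul(up, NULL, 16);  /* an empty field parses as 0 */
    out->lower = strtoul(lo, NULL, 16);
    return 1;
}
```

Because an extra file in the same format goes through the same routine, adding characters needs no new parsing code, which is presumably why the package accepts additional files on the command line.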

The decompositions are generated in fully expanded form, so there is no need
to recursively expand them if you need to use them.
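
For readers wondering what "fully expanded" saves them, the sketch below shows the recursive expansion a client would otherwise have to do. The two-entry table and the names decomp_entry, decomp_find, and decomp_expand are hypothetical and are not the Unidata data format; the sample decompositions themselves (U+01D5 to U+00DC U+0304, and U+00DC to U+0055 U+0308) come from the Unicode Character Database.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-level decomposition table with two sample entries. */
typedef struct {
    unsigned long code;
    unsigned long parts[2];
    int nparts;
} decomp_entry;

static const decomp_entry decomp_table[] = {
    { 0x00DCUL, { 0x0055UL, 0x0308UL }, 2 },  /* U WITH DIAERESIS */
    { 0x01D5UL, { 0x00DCUL, 0x0304UL }, 2 }   /* U WITH DIAERESIS AND MACRON */
};

/* Linear search is enough for a sketch; a real table would be sorted. */
static const decomp_entry *decomp_find(unsigned long code)
{
    size_t i;

    for (i = 0; i < sizeof(decomp_table) / sizeof(decomp_table[0]); i++)
        if (decomp_table[i].code == code)
            return &decomp_table[i];
    return NULL;
}

/* Expand code recursively into out[], writing at most max values.
 * Returns the number of values written; output is truncated if the
 * buffer is too small. */
size_t decomp_expand(unsigned long code, unsigned long *out, size_t max)
{
    const decomp_entry *d;
    size_t n = 0;
    int i;

    assert(out != NULL);

    d = decomp_find(code);
    if (d == NULL) {
        /* No decomposition: the character stands for itself. */
        if (max == 0)
            return 0;
        out[0] = code;
        return 1;
    }
    for (i = 0; i < d->nparts; i++)
        n += decomp_expand(d->parts[i], out + n, max - n);
    return n;
}
```

Expanding U+01D5 this way yields U+0055 U+0308 U+0304. Storing that result directly in the data file means the lookup at run time is a single table read with no recursion.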

The code works with Traditional C, ISO/IEC C, and C++ compilers. All the data
is stored in 32-bit form (to allow U-10000 to U-10FFFF), and endian swaps are
handled automatically. No unusual data types are used, and architectures that
need data aligned on 4-byte boundaries are taken into account.
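
One common way to handle endian swaps automatically is to begin the file with a known marker word and byte-swap everything if the marker reads back reversed. The sketch below assumes such a layout; the marker value UD_MAGIC and the function names are hypothetical and are not taken from the Unidata file formats. It also uses uint32_t from C99 for brevity, which the package itself, targeting Traditional C, would not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker stored as the first 32-bit word of a data file. */
#define UD_MAGIC         0x0000FEFFu
#define UD_MAGIC_SWAPPED 0xFFFE0000u

/* Reverse the byte order of one 32-bit word. */
static uint32_t ud_swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

/* Inspect the marker in words[0] and, if the data was written on a
 * machine of the opposite byte order, swap every word in place.
 * Returns 1 if the data is now in native order, 0 if the marker is
 * not recognized. */
int ud_fix_byte_order(uint32_t *words, size_t count)
{
    size_t i;

    assert(words != NULL);

    if (count == 0)
        return 0;
    if (words[0] == UD_MAGIC)
        return 1;                 /* already native order */
    if (words[0] != UD_MAGIC_SWAPPED)
        return 0;                 /* not a file we understand */
    for (i = 0; i < count; i++)
        words[i] = ud_swap32(words[i]);
    return 1;
}
```

Keeping every value 32 bits wide makes this a single uniform pass over the loaded buffer, and it also keeps every value on a 4-byte boundary.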

The default set of properties (from UnicodeData-2.0.14.txt and the extras I
provide), case mappings, and decompositions end up about 62K total in
well-documented formats that are easy to load and use. External data files
are used to avoid recompiling applications when the data files are updated.
There are ways to make these data files smaller, but that is left as an
exercise for the reader.

As with John's uctype package, this one also has an X11-style copyright which
basically means it can be used by most everybody.

I even have URLs :-)

  [For LF lovers]
  ftp://crl.nmsu.edu/CLR/multiling/unicode/unidata-1.0.tar.gz

  [For CRLF lovers]
  ftp://crl.nmsu.edu/CLR/multiling/unicode/unidat10.zip

The next thing on my agenda is a simple (in the same spirit as Unidata 1.0)
reordering algorithm to get those individual developers interested. I've
discovered developers don't like to wade through thousands of lines of my code
just to figure out how to reorder text for presentation. What is their
problem ?:-)

    John> O Sarasvati: I would love to see this code in a
    John> /Public/SOFTWARE/CONTRIB subdirectory. Is this possible? If so,
    John> tell me where and how to upload it. If not, I will post an URL in
    John> due course.

Sarasvati, you are welcome to include my package as well. It needs more
parents.

I will be gone the next 4 days, so please don't think I am avoiding answering
your email, even if I am and you don't know it :-)
------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            "... I could lard the text with
New Mexico State University        hotlinks and hotbuttons ..."
Box 30001, Dept. 3CRL                 -- Paraphrased from
Las Cruces, NM 88003                  -- "Headcrash," Bruce Bethke


