A binary file format for storing character properties

From: Theo Veenker (Theo.Veenker@let.uu.nl)
Date: Tue May 04 2004 - 06:09:07 CDT


At this time there are about 160 different character properties defined
in the UCD. In practice most applications probably use only a limited set
of properties. Nevertheless, applications should be able to look up all
the properties of a code point. Compiling in lookup tables for all defined
properties (including Unihan) makes small applications rather big. This
led me to create a binary file format for storing character properties,
so that property lookup tables can be initialized on demand.

Benefits of using run-time loadable lookup tables initialized from binary
files are:

   - no worries about total table size, since data will only be loaded
     on demand

   - initializing lookup tables from a binary file is relatively fast

   - property lookup files can be locale specific (useful for character
     names and case mappings for example)

   - new properties can be added quickly and never affect layout or
     content of other tables

   - any number of properties can be supported including custom
     (non-Unicode) properties

   - by initializing a lookup table from two sources (UCD-based and
     vendor-based), applications can override the default property
     values assigned to PUA characters with private property values
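
To illustrate the last point, here is a minimal sketch of a lookup that
consults a vendor table before the UCD-based one. It is not the UPR
library's API: the names prop_range, prop_table, table_find, is_pua and
prop_lookup are hypothetical, the table layout (sorted ranges plus a
fallback value) is an assumption, and restricting the vendor override to
PUA code points is a design choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sparse table: sorted, non-overlapping code point ranges,
 * each mapped to one property value. */
typedef struct {
    uint32_t first;
    uint32_t last;
    int      value;
} prop_range;

typedef struct {
    const prop_range *ranges;
    size_t            count;
    int               fallback;  /* value for code points in no range */
} prop_table;

/* Binary search over the ranges. Returns 1 and stores the value in *out
 * if cp is covered, 0 otherwise. */
static int table_find(const prop_table *t, uint32_t cp, int *out)
{
    size_t lo = 0, hi = t->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < t->ranges[mid].first) {
            hi = mid;
        } else if (cp > t->ranges[mid].last) {
            lo = mid + 1;
        } else {
            *out = t->ranges[mid].value;
            return 1;
        }
    }
    return 0;
}

/* The three Private Use Areas defined by Unicode. */
static int is_pua(uint32_t cp)
{
    return (cp >= 0xE000   && cp <= 0xF8FF)  ||
           (cp >= 0xF0000  && cp <= 0xFFFFD) ||
           (cp >= 0x100000 && cp <= 0x10FFFD);
}

/* Vendor values win, but only for PUA code points (an assumption of this
 * sketch); everything else comes from the UCD-based table. */
int prop_lookup(const prop_table *ucd, const prop_table *vendor, uint32_t cp)
{
    int v;
    assert(cp <= 0x10FFFF);
    if (vendor != NULL && is_pua(cp) && table_find(vendor, cp, &v))
        return v;
    if (table_find(ucd, cp, &v))
        return v;
    return ucd->fallback;
}
```

Passing NULL for the vendor table gives plain UCD behaviour, so an
application without private data pays nothing for the extra layer.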

The file format I've implemented is capable of storing any type of property.
Each file contains the values of a single property (no more squeezing as
many property values as possible into as few bits as possible). The format
is called UPR (Unicode PRoperties).
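
Because each property lives in its own file, a table can be loaded the
first time it is asked for and kept afterwards. The sketch below shows
only that caching pattern. It makes no assumption about the UPR file
layout: the actual file reading is hidden behind a loader callback, and
all names (prop_cache, cache_get, demo_loader, MAX_PROPS, the contents
of prop_data) are hypothetical and not taken from the UPR libraries.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROPS 256  /* arbitrary upper bound for this sketch */

/* Stand-in for a decoded lookup table; a real one would hold the data. */
typedef struct prop_data {
    int prop_id;
} prop_data;

/* Reads one property file and returns its table, or NULL on failure. */
typedef prop_data *(*prop_loader)(int prop_id);

typedef struct {
    prop_loader load;
    prop_data  *slots[MAX_PROPS];  /* NULL until first use */
} prop_cache;

void cache_init(prop_cache *c, prop_loader load)
{
    size_t i;
    c->load = load;
    for (i = 0; i < MAX_PROPS; i++)
        c->slots[i] = NULL;
}

/* Returns the table for prop_id, loading it on first request only.
 * A failed load leaves the slot NULL, so it is retried next time. */
prop_data *cache_get(prop_cache *c, int prop_id)
{
    assert(prop_id >= 0 && prop_id < MAX_PROPS);
    if (c->slots[prop_id] == NULL)
        c->slots[prop_id] = c->load(prop_id);
    return c->slots[prop_id];
}

/* Demo loader that only counts calls; a real loader would open and
 * decode the file belonging to prop_id here. */
int demo_load_calls = 0;

prop_data *demo_loader(int prop_id)
{
    static prop_data storage[MAX_PROPS];
    storage[prop_id].prop_id = prop_id;
    demo_load_calls++;
    return &storage[prop_id];
}
```

With this arrangement an application that only ever asks for, say, the
general category never touches the Unihan data at all.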

I have written a tool to generate the necessary UPR files from the UCD.
Also available are a small C library for reading a UPR file into a
property lookup table and a high-level library that provides property
lookup functions for *all* Unicode properties in 4.0.0.

For more information on the file format and related software, see
http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. My primary
development platform is UNIX/Linux, but you can compile and run it under
Windows as well (though it is less tested there). The current version
supports UCD 4.0.0; I will add support for 4.0.1 soon.

Please check it out. Feedback is welcome.

Regards,

Theo Veenker


