Efficient Storage and Use of Unicode Property Data

Lloyd Honomichl - INT'L.com

Intended Audience: Software Engineer, Systems Analyst
Session Level: Intermediate

When single-byte character sets ruled the earth, C programmers had a very limited set of character properties to deal with. ANSI C defined a primitive set of 11 character typing functions, including isalpha(), iscntrl(), isdigit(), isprint(), islower() and isupper(), along with the simple case mapping functions toupper() and tolower(). Most runtime libraries stored each property in a single bit, so an array of 256 entries was enough to store all the data needed for the iswhatever() functions. Two more tables of 256 bytes each for upper and lower casing and you had all you needed for a
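The single-byte scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code of any particular runtime library: the names (prop_table, sb_isalpha and so on) are hypothetical, only four property bits are shown, and the tables are filled for the ASCII range only, assuming an ASCII-compatible execution character set.

```c
#include <assert.h>

/* One bit per property. All names here are hypothetical. */
enum {
    P_UPPER = 1 << 0,
    P_LOWER = 1 << 1,
    P_DIGIT = 1 << 2,
    P_CNTRL = 1 << 3
};

static unsigned char prop_table[256];   /* property bits for each byte  */
static unsigned char upper_table[256];  /* toupper-style mapping        */
static unsigned char lower_table[256];  /* tolower-style mapping        */

/* Fill the tables for ASCII only (assumes an ASCII-compatible charset).
   Bytes 128..255 get no properties and map to themselves. */
static void init_tables(void)
{
    int c;
    for (c = 0; c < 256; c++) {
        prop_table[c] = 0;
        upper_table[c] = (unsigned char)c;
        lower_table[c] = (unsigned char)c;
    }
    for (c = 0; c < 32; c++)
        prop_table[c] |= P_CNTRL;
    prop_table[127] |= P_CNTRL;
    for (c = '0'; c <= '9'; c++)
        prop_table[c] |= P_DIGIT;
    for (c = 'A'; c <= 'Z'; c++) {
        prop_table[c] |= P_UPPER;
        lower_table[c] = (unsigned char)(c - 'A' + 'a');
    }
    for (c = 'a'; c <= 'z'; c++) {
        prop_table[c] |= P_LOWER;
        upper_table[c] = (unsigned char)(c - 'a' + 'A');
    }
}

/* Every classification function is one array read and one mask test. */
static int has_prop(int c, unsigned mask)
{
    assert(c >= 0 && c <= 255);
    return (prop_table[c] & mask) != 0;
}

static int sb_isalpha(int c) { return has_prop(c, P_UPPER | P_LOWER); }
static int sb_isdigit(int c) { return has_prop(c, P_DIGIT); }

static int sb_toupper(int c)
{
    assert(c >= 0 && c <= 255);
    return upper_table[c];
}

static int sb_tolower(int c)
{
    assert(c >= 0 && c <= 255);
    return lower_table[c];
}
```

After a single call to init_tables(), a test such as sb_isalpha('q') costs one table read and one bitwise AND, and the whole data set occupies 768 bytes.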

Now we have a Unicode Character Database that is somewhat more complicated. There are 15 properties in the database for Unicode 3.0. Some of these, such as character name, are seldom used in programming logic, but most are needed for proper support of Unicode in modern programs. Many of these properties are not simple on/off bit values. For instance, the bidirectional property can have 11 different values. The Character Decomposition property is variable width. And what does isdigit() mean anyway in the Unicode world?
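To illustrate why a multi-valued property no longer fits the one-bit-per-property model, here is a small sketch that packs a 4-bit bidirectional class (enough for 11 values) alongside a few on/off flags in one 16-bit record. The record layout, the macro names and the numeric codes assigned to the bidi classes are all assumptions made for this example; they are not taken from the paper or from the Unicode Character Database.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-bit property record:
   bits 0..3  bidirectional class (up to 16 distinct values)
   bits 4..6  simple on/off flags                            */
typedef uint16_t prop_record;

#define BIDI_BITS   4
#define BIDI_MASK   ((1u << BIDI_BITS) - 1u)   /* 0x000F */
#define FLAG_UPPER  (1u << 4)
#define FLAG_LOWER  (1u << 5)
#define FLAG_DIGIT  (1u << 6)

/* A few bidi classes with arbitrary, illustrative codes. */
enum bidi_class {
    BIDI_L  = 0,   /* left-to-right   */
    BIDI_R  = 1,   /* right-to-left   */
    BIDI_EN = 2,   /* European number */
    BIDI_AN = 3    /* Arabic number   */
    /* remaining classes omitted from this sketch */
};

/* Combine a bidi class and a set of flags into one record. */
static prop_record make_record(unsigned bidi, unsigned flags)
{
    assert(bidi <= BIDI_MASK);
    assert((flags & BIDI_MASK) == 0);   /* flags must not overlap the bidi field */
    return (prop_record)(bidi | flags);
}

/* Extract the multi-valued field with a mask. */
static unsigned record_bidi(prop_record r)
{
    return r & BIDI_MASK;
}

/* Test a single on/off flag. */
static int record_has(prop_record r, unsigned flag)
{
    return (r & flag) != 0;
}
```

A variable-width property such as a decomposition cannot be packed this way; a record of this kind would more plausibly hold an index or offset into a separate data pool.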

This paper will describe a method of storing Unicode property data and character mapping tables in a way that is efficient for both size and speed. It will also discuss some of the decisions that must be made before creating such tables. A program for creating optional mapping tables will also be made available.
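This abstract does not say which data structure the paper uses. As one common way to balance size against speed, the sketch below shows a two-stage table for the 16-bit range: the high byte of a code point selects a block, identical blocks are stored only once, and a lookup costs two array reads. The names, the block size of 256 and the toy property function are assumptions made for this example, not the author's design.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 256
#define NUM_BLOCKS 256                 /* covers U+0000..U+FFFF */

static uint8_t stage1[NUM_BLOCKS];                 /* block number -> unique block */
static uint8_t stage2[NUM_BLOCKS][BLOCK_SIZE];     /* unique blocks, first-come order */
static int stage2_count;                           /* number of unique blocks in use */

/* Toy property source, an assumption for this demo: 1 for ASCII digits
   and for the Arabic-Indic digits U+0660..U+0669, 0 otherwise. */
static uint8_t toy_property(uint32_t cp)
{
    if (cp >= 0x0030 && cp <= 0x0039) return 1;
    if (cp >= 0x0660 && cp <= 0x0669) return 1;
    return 0;
}

/* Build both stages, sharing any block whose contents already exist. */
static void build_tables(void)
{
    uint8_t block[BLOCK_SIZE];
    int b, i, j;

    stage2_count = 0;
    for (b = 0; b < NUM_BLOCKS; b++) {
        for (i = 0; i < BLOCK_SIZE; i++)
            block[i] = toy_property((uint32_t)b * BLOCK_SIZE + (uint32_t)i);

        /* Linear search for an identical block already stored. */
        for (j = 0; j < stage2_count; j++)
            if (memcmp(stage2[j], block, BLOCK_SIZE) == 0)
                break;

        if (j == stage2_count) {       /* not found: store a new unique block */
            memcpy(stage2[stage2_count], block, BLOCK_SIZE);
            stage2_count++;
        }
        stage1[b] = (uint8_t)j;
    }
}

/* Two array reads per lookup, no branching on the code point value. */
static uint8_t lookup(uint32_t cp)
{
    assert(cp <= 0xFFFF);
    return stage2[stage1[cp >> 8]][cp & 0xFF];
}
```

With this toy data only three unique blocks survive, so a generator run offline would need to emit 256 bytes for the first stage plus 3 × 256 bytes for the second, instead of 65,536 bytes for a flat table. The demo reserves the full second-stage array only because it builds the tables at run time.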

International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

16 March 2000, Webmaster