RE: PDUTR #27: Unicode 3.1

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 22 2001 - 14:34:06 EST


Peter Constable replied to an inquiry from Mike Lischke:

> >I'm currently extending my Unicode library for Delphi, and I wonder how
> >to deal with all those new characters which are > 64K. In particular,
> >the character properties are causing me headaches because of the memory
> >requirements. Using a linear search through the categories and a binary
> >search through all the ranges in that category is one solution that
> >minimizes the memory footprint, but it is also very slow. Do you know of
> >a better implementation that I could look through to learn a better way?
>
> These are the kinds of issues Mark Davis discussed in his "Bits of Unicode"
> presentation at the last Unicode conference (IUC 17). See
> http://www.unicode.org/iuc/iuc17/papers.html and look for session B3.

I agree that Mark Davis' discussion covers many of the tricks to make things
small *and* fast when dealing with Unicode tables.
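For concreteness, the scheme Mike describes -- minimal memory, but a
search on every lookup -- would amount to something like the following
C sketch; the record layout and names are invented for illustration:

    #include <stddef.h>

    /* Hypothetical range record: a run of code points sharing one
       property value. A real table would be generated from the
       Unicode Character Database. */
    typedef struct {
        unsigned long first;     /* first code point in the range */
        unsigned long last;      /* last code point in the range  */
        int           category;  /* the shared property value     */
    } CharRange;

    /* Binary search over a range table sorted by first code point:
       memory-lean, but O(log n) probes on every single lookup. */
    static int lookup_category(const CharRange *table, size_t n,
                               unsigned long cp)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (cp < table[mid].first)
                hi = mid;
            else if (cp > table[mid].last)
                lo = mid + 1;
            else
                return table[mid].category;
        }
        return -1;  /* not in any range */
    }

That logarithmic search on every character is exactly the speed
problem Mike is running into.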

However, you can start out with relatively simple approaches and still
get excellent performance in both memory and speed. For example, I
recently extended my own Sybase Unicode library implementation to
cover the entire Unicode 3.1 repertoire (44,946 new characters). For
the character properties table, which tracks 36 distinct character
and/or code point properties along with all the bidi values, extending
the table to cover all 44,946 new supplementary characters (off the
BMP) took exactly 2711 four-byte ints, adding 10,844 bytes to the table.

That's 10,844 bytes to track 55 properties for 44,946 characters.

I do this with a simple 10-bit/10-bit two-stage trie of variable-size
array table chunks. Nothing about it is very complex, hard to program,
or hard to maintain, and the access is lightning fast. Accessing a
character property for a supplementary character expressed in UTF-16
via a 10-bit/10-bit trie optimized for surrogate pairs is almost as
fast as looking up the property for a BMP character. Since all the
properties are stored in two-stage tries, both lookups involve two
bit-masking operations and two table-lookup steps; for supplementary
characters, you simply mask on the two surrogate code points rather
than on two bit fields of the BMP code point. The only extra step is
recognizing that you are dealing with a surrogate pair in the first
place -- and you would have to do that for *any* lookup scheme you
envision.
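In other words, the lookup amounts to something like the following C
sketch. The table shapes and the 8/8 split for the BMP path are
assumptions for illustration; real tables would be generated from the
Unicode Character Database:

    #include <stdint.h>

    /* Illustrative, zero-filled table shapes only; a real build
       generates the contents from the UCD. */
    static const uint16_t bmp_stage1[256];       /* indexed by cp >> 8   */
    static const uint32_t bmp_stage2[1][256];    /* indexed by cp & 0xFF */
    static const uint16_t supp_stage1[1024];     /* high-surrogate bits  */
    static const uint32_t supp_stage2[1][1024];  /* low-surrogate bits   */

    /* Property lookup on a UTF-16 string. Both paths cost two masks
       and two indexed loads; the supplementary path differs only in
       detecting the surrogate pair first. (Validation of the trailing
       low surrogate is elided here.) */
    uint32_t get_props(const uint16_t *s)
    {
        uint16_t u = s[0];
        if (u >= 0xD800 && u <= 0xDBFF) {
            uint16_t hi = u    & 0x3FF;   /* 10 bits -> stage-1 index */
            uint16_t lo = s[1] & 0x3FF;   /* 10 bits -> stage-2 index */
            return supp_stage2[supp_stage1[hi]][lo];
        }
        return bmp_stage2[bmp_stage1[u >> 8]][u & 0xFF];
    }

For example, U+10400 arrives in UTF-16 as the pair D801 DC00; masking
each unit with 0x3FF yields the two trie indices directly, with no
arithmetic needed to reassemble the 21-bit code point.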

I'm sure the dedicated bit-twiddlers could easily improve my table
size by a factor of two, since even at that size there is a fair
amount of redundancy in the table. But my own design preference is to
keep the design simple and the table access simple and fast, and the
10K delta in table size is chump change in the context of the kind of
software I am supporting these days.
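For what it's worth, that redundancy is typically squeezed out at
build time by sharing identical second-stage chunks: many ranges of
code points carry the same property pattern, so one stored chunk can
serve many stage-1 slots. A minimal C sketch of the idea, with
fixed-size chunks and hypothetical capacities (my actual tables use
variable-size chunks):

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 1024   /* one second-stage chunk (hypothetical) */
    #define MAX_BLOCKS 256    /* capacity of the pool (hypothetical)   */

    static uint32_t blocks[MAX_BLOCKS][BLOCK_SIZE];
    static int      nblocks = 0;

    /* Return the index of an identical existing chunk, or append the
       new one. Stage-1 slots then store these indices, so duplicate
       chunks are stored exactly once. */
    static int intern_block(const uint32_t block[BLOCK_SIZE])
    {
        for (int i = 0; i < nblocks; i++)
            if (memcmp(blocks[i], block, sizeof blocks[0]) == 0)
                return i;
        if (nblocks == MAX_BLOCKS)
            return -1;  /* pool exhausted; a real build would grow it */
        memcpy(blocks[nblocks], block, sizeof blocks[0]);
        return nblocks++;
    }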

--Ken


