Case Table Compression Assumptions (was: RE: Posting Links to Ballots (was: RE: Why blackletter letters?))

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Fri, 13 Sep 2013 22:41:37 +0000

Steffen,

FYI, Unicode 7.0, when it comes out, will have another entire
bicameral (casing) script added to it: Warang Citi. And when
Old Hungarian is finally published, at some point after Unicode 7.0,
that will be *another* bicameral script added. It is unlikely that those
two will be the last. And those are in addition to the continual trickle
of case pairs to already existing bicameral scripts like Latin and
Cyrillic.

It is a false economy for a general Unicode library implementation
to be overly clever about how it compresses tables, such as casing
tables. That approach can get you into trouble when something else is
added to the standard which breaks your initial assumptions.

If you want to do this kind of thing, my suggestion would be
instead to do a two-step process: first implement a general
table which can always be easily updated based on new additions
to UnicodeData.txt (and/or SpecialCasing.txt and CaseFolding.txt,
depending on what kind of case tables you are implementing),
and which doesn't worry too much about table size. Then
write a *separate* optimization step which can compress
your generic table format into a more compact format.
If you do it that way, your adaptation to future additions to
the standard can be much more robust, while still optimizing
for minimal table size.
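
A minimal sketch of that two-step split, in Python, might look like the
following. It assumes only simple lowercase mappings are wanted (field 13
of a UnicodeData.txt line). The names parse_simple_lowercase, compress,
lookup and SAMPLE are hypothetical, and merging runs of equal offsets is
just one possible compaction, chosen for illustration; it is not the
format discussed in this thread.

```python
from bisect import bisect_right

# A few lines in UnicodeData.txt format (15 semicolon-separated fields).
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;",
    "0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "0410;CYRILLIC CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0430;",
]


def parse_simple_lowercase(lines):
    """Step 1: generic table {code point: simple lowercase mapping}.

    No assumptions about ranges or scripts; rerun on each new
    UnicodeData.txt and the table simply grows.
    """
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 15 or not fields[13]:
            continue  # malformed line, or no simple lowercase mapping
        table[int(fields[0], 16)] = int(fields[13], 16)
    return table


def compress(table):
    """Step 2: separate optimizer, derived only from the generic table.

    Merges consecutive code points that share the same offset into
    (first, last, delta) runs, sorted by first code point.
    """
    runs = []
    for cp in sorted(table):
        delta = table[cp] - cp
        if runs and runs[-1][1] == cp - 1 and runs[-1][2] == delta:
            runs[-1] = (runs[-1][0], cp, delta)  # extend the current run
        else:
            runs.append((cp, cp, delta))
    return runs


def lookup(runs, cp):
    """Binary search over the runs; unmapped code points map to themselves."""
    i = bisect_right(runs, (cp, float("inf"), 0)) - 1
    if i >= 0 and runs[i][0] <= cp <= runs[i][1]:
        return cp + runs[i][2]
    return cp


table = parse_simple_lowercase(SAMPLE)
runs = compress(table)
# The compact form can always be checked against the generic table.
assert all(lookup(runs, cp) == low for cp, low in table.items())
```

Because the compact form is regenerated from the generic table and
checked against it, a new bicameral script only changes the input data,
not the code.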

--Ken

>
>
> I have been able to compress all lower-, upper- and titlecase
> mappings, simple and extended (no conditions yet) of Unicode 6.2
> into a 260 entry binary search array.
> I'm not with this project at the moment, but looking at the
> alloc/Pipeline.html it *could* be that those few characters alone
> will add maybe 10 (sorry..) more slots, if the presence of SMALL
> or CAPITAL indicates they'll be Lt/Lu/Ll or will have an entry in
> `SpecialCasing.txt'.
> I hope that this wonderful thing that is the UCS will not become
> blurred -- memory size is still a concern for some people.
> (Reading how the process works doesn't give a lot of hope, yet
> that is what came to my mind.)
> Ciao,
>
> --steffen
Received on Fri Sep 13 2013 - 17:44:27 CDT
