Re: OT: programming question

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 24 2007 - 19:01:02 CST

    Oliver Block asked:

    > I've started to read (some chapters of) the Unicode standard last year. I was
    > just curious what is an appropriate way to implement extensive amount of data
    > like character properties.

    The usual ways are by various schemes that compress tables, while
    still giving good lookup speed. One widely used strategy is the
    use of tries:

    http://en.wikipedia.org/wiki/Trie

    In the case of Unicode character properties, the intermediate node keys are
    not strings, but instead tend to be bit partitions of the
    code point values. For example, character properties for
    the 64K code points 0..FFFF can be efficiently accessed by
    dividing the 16 bit values into 8 high bits and 8 low bits,
    and then compressing the parts of the lookup where property
    values are shared for many of the terminal nodes of the resulting table.
    For characters in the range 10000..FFFFF, a different bit
    partition might work better -- for example 10 high bits and
    10 low bits.
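
A minimal sketch of the 8/8 split described above, assuming a two-stage table in which identical 256-entry blocks are stored only once. The names `build_two_stage` and `lookup`, and the toy "is an ASCII digit" property, are made up for illustration and do not come from any particular library.

```python
BLOCK_BITS = 8                      # 8 low bits select an entry within a block
BLOCK_SIZE = 1 << BLOCK_BITS        # 256 entries per block
BLOCK_MASK = BLOCK_SIZE - 1

def build_two_stage(values):
    """Split a flat property table into an index plus a list of unique blocks.

    `values` is a flat list whose length is a multiple of BLOCK_SIZE.
    """
    index = []    # stage 1: high bits -> block number
    blocks = []   # stage 2: each distinct block, stored once
    seen = {}     # block contents -> block number
    for start in range(0, len(values), BLOCK_SIZE):
        block = tuple(values[start:start + BLOCK_SIZE])
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks

def lookup(index, blocks, cp):
    """Two array accesses: high 8 bits pick the block, low 8 bits the entry."""
    return blocks[index[cp >> BLOCK_BITS]][cp & BLOCK_MASK]

# Toy property (an assumption for the example): 1 for ASCII digits, else 0.
flat = [0] * 0x10000
for cp in range(0x30, 0x3A):
    flat[cp] = 1

index, blocks = build_two_stage(flat)
```

With this toy data, 255 of the 256 blocks are identical, so only two distinct blocks are stored. Real property data shares less, but the same idea applies.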

    Another common strategy is using bit arrays, and compressing them
    with techniques that drop homogeneous ranges of values.
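
One way to drop homogeneous ranges from a bit array is to keep only the positions where the bit value changes, a structure often called an inversion list. This is offered as one possible reading of the remark above, not necessarily the technique the author had in mind. The function names and the toy "ASCII letter" property are assumptions for the example.

```python
from bisect import bisect_right

def to_inversion_list(bits):
    """Return the sorted positions at which the bit value flips.

    The array is assumed to start in the 0 state, so the first boundary
    marks the start of the first run of 1s.
    """
    boundaries = []
    previous = 0
    for i, bit in enumerate(bits):
        if bit != previous:
            boundaries.append(i)
            previous = bit
    return boundaries

def contains(boundaries, cp):
    """True if the bit for `cp` is 1.

    An odd number of boundaries at or below `cp` means `cp` lies
    inside a run of 1s.
    """
    return bisect_right(boundaries, cp) % 2 == 1

# Toy binary property (an assumption for the example): ASCII letters.
bits = [0] * 0x10000
for cp in range(0x41, 0x5B):
    bits[cp] = 1
for cp in range(0x61, 0x7B):
    bits[cp] = 1

inv = to_inversion_list(bits)
```

Here the 65,536-entry bit array collapses to four boundary values, and a lookup is a binary search rather than a direct index. The trade-off is slower access in exchange for much smaller storage when runs are long.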

    > In fact the data that need to be stored for timezones is quite extensive, too.

    For time zones, you are talking about the kinds of data which are
    *not* part of the Unicode Standard and the Unicode Character
    Database, but instead are part of all the localization data needed
    to support programs running in different languages, locales,
    time zones, and such. The Unicode Consortium maintains a
    separate standard and a repository of locale data. See the
    Common Locale Data Repository (CLDR):

    http://www.unicode.org/cldr/

    There is also a separate email discussion list for discussing
    issues of locales (including time zones):

    cldr-users@unicode.org

    See:

    http://www.unicode.org/consortium/distlist.html

    for information about that email discussion list and how to
    join it.

    --Ken

     
    > - That's the reason why I asked.



    This archive was generated by hypermail 2.1.5 : Wed Jan 24 2007 - 19:03:41 CST