Re: Ternary search trees for Unicode dictionaries

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Nov 19 2003 - 19:14:27 EST

  • Next message: Markus Scherer: "Re: creating a test font w/ CJKV Extension B characters."

    Philippe Verdy wrote:
    > From: "Mark Davis" <mark.davis@jtcsv.com>
    >
    >>Sorry, I read your message too quickly; you are right -- tries are for
    >> character
    >>lookup, not strings. Tom Emerson of Basis gave a very nice talk on data
    >>structures for string lookup at the last UTC; you should contact him.
    >
    > I see abosolutely no reason why a trie cannot be used for string lookup.

    Confusion of language.

    Generally, tries perform a lookup on consecutive units of an input string of some sort. As such,
    they can be designed for variable-length input, e.g. for dicionaries.

    Mark is (and I am) most used to working with tries that are optimized for single Unicode code points
    as input. A 21-bit code point may, for example, be segmented into a "string" of constant length 3,
    with the first unit from bits 20..10 of the code point, the second unit from bits 9..4, the 3rd from
    3..0. This "string" is then looked up. Further, each unit is used to directly access an array
    instead of performing a search, for best performance. Of course, these particular, specialized tries
    are not useful for anything but looking up Unicode code points.

    markus



    This archive was generated by hypermail 2.1.5 : Wed Nov 19 2003 - 20:21:47 EST