Re: Ternary search trees for Unicode dictionaries

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Nov 18 2003 - 12:25:03 EST

  • Next message: Mark Davis: "Re: Ternary search trees for Unicode dictionaries"

    Philippe Verdy wrote:
    > Unicode is not an issue in this approach, but most of the referenced
    > algorithms where based on 8-bit bytes, so for them UTF-8 is a good choice,

    Either UTF-8 or UTF-16 would probably work, depending on the rest of the application. UTF-16 may be
    better because it fits with many Unicode libraries and APIs that work with 16-bit Unicode strings.

    > but SCSU-8 could be used as well, provided that it does not break too much
    > the redundancy of substrings for efficient detection and storage of
    > non-prefix common substrings.

    There is no such thing as SCSU-8.

    SCSU is probably not an option because it is not a deterministic encoding. A small change in the
    encoder will result in a very different dictionary structure, and/or a failure to find a word. In
    order to use SCSU, you would have to write, control and stabilize your own encoder.

    BOCU-1 would work if you always start looking up words from their beginning, not subwords from the
    middle of a word.

    UTF-16 is probably best.

    > An issue with SCSU-8 or even UTF-8 is that it still allows performing
    > compares within the search/browse algorithm on something not better than
    > just the internal binary ordering of encoded sequences (which is really
    > distinct from the normal collation order you'd find in a classic
    > dictionnary).

    If you need linguistically correct comparisons according to some sort order, then you could use the
    collation elements, or parts of them, for a word instead of the characters.
    http://oss.software.ibm.com/icu/userguide/Collate_ServiceArchitecture.html#Collation_Element_Iterator

    markus



    This archive was generated by hypermail 2.1.5 : Tue Nov 18 2003 - 13:29:58 EST