Re: Ternary search trees for Unicode dictionaries

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Nov 18 2003 - 12:25:03 EST

Next message: Mark Davis: "Re: Ternary search trees for Unicode dictionaries"

Previous message: Markus Scherer: "Re: Ternary search trees for Unicode dictionaries"
In reply to: Philippe Verdy: "Re: Ternary search trees for Unicode dictionaries"
Next in thread: Theodore H. Smith: "Re: Ternary search trees for Unicode dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Philippe Verdy wrote:
> Unicode is not an issue in this approach, but most of the referenced
> algorithms where based on 8-bit bytes, so for them UTF-8 is a good choice,

Either UTF-8 or UTF-16 would probably work, depending on the rest of the application. UTF-16 may be
better because it fits with many Unicode libraries and APIs that work with 16-bit Unicode strings.

> but SCSU-8 could be used as well, provided that it does not break too much
> the redundancy of substrings for efficient detection and storage of
> non-prefix common substrings.

There is no such thing as SCSU-8.

SCSU is probably not an option because it is not a deterministic encoding. A small change in the
encoder will result in a very different dictionary structure, and/or a failure to find a word. In
order to use SCSU, you would have to write, control and stabilize your own encoder.

BOCU-1 would work if you always start looking up words from their beginning, not subwords from the
middle of a word.

UTF-16 is probably best.

> An issue with SCSU-8 or even UTF-8 is that it still allows performing
> compares within the search/browse algorithm on something not better than
> just the internal binary ordering of encoded sequences (which is really
> distinct from the normal collation order you'd find in a classic
> dictionnary).

If you need linguistically correct comparisons according to some sort order, then you could use the
collation elements, or parts of them, for a word instead of the characters.
http://oss.software.ibm.com/icu/userguide/Collate_ServiceArchitecture.html#Collation_Element_Iterator

markus

Next message: Mark Davis: "Re: Ternary search trees for Unicode dictionaries"
Previous message: Markus Scherer: "Re: Ternary search trees for Unicode dictionaries"
In reply to: Philippe Verdy: "Re: Ternary search trees for Unicode dictionaries"
Next in thread: Theodore H. Smith: "Re: Ternary search trees for Unicode dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Tue Nov 18 2003 - 13:29:58 EST