Re: Unicode dictionary coding? UTF8, UTF32, etc

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Fri Nov 14 2003 - 19:24:53 EST


    Theodore H. Smith wrote:
    > Can someone give me some advice? If I was to write a dictionary class
    > for Unicode, would I be better off writing it using a b-tree, or
    > hash-bin system? Or maybe an array of pointers to arrays system?

    See John's reply. A trie of some sort should work well. I believe there was a paper on this at one
    of the last two Unicode conferences.
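
    As a rough illustration of the trie idea, here is a minimal sketch in Python. It is not taken
    from ICU or from the conference paper; the names TrieNode, Trie, and add are made up for this
    example, and it keys each level on one code point.

```python
class TrieNode:
    """One node per stored prefix; children are keyed by a single code point."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # maps one character to the next TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    """Hypothetical set-like dictionary of words."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:  # iterating a Python str yields code points
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


words = Trie()
words.add("car")
words.add("cart")
```

    Lookup cost depends on the length of the word rather than on the number of words stored, and
    words that share a prefix share nodes.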

    > I suppose, that if I wanted an array of pointers to arrays, that I
    > couldn't use UTF32, I could only use UTF8, right? Are there any
    > advantages I could make of UTF8 having "disallowed" character ranges,
    > when writing some dictionary code?

    The choice of UTF should not matter that much, except that most software that handles Unicode
    intelligently uses 16-bit Unicode (UTF-16). The general organization of your data structure is
    probably more important for the question at hand.
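
    To make the point about UTFs concrete, the sketch below (Python, with a hypothetical helper
    named code_units) shows that the same word is just a different sequence of integer keys in each
    encoding form. A tree or trie keyed on such sequences works the same way for any of them; only
    the number of levels and the range of key values change.

```python
def code_units(word, utf):
    """Return the code units of `word` as a tuple of ints.

    `utf` is one of "utf-8", "utf-16", "utf-32" (illustrative helper only).
    """
    width = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[utf]
    # The "-le" codecs emit no byte order mark; UTF-8 has no byte order.
    codec = utf if width == 1 else utf + "-le"
    data = word.encode(codec)
    return tuple(
        int.from_bytes(data[i:i + width], "little")
        for i in range(0, len(data), width)
    )


thai = "\u0e44\u0e17\u0e22"      # three Thai letters from the BMP
supplementary = "\U00010400"     # one code point outside the BMP

utf8_keys = code_units(thai, "utf-8")     # 9 byte-sized keys
utf16_keys = code_units(thai, "utf-16")   # 3 keys
utf32_keys = code_units(thai, "utf-32")   # 3 keys, one per code point
surrogates = code_units(supplementary, "utf-16")  # a surrogate pair, 2 keys
```

    On the "disallowed" ranges: in well-formed UTF-8 the byte values 0xC0, 0xC1, and 0xF5 through
    0xFF never occur, so a child table indexed by byte would have a few slots that are never used.
    Whether that is worth exploiting is a separate question.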

    > Is there some kind of place on the internet I should look up about
    > Unicode dictionaries? I'm assuming that this matter has been dealt with
    > in ICU already...

    Well, ICU has a dictionary-based break iterator for Thai, but I am not sure that we have good
    documentation on the dictionary structure. It's fairly old code, and we are thinking about updating
    it to use some sort of trie, so that it handles larger sets of characters better and returns not
    just a boolean result ("known word") but also associated data (such as an indication of the word's
    relative frequency of use). Mark may have links on this topic.
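
    For illustration only, here is a sketch of a trie that stores a value with each word instead of
    a boolean, plus the longest-match lookup that a dictionary-based breaker typically needs. This
    is not ICU's structure; FreqTrie, put, and longest_match are hypothetical names, and the counts
    are invented sample data.

```python
class FreqTrie:
    """Trie node that maps words to an associated value (here, a count)."""

    def __init__(self):
        self.children = {}  # maps one character to the next FreqTrie node
        self.value = None   # None means no dictionary word ends here

    def put(self, word, value):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, FreqTrie())
        node.value = value

    def longest_match(self, text, start=0):
        """Return (end, value) for the longest word at text[start:], or None."""
        node = self
        best = None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break  # no stored word continues with this character
            if node.value is not None:
                best = (i + 1, node.value)  # remember the longest word so far
        return best


# Invented sample data: word -> relative frequency count.
lexicon = FreqTrie()
lexicon.put("in", 900)
lexicon.put("inter", 20)
lexicon.put("internet", 50)

match = lexicon.longest_match("internetwork")
```

    A breaker could use the returned value to choose among competing segmentations rather than
    always taking the longest word.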

    markus


