Re: Unicode dictionary coding? UTF8, UTF32, etc

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Fri Nov 14 2003 - 19:24:53 EST

Next message: John Cowan: "Re: compatibility characters (in XML context)"

Previous message: Patrick Andries: "How can I input any Unicode character if I know its hexadecimal code?"
In reply to: Theodore H. Smith: "Unicode dictionary coding? UTF8, UTF32, etc"

Theodore H. Smith wrote:
> Can someone give me some advice? If I was to write a dictionary class
> for Unicode, would I be better off writing it using a b-tree, or
> hash-bin system? Or maybe an array of pointers to arrays system?

See John's reply. Tries of some sort should work well. I think there was a paper on this at one of
the last two Unicode conferences.
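
For readers unfamiliar with tries, here is a minimal sketch of one keyed on Unicode code points. It
is only an illustration of the general idea, not the structure from John's reply or from any
conference paper; the names TrieNode, Trie, add and has_prefix are hypothetical.

```python
class TrieNode:
    """One node per code point along a path from the root."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # maps a single code point (1-char str) to a child node
        self.is_word = False  # True if some stored word ends at this node


class Trie:
    """Hypothetical set-of-words dictionary; not taken from ICU or any library."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:  # Python iterates a str by code point
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, text):
        """Follow text from the root; return the final node or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


words = Trie()
words.add("สวัสดี")
print("สวัสดี" in words)         # a stored word
print("สวัส" in words)           # only a prefix of a stored word
print(words.has_prefix("สวัส"))
```

A dict per node keeps the sketch short. A production structure would more likely use compact
arrays, which is where the choice of code unit size starts to matter.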

> I suppose, that if I wanted an array of pointers to arrays, that I
> couldn't use UTF32, I could only use UTF8, right? Are there any
> advantages I could make of UTF8 having "disallowed" character ranges,
> when writing some dictionary code?

The choice of UTF should not matter that much, except that most software that handles Unicode
intelligently uses 16-bit Unicode (UTF-16). Probably more important for the question at hand is the
general organization of your data structure.
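
To make the point about UTFs concrete, the sketch below lists the code units that the same two
characters produce in UTF-8, UTF-16 and UTF-32. A dictionary can be keyed on any of these
sequences; only the number of steps per character and the range of values per step change. The
helper name code_units is hypothetical.

```python
def code_units(text, encoding):
    """Return the code units of text in the given UTF as a list of ints."""
    # The "-le" codec names avoid a byte order mark in the output.
    width = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + width], "little")
            for i in range(0, len(data), width)]


# U+0E01 THAI CHARACTER KO KAI followed by U+1D11E MUSICAL SYMBOL G CLEF
sample = "\u0e01\U0001d11e"

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, [hex(u) for u in code_units(sample, enc)])
# utf-8     ['0xe0', '0xb8', '0x81', '0xf0', '0x9d', '0x84', '0x9e']
# utf-16-le ['0xe01', '0xd834', '0xdd1e']
# utf-32-le ['0xe01', '0x1d11e']
```

With an "array of pointers to arrays" layout, a node indexed directly by code unit needs 256 slots
for UTF-8 and 65536 for UTF-16, while direct indexing by UTF-32 value is impractical, so some
sparse or multi-stage lookup is needed there.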

> Is there some kind of place on the internet I should look up about
> Unicode dictionaries? I'm assuming that this matter has been dealt with
> in ICU already...

Well, ICU has a dictionary-based break iterator for Thai, but I am not sure that we have good
documentation on the dictionary structure. It's fairly old code, and we are thinking about updating
it to use some sort of trie. That should let it handle larger sets of characters better, and return
not just a boolean result ("known word") but also associated data (like an indication of the
relative frequency of use of the word). Mark may have links on this topic.
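
As a rough sketch of what "a trie with associated data" could look like, the code below stores a
value with each word and finds the longest dictionary word starting at a given position, which is
the basic operation a dictionary-based break iterator needs. This is not ICU's code or its planned
design; the names ValueTrie, put, get and longest_match are hypothetical, and the frequency numbers
are made up for illustration.

```python
class Node:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}  # code point (1-char str) -> Node
        self.value = None   # None means no word ends at this node


class ValueTrie:
    """Hypothetical trie mapping words to data such as a relative frequency."""

    def __init__(self):
        self.root = Node()

    def put(self, word, value):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.value = value

    def get(self, word, default=None):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return default
        return default if node.value is None else node.value

    def longest_match(self, text, start=0):
        """Return (end_index, value) of the longest word at text[start:], or None."""
        node = self.root
        best = None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.value is not None:
                best = (i + 1, node.value)  # remember the longest word seen so far
        return best


freq = ValueTrie()
freq.put("ไท", 5)     # illustrative frequency values, not real data
freq.put("ไทย", 120)

print(freq.get("ไทย"))
print(freq.longest_match("ไทยแลนด์"))
```

A real segmenter would need more than greedy longest match, for example using the stored
frequencies to choose between competing segmentations.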

markus

This archive was generated by hypermail 2.1.5 : Fri Nov 14 2003 - 20:23:57 EST