Re: Finite state machines? UTF8: toFold(), normalisation, etc

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue May 06 2003 - 12:28:07 EDT

    Theodore H. Smith wrote:
    > I'm unfamiliar with "trie".

    Here are some pointers into the ICU source code. You can find these files in the download or via
    WebCVS; see
    http://oss.software.ibm.com/icu/download/ and http://oss.software.ibm.com/icu/develop/cvs.html
    For WebCVS, just append the pathnames below to http://oss.software.ibm.com/cvs/icu/~checkout~/icu/

    ICU uses an internal API, "UTrie", to store several of its data structures. See
    source/common/utrie.h and utrie.c. Note that this is _internal_ because it is not easy to use. The
    hard part is understanding the "folding function" that you need to provide; we have an RFE to add
    default folding functions. If you want to use it, the best approach is to study how it is used
    across ICU. The presentations that others pointed to also explain it a bit.
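
    Since the question was what a "trie" is, here is a minimal sketch of the general idea in Python:
    a two-stage lookup table from code point to a small property value, where identical blocks are
    stored only once. This is not ICU's UTrie code or its data layout. The names build_trie, lookup
    and toy_property, the block size of 32, and the toy property itself are all made up for
    illustration. The sketch indexes directly by code point, so it needs no folding function; that
    part of UTrie is not modeled here.

```python
SHIFT = 5                  # assumed block size: 2**5 = 32 code points per block
BLOCK = 1 << SHIFT
MASK = BLOCK - 1
MAX_CP = 0x10FFFF          # last Unicode code point


def toy_property(cp):
    """Hypothetical property: 1 = ASCII letter, 2 = ASCII digit, 0 = other."""
    if 0x30 <= cp <= 0x39:
        return 2
    if 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A:
        return 1
    return 0


def build_trie(value_of):
    """Build (index, data) from a function mapping code point -> value."""
    index = []             # stage 1: block number -> offset of that block in data
    data = []              # stage 2: the unique blocks, concatenated
    seen = {}              # block contents -> offset, used to share equal blocks
    for start in range(0, MAX_CP + 1, BLOCK):
        block = tuple(value_of(cp) for cp in range(start, start + BLOCK))
        offset = seen.get(block)
        if offset is None:
            offset = len(data)
            seen[block] = offset
            data.extend(block)
        index.append(offset)
    return index, data


def lookup(trie, cp):
    """Two array reads: the high bits pick the block, the low bits pick the entry."""
    index, data = trie
    return data[index[cp >> SHIFT] + (cp & MASK)]


trie = build_trie(toy_property)
```

    A lookup such as lookup(trie, ord("A")) costs two array accesses regardless of the code point.
    Because most blocks in this toy example are identical (all zero), the data array stays tiny
    compared with a flat table of 0x110000 entries; real property data compresses less dramatically.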

    The "ICU Data" chapter of our User Guide ends with a table that points to the descriptions of the
    binary data formats we use to store character properties, normalization data, etc.
    See http://oss.software.ibm.com/icu/userguide/icudata.html

    The character properties APIs are implemented in source/common/uchar.c and uprops.c.

    Note that you can build ICU with many features turned off to reduce the library size, and build the
    data library with many or most items omitted. Even so, it will still be larger than 54 kB...

    Hope this helps,
    markus
