Re: Ternary search trees for Unicode dictionaries

From: Doug Ewell (dewell@adelphia.net)
Date: Fri Nov 21 2003 - 02:49:46 EST

    Jungshik Shin <jshin at mailaps dot org> wrote:

    >> In my experience, SCSU usually does perform somewhat better than
    >> BOCU-1, but for some scripts (e.g. Korean) the opposite often seems
    >> to be true.
    >
    > Just out of curiosity, which NF did you use for your uncompressed
    > source Korean text, NFC or NFD when you got the above result?
    > I guess I'll know in a week or so when your paper is out, but...

    It was actually Steven Atkin's and Ryan Stansifer's test, not mine,
    although I did reproduce their results. The file they used, called
    "arirang.txt," contains over 3.3 million Unicode characters and was
    apparently once part of their "Florida Tech Corpus of Multi-Lingual
    Text" but subsequently deleted for reasons not known to me. I can
    supply it if you're interested.

    The file is all in syllables, not jamos, which I guess means it's in
    NFC.
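    For anyone who wants to check a file's normalization form directly
    rather than guess from its appearance, here is a minimal sketch using
    Python's standard unicodedata module. The function name describe() and
    the three-syllable sample string are illustrative assumptions, not
    taken from arirang.txt.

```python
import unicodedata

def describe(text: str) -> dict:
    """Report whether text is already NFC or NFD, and how big each form is."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return {
        "is_nfc": text == nfc,   # precomposed Hangul syllables
        "is_nfd": text == nfd,   # decomposed conjoining jamos
        "nfc_code_points": len(nfc),
        "nfd_code_points": len(nfd),
        "nfc_utf8_bytes": len(nfc.encode("utf-8")),
        "nfd_utf8_bytes": len(nfd.encode("utf-8")),
    }

# Hypothetical sample: three precomposed syllables (U+D55C U+AD6D U+C5B4).
sample = "\ud55c\uad6d\uc5b4"
for key, value in describe(sample).items():
    print(f"{key:16} {value}")
```

    Text stored as syllables compares equal to its own NFC form, while
    its NFD form expands each syllable into two or three jamos, so the
    choice of form changes the uncompressed size considerably.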

    The statistics on this file are as follows:

    Encoding               Size
    UTF-16                 6,634,430 bytes
    UTF-8                  7,637,601 bytes
    SCSU                   6,414,319 bytes
    BOCU-1                 5,897,258 bytes
    Legacy encoding (*)    5,477,432 bytes
        (*) KS C 5601, KS X 1001, or EUC-KR
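    To put these figures on a common scale, the sketch below expresses
    each size relative to UTF-16 and as bytes per UTF-16 code unit. The
    byte counts are the ones reported above; the code-unit count assumes
    the UTF-16 file has no byte order mark, which is my assumption and
    not something stated about the file.

```python
# Byte counts as reported for arirang.txt.
SIZES = {
    "UTF-16": 6_634_430,
    "UTF-8": 7_637_601,
    "SCSU": 6_414_319,
    "BOCU-1": 5_897_258,
    "Legacy": 5_477_432,
}

# Assumption: no BOM, so every 2 bytes of UTF-16 is one code unit.
CODE_UNITS = SIZES["UTF-16"] // 2

def summarize(sizes: dict, code_units: int) -> dict:
    """Map each encoding to (size relative to UTF-16, bytes per code unit)."""
    base = sizes["UTF-16"]
    return {name: (size / base, size / code_units) for name, size in sizes.items()}

for name, (ratio, per_unit) in summarize(SIZES, CODE_UNITS).items():
    print(f"{name:8} {ratio:7.1%} of UTF-16   {per_unit:.3f} bytes per code unit")
```

    On these numbers SCSU saves only a few percent over UTF-16, while
    BOCU-1 and the legacy encoding both land noticeably lower.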

    I used my own SCSU encoder to achieve these results, but it really
    wouldn't matter which was chosen -- Korean syllables can be encoded in
    SCSU *only* by using Unicode mode. It's not possible to set a window to
    the Korean syllable range. Only the large number of spaces and full
    stops in this file prevented SCSU from degenerating entirely to 2 bytes
    per character.
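    The window limitation can be illustrated with a short sketch. The
    offset table below is my recollection of the dynamic window
    positions in UTS #6 (each window spans 128 code points) and should
    be verified against the specification before relying on it; the
    function names are hypothetical. Extended windows are left out
    because they address only the supplementary planes.

```python
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3   # Hangul Syllables block
WINDOW_SIZE = 0x80                           # each window spans 128 code points

def dynamic_window_starts() -> list:
    """Start offsets selectable for an SCSU dynamic window (assumed table)."""
    starts = [x * 0x80 for x in range(0x01, 0x68)]             # U+0080..U+3380
    starts += [x * 0x80 + 0xAC00 for x in range(0x68, 0xA8)]   # U+E000..U+FF80
    starts += [0x00C0, 0x0250, 0x0370, 0x0530, 0x3040, 0x30A0, 0xFF60]
    return starts

def windows_overlapping(first: int, last: int) -> list:
    """Window starts whose 128-code-point span touches [first, last]."""
    return [s for s in dynamic_window_starts()
            if s <= last and s + WINDOW_SIZE - 1 >= first]

hangul = windows_overlapping(HANGUL_FIRST, HANGUL_LAST)
hiragana = windows_overlapping(0x3040, 0x309F)
print("windows reaching Hangul syllables:", [hex(s) for s in hangul])
print("windows reaching Hiragana:", [hex(s) for s in hiragana])
```

    Under that table no selectable window touches U+AC00..U+D7A3, so
    each syllable costs two bytes in Unicode mode, whereas a script such
    as Hiragana can be reached by more than one window position.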

    The creators of BOCU-1 (Davis and Scherer) also reported better
    performance on Korean text for BOCU-1 than for SCSU (this was actually
    the only script for which this could be said). They used the Korean
    "What is Unicode?" page, which is also written in syllables rather than
    jamos.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


