Re: Ternary search trees for Unicode dictionaries

From: Jungshik Shin
Date: Sat Nov 22 2003 - 21:50:44 EST


    On Thu, 20 Nov 2003, Doug Ewell wrote:

    > Jungshik Shin <jshin at mailaps dot org> wrote:
    > >> In my experience, SCSU usually does perform somewhat better than
    > >> BOCU-1, but for some scripts (e.g. Korean) the opposite often seems
    > >> to be true.
    > >
    > > Just out of curiosity, which NF did you use for your uncompressed
    > > source Korean text, NFC or NFD when you got the above result?
    > > I guess I'll know in a week or so when your paper is out, but...

    Thanks a lot for providing the details.

    > although I did reproduce their results. The file they used, called
    > "arirang.txt," contains over 3.3 million Unicode characters and was
    > apparently once part of their "Florida Tech Corpus of Multi-Lingual
    > Text" but subsequently deleted for reasons not known to me. I can
    > supply it if you're interested.

      It'd be great if you could.

    > The file is all in syllables, not jamos, which I guess means it's in
    > NFC.

      Yes, it's in NFC, then.

    > The statistics on this file are as follows:
    > UTF-16 6,634,430 bytes
    > UTF-8 7,637,601 bytes
    > SCSU 6,414,319 bytes
    > BOCU-1 5,897,258 bytes
    > Legacy encoding (*) 5,477,432 bytes
    > (*) KS C 5601, KS X 1001, or EUC-KR

      Sorry to pick on this (when I have to thank you). Even setting the
    coded character set vs. character encoding scheme distinction aside
    (that is, thinking just in terms of character repertoire), KS C 5601/KS
    X 1001 _alone_ cannot represent ordinary Korean text unless you're
    willing to live with double-width (full-width) spaces, Latin letters,
    numbers and punctuation. (Since you wrote the file apparently has full
    stops and spaces in ASCII, it does include characters outside KS X
    1001.) On the other hand, EUC-KR (KS X 1001 + ISO 646:KR/US-ASCII) can.
    Actually, I suspect the legacy encoding used was Windows code page 949
    (or Johab/Windows-1361?), because I can't imagine there is not a single
    syllable outside the character repertoire of KS X 1001 among over 2
    million syllables.

    > I used my own SCSU encoder to achieve these results, but it really
    > wouldn't matter which was chosen -- Korean syllables can be encoded in
    > SCSU *only* by using Unicode mode. It's not possible to set a window to
    > the Korean syllable range.

      Now that you've told me you used NFC, isn't this situation similar to
    Chinese text? How do BOCU-1 and SCSU perform on Chinese text? Japanese
    text might do slightly better thanks to Kana, but isn't likely to be
    much better.
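    The window limitation quoted above can be made concrete with a quick
    calculation (assuming the standard 128-code-point size of an SCSU
    dynamic window):

```python
# Why SCSU falls back to Unicode mode for precomposed Hangul: each
# dynamic window covers 128 consecutive code points, but the Hangul
# Syllables block (U+AC00..U+D7A3) is vastly larger than one window.
syllables = 0xD7A3 - 0xAC00 + 1   # size of the Hangul Syllables block
window = 128                      # code points per SCSU dynamic window
print(syllables)                  # 11172 precomposed syllables
print(-(-syllables // window))    # 88 windows would be needed to cover them
```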

    > Only the large number of spaces and full
    > stops in this file prevented SCSU from degenerating entirely to 2 bytes
    > per character.

       That's why I asked. What I'm curious about is how SCSU and BOCU-1
    applied to NFD (which is what Kent [2] and I think should have been the
    form used, possibly with a rearrangement of the Jamo block's code
    points to facilitate a smaller window size for SCSU) would compare with
    uncompressed UTF-16 of NFC (SCSU/BOCU-1 on NFC isn't much better than
    UTF-16). A back-of-the-envelope calculation gives me 2.5 ~ 3 bytes per
    syllable (without the code point rearrangement to put the jamo within a
    64-character window [1]), so it's still worse than UTF-16. However,
    that's not as bad as ~5 bytes (or more) per syllable in NFD without
    SCSU/BOCU-1. I have to confess that I have only a very cursory
    understanding of SCSU/BOCU-1.
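    The arithmetic behind that estimate can be checked with a short sketch
    (the sample word "아리랑" is mine, chosen to show both 2-jamo and
    3-jamo syllables):

```python
import unicodedata

# Under NFD each precomposed syllable becomes 2 jamo (no final consonant)
# or 3 jamo (with one).  At roughly 1 byte per jamo in a compact scheme,
# that yields the 2.5~3 bytes/syllable estimate, vs. a flat 2
# bytes/syllable for UTF-16 of NFC.
text = "아리랑"                          # 3 syllables; only the last has a final
nfd = unicodedata.normalize("NFD", text)
print(len(text))                         # 3 code points in NFC
print(len(nfd))                          # 7 jamo in NFD (2 + 2 + 3)
print(len(text.encode("utf-16-le")))     # 6 bytes as UTF-16 NFC
```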

     As for why I'm interested in NFD, one of the reasons is that it's
    easier to implement incremental search (among many other things) in
    terms of NFD. The only __editor__ I've ever seen that supports
    incremental search for Korean per letter (as opposed to per syllable)
    is Hemacs (Hangul Emacs). Not having per-letter incremental search in
    Korean is like not being able to match 'apples/apple/apply',
    'interesting/interested', and 'learned' with 'appl', 'interest', and
    'learn'.
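    A minimal sketch of why NFD makes per-letter matching easy (the helper
    name and sample words are mine, for illustration only):

```python
import unicodedata

# Per-letter (per-jamo) prefix matching: normalize both query and text to
# NFD so a partially typed syllable matches as a jamo-level prefix, the
# way 'interest' matches 'interesting'.
def jamo_startswith(word, prefix):
    """True if `word` begins with `prefix` at the jamo level."""
    return unicodedata.normalize("NFD", word).startswith(
        unicodedata.normalize("NFD", prefix))

# '하' (h+a) is a jamo-level prefix of '한' (h+a+n), even though the two
# precomposed NFC code points are entirely different.
print(jamo_startswith("한글", "하"))     # True at the jamo level
print("한글".startswith("하"))           # False at the syllable level
```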


    [1] By encoding only basic consonants and vowels, we could have put
    all Hangul letters in a 64-character block. With all modern consonants
    and vowels encoded (complex or not) and only basic consonants/vowels
    encoded for archaic letters, it'd be slightly too large to fit into a
    64-character block. If we encoded each consonant just once (instead of
    twice, as leading and trailing, similar to the Tibetan case), we could
    make it even more compact.
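    For what it's worth, counting the modern conjoining jamo in the current
    Unicode arrangement bears out the "slightly too large" claim:

```python
# Modern jamo in the conjoining Jamo block (U+1100..U+11FF):
leading  = 0x1112 - 0x1100 + 1   # 19 modern leading consonants
vowels   = 0x1175 - 0x1161 + 1   # 21 modern vowels
trailing = 0x11C2 - 0x11A8 + 1   # 27 modern trailing consonants
print(leading + vowels + trailing)   # 67: just over a 64-character window
# Encoding each consonant once instead of twice (the suggestion above)
# would bring the total back under 64.
```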

    [2] I found that KIM Kyeong-seok (Korean representative to ISO/IEC
    JTC1/SC22/WG20 and JTC1/SC2/WG2) had also proposed a similar change to
    NFD. I'm not sure why it was submitted to JTC1 instead of the UTC (NOT
    that we can remedy the problem now).

    This archive was generated by hypermail 2.1.5 : Sat Nov 22 2003 - 22:43:19 EST