RE: Ternary search trees for Unicode dictionaries

From: Philippe Verdy (
Date: Sun Nov 23 2003 - 07:16:36 EST


    Jungshik Shin writes on Sun 23-Nov-2003 03:51 to Doug Ewell:

    > On Thu, 20 Nov 2003, Doug Ewell wrote:
    > > Jungshik Shin <jshin at mailaps dot org> wrote:
    > >
    > > The file is all in syllables, not jamos, which I guess means it's in
    > > NFC.
    > Yes, it's in NFC, then.
    > > The statistics on this file are as follows:
    > >
    > > UTF-16 6,634,430 bytes
    > > UTF-8 7,637,601 bytes
    > > SCSU 6,414,319 bytes
    > > BOCU-1 5,897,258 bytes
    > > Legacy encoding (*) 5,477,432 bytes
    > > (*) KS C 5601, KS X 1001, or EUC-KR)
    > > Only the large number of spaces and full
    > > stops in this file prevented SCSU from degenerating entirely to 2 bytes
    > > per character.
    > That's why I asked. What I'm curious about is how SCSU and BOCU
    > of NFD (and what I and Kent [2] think should have been NFD with
    > the possible
    > code point rearrangement of Jamo block to facilitate a smaller window size
    > for SCSU) would compare with uncompressed UTF-16 of NFC (SCSU/BOCU isn't
    > much better than UTF-16). The back of an envelope calculation gives me
    > 2.5 ~ 3 bytes per syllable (without the code point rearrangement to put
    > them within a 64 character-long window [1]) so it's still worse than
    > UTF-16. However, that's not as bad as ~5 bytes (or more) per syllable
    > without SCSU/BOCU-1. I have to confess that I just have a very cursory
    > understanding of SCSU/BOCU-1.
    > As for why I'm interested in NFD, one of the reasons is that it is easier
    > to implement incremental search (among many other things) in terms of
    > NFD. The only __editor__ I've ever seen support incremental search for
    > Korean per letter (as opposed to per syllable) is Hemacs (Hangul Emacs).
    > Not having incremental search per letter in Korean is like not being able
    > to match 'apples/apple/apply', 'interesting/interested', and 'learned'
    > with 'appl', 'interest' and 'learn'.
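    The per-letter matching Jungshik describes falls out of NFD naturally:
    once syllables are decomposed into jamos, an ordinary prefix test works
    letter by letter. A minimal Python sketch of the idea (the sample strings
    are my own illustration, not from the thread):

    ```python
    import unicodedata

    def jamo_prefix_match(prefix: str, text: str) -> bool:
        """Per-letter (jamo) prefix match: decompose both strings to NFD
        so that a partial syllable like '하' can match inside '학교'."""
        return unicodedata.normalize("NFD", text).startswith(
            unicodedata.normalize("NFD", prefix)
        )

    # '학' decomposes to the jamos of '하' plus a trailing ㄱ, so an
    # incremental search for '하' already matches '학교' per letter.
    print(jamo_prefix_match("하", "학교"))   # True
    # A per-syllable (NFC) comparison fails at the same point:
    print("학교".startswith("하"))           # False
    ```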

    For Korean text, I have found that a representation with "defective"
    syllables performs better through SCSU. I mean here decomposing the LVT
    syllables of the NFC form into LV and T, i.e. a partial decomposition in
    which the LV syllables stay precomposed and only the trailing jamo is
    split off.

    Then the text only contains L, V or T jamos, or LV syllables (which are
    only a small fraction of all the encoded Hangul syllables), and the text has
    this form:
            ( ( L* LV V* T* ) | X )*
    where X stands for the other non-Hangul characters (like Z-separators,
    punctuation, numbers, symbols and C-others, or some ideographic letters).
    No uncomposed L V pair occurs, since the composition step would have turned
    it into an LV syllable. One can get this form by computing the standard NFC
    algorithm but with supplementary composition exclusions for the LVT
    characters.
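    This partial decomposition is easy to express with the arithmetic of the
    precomposed Hangul block (U+AC00..U+D7A3, where syllables are ordered by L,
    then V, then T with a period of 28). A hedged Python sketch of the idea,
    not the code actually tested above:

    ```python
    import unicodedata

    S_BASE, S_LAST = 0xAC00, 0xD7A3
    T_BASE, T_COUNT = 0x11A7, 28

    def partial_decompose(text: str) -> str:
        """Split each LVT syllable into its LV syllable plus a trailing
        jamo (T); LV syllables and all other characters pass through."""
        out = []
        for ch in text:
            cp = ord(ch)
            if S_BASE <= cp <= S_LAST:
                t = (cp - S_BASE) % T_COUNT
                if t:                            # an LVT syllable
                    out.append(chr(cp - t))      # its LV part
                    out.append(chr(T_BASE + t))  # its trailing jamo
                    continue
            out.append(ch)
        return "".join(out)

    s = "한국어"          # two LVT syllables plus one LV syllable
    d = partial_decompose(s)
    assert len(d) == 5                            # 한→하+ᆫ, 국→구+ᆨ, 어 stays
    assert unicodedata.normalize("NFC", d) == s   # NFC recomposes losslessly
    ```

    Since NFC composition is the exact inverse of this split, the
    representation is lossless and can be produced or undone in a single pass.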

    With this constraint, Korean no longer behaves like Han, and the precomposed
    arrangement of LV syllables saves a lot on the SCSU window; gains are also
    significant for other binary compressors like LZW on any UTF scheme, and
    even with Huffman or arithmetic coding of the UTF-16*/UTF-32* schemes.
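    The shrinking of the working alphabet can be quantified: the precomposed
    block holds 19 × 21 × 28 = 11,172 syllables, but only the 19 × 21 = 399 of
    them with no trailing consonant are LV syllables, so after partial
    decomposition a compressor sees at most 399 syllable code points plus the
    small modern jamo ranges. A quick check of that arithmetic (my
    illustration, not from the original mail):

    ```python
    S_BASE, S_COUNT, T_COUNT = 0xAC00, 11172, 28

    # Syllables whose index in the block is divisible by 28 have no
    # trailing consonant, i.e. they are the LV syllables kept precomposed.
    lv_syllables = [S_BASE + i for i in range(S_COUNT) if i % T_COUNT == 0]

    print(len(lv_syllables))        # 399 = 19 leads x 21 vowels
    print(19 * 21 * 28)             # 11172, the whole precomposed block
    ```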

    Performance for incremental search is also better than with full
    decomposition, as there is still a significant reduction in the number of
    code points compared to NFD's full decomposition.

    However, I did not test this on very large text files like the ones you
    propose here. Maybe you could test it.


    This archive was generated by hypermail 2.1.5 : Sun Nov 23 2003 - 07:59:41 EST