SCSU/BOCU-1 Compressibility of the Yi syllabary

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Thu Jul 14 2005 - 14:04:52 CDT

    On 5 December 2004, under the title 'Nicest UTF', Doug Ewell wrote in reply
    to Philippe Verdy, as is archived at
    http://www.unicode.org/mail-arch/unicode-ml/y2004-m12/0104.html :

    >> Only the encoder may be a bit complex to write (if one wants to
    >> generate the optimal smallest result size), but even a moderate
    >> programmer could find a simple and working scheme with a still
    >> excellent compression rate (around 1 to 1.2 bytes per character on
    >> average for any Latin text, and around 1.2 to 1.5 bytes per character
    >> for Asian texts which would still be a good application of SCSU face
    >> to UTF-32 or even UTF-8).

    > If by "Asian texts" you mean CJK ideographs (*), precomposed Hangul, or Yi
    > syllables, you have no chance of doing better than 2 bytes per character.
    > This is because it is not possible in SCSU to set a dynamic window to any
    > range between U+3400 and U+DFFF, where these characters reside. Such a
    > window would be of little use anyway, because real-world texts using
    > these characters would draw from so many windows that single-byte mode
    > would be less efficient than Unicode mode, where 2 bytes per character
    > is the norm. Of course, this is still better than UTF-32 or UTF-8 for
    > these characters.
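
    The U+3400..U+DFFF gap follows from how SCSU maps a one-byte offset
    index to a window start. The sketch below reproduces that mapping as
    given in UTS #6; the function names window_offset and definable are
    illustrative only, and supplementary-plane (extended) windows are
    left out.

```python
# Sketch of the SCSU dynamic-window offset table (after UTS #6).
# Function names are illustrative, not taken from any SCSU library.
WINDOW_SIZE = 0x80  # every dynamic window covers 128 code points

# Offset bytes 0xF9..0xFF select fixed, specially placed windows.
SPECIAL_OFFSETS = {
    0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
    0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60,
}

def window_offset(x):
    """Return the window start selected by offset byte x, or None if reserved."""
    if 0x01 <= x <= 0x67:
        return x * 0x80            # U+0080 .. U+3380
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00   # U+E000 .. U+FF80
    return SPECIAL_OFFSETS.get(x)  # 0x00 and 0xA8..0xF8 are reserved

def definable(cp):
    """True if some BMP dynamic window can contain code point cp."""
    for x in range(0x100):
        start = window_offset(x)
        if start is not None and start <= cp < start + WINDOW_SIZE:
            return True
    return False
```

    With this table, definable(0xA000) is false for the first Yi syllable,
    while definable(0x2200) is true.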

    Has there been any investigation of how badly the Yi syllabary would
    compress under SCSU if dynamic windows were available for it? Actual BOCU-1
    results might give a good indication. With only 0x4C7 code points
    (U+A000 to U+A4C6, syllables and radicals together), Yi might
    perform better than one might expect. Possible reasons for improvement
    might be:

    1) Both syllables of alliterative compounds would often be in the same SCSU
    (or BOCU-1) window.

    2) Any leakage of ASCII into Yi in single-byte mode would result in the
    ASCII being encoded at one byte per character, rather than two bytes per
    character.

    3) Initial consonants have different frequencies, and so some windows would
    be needed less frequently than others. This would reduce the number of
    dynamic window redefinitions required.
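
    A rough way to test these three points on a sample text is to count how
    many 128-code-point windows it draws from and how often consecutive
    non-ASCII characters change window. The sketch below is a crude proxy
    only, not an SCSU byte count: it assumes windows aligned on multiples
    of 0x80 and ignores the eight-window limit and the cost difference
    between switching, quoting and redefining. The name window_stats is
    hypothetical.

```python
from collections import Counter

def window_stats(text, window_size=0x80):
    """Return (usage per aligned window, number of window changes).

    ASCII is skipped because it needs no dynamic window in single-byte
    mode; this is a simplification that also skips ASCII controls.
    """
    usage = Counter()
    switches = 0
    previous = None
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            continue
        window = cp // window_size  # index of the aligned 128-wide window
        usage[window] += 1
        if previous is not None and window != previous:
            switches += 1
        previous = window
    return usage, switches
```

    A text where a few windows dominate usage and the switch count is low
    relative to its length would be a good candidate for single-byte mode.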

    I'd be happy to do the analysis myself if someone could point me to
    representative Unicode-encoded texts. (I'd do the SCSU test by transposing
    the scalar values from U+A000 onwards to U+2200 onwards, where dynamic
    windows can be defined.) Of course, the quality of an SCSU compressor
    could make a big difference with a script like the Yi syllabary. For
    example, a simple tweak to my SCSU encoder improved Inuktitut (Canadian
    Aboriginal Syllabics) performance from 1.54 to 1.49 bytes per character,
    and my encoder deliberately keeps its state small: one-byte look-ahead
    and no statistics.
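
    The transposition could be sketched as follows, assuming the Yi range
    runs from U+A000 to U+A4C6 and that the sample text contains nothing
    already in U+2200..U+26C6 that the shifted characters would collide
    with. The names transpose_yi and bytes_per_char are hypothetical.
    Python's standard library has no SCSU codec, so the encoder itself is
    not shown; only the UTF-16 and UTF-8 baselines are computed.

```python
YI_START = 0xA000      # first Yi syllable
YI_END = 0xA4C6        # assumed end of the Yi range (last Yi radical)
TARGET_START = 0x2200  # a region where SCSU dynamic windows can be defined

def transpose_yi(text):
    """Shift Yi code points down to TARGET_START; leave everything else alone."""
    shift = TARGET_START - YI_START
    return "".join(
        chr(ord(ch) + shift) if YI_START <= ord(ch) <= YI_END else ch
        for ch in text
    )

def bytes_per_char(text, encoding):
    """Average encoded size per character, for comparison with SCSU results."""
    return len(text.encode(encoding)) / len(text)

sample = "\uA000\uA001\uA002"
print(transpose_yi(sample) == "\u2200\u2201\u2202")
print(bytes_per_char(sample, "utf-16-be"), bytes_per_char(sample, "utf-8"))
```

    The transposed text would then be fed to the SCSU encoder under test,
    and its output length divided by the character count in the same way.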

    Richard.


