Re: SCSU/BOCU-1 Compressibility of the Yi syllabary

From: Doug Ewell
Date: Fri Jul 15 2005 - 02:50:49 CDT


    Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:

    >> If by "Asian texts" you mean CJK ideographs (*), precomposed Hangul,
    >> or Yi syllables, you have no chance of doing better than 2 bytes per
    >> character. This is because it is not possible in SCSU to set a
    >> dynamic window to any range between U+3400 and U+DFFF, where these
    >> characters reside. Such a window would be of little use anyway,
    >> because real-world texts using these characters would draw from so
    >> many windows that single-byte mode would be less efficient than
    >> Unicode mode, where 2 bytes per character is the norm. Of course,
    >> this is still better than UTF-32 or UTF-8 for these characters.
    > Has there been any investigation of how badly the Yi syllabary would
    > compress under SCSU if dynamic windows were available for it? Actual
    > BOCU-1 results might give a good indication. With only 0x4C7
    > syllables, Yi might perform better than one might expect. Possible
    > reasons for improvement might be:
    > 1) Both syllables of alliterative compounds would often be in the same
    > SCSU (or BOCU-1) window.

    SCSU does not allow the setting of a dynamic window anywhere within the
    Yi range (U+A000 through U+A4C6). The only way to encode Yi text in
    SCSU is to use "Unicode mode," encoding each character in 2 bytes (MSB,
    LSB). This is stated in the text you quoted.

    It's possible that some sequences of Yi might benefit from being
    encodable in a dynamic window, but since it is not possible to do so,
    the point is moot.
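
    As a rough illustration of the restriction, the sketch below tests
    whether a BMP code point can fall inside any SCSU dynamic window. The
    limits follow from the window offset table in UTS #6: windows start at
    multiples of 0x80 from U+0080 to U+3380 and from U+E000 to U+FF80, and
    the few fixed window positions lie inside those same ranges. The
    function name is made up for this example, and supplementary-plane
    (extended) windows are deliberately left out.

```python
def in_dynamic_window_range(cp: int) -> bool:
    """Return True if BMP code point `cp` can be covered by an SCSU dynamic window.

    Assumption: only BMP windows are considered; extended windows for
    supplementary code points are ignored in this sketch.
    """
    # Each window spans 0x80 code points, so the reachable areas are
    # U+0080..U+33FF and U+E000..U+FFFF.
    return 0x0080 <= cp <= 0x33FF or 0xE000 <= cp <= 0xFFFF


if __name__ == "__main__":
    for name, cp in [("Yi", 0xA000), ("Canadian Syllabics", 0x1400), ("Ethiopic", 0x1200)]:
        print(f"{name} U+{cp:04X}: {in_dynamic_window_range(cp)}")
```

    Everything from U+3400 through U+DFFF, including the whole Yi range,
    comes out False.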

    > 2) Any leakage of ASCII into Yi in single-byte mode would result in
    > the ASCII being encoded at one byte per character, rather than two
    > bytes per character.

    Sufficiently long sequences of ASCII characters might justify a switch
    out of Unicode mode into single-byte mode, where each ASCII character
    takes one byte. The run has to be long enough that the bytes saved
    outweigh the cost of switching modes and switching back.
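
    To put a number on "sufficiently long", here is a back-of-the-envelope
    sketch. It assumes the ASCII run sits between two stretches of Yi, so
    the encoder must spend one tag byte to enter single-byte mode (UCn)
    and one to return to Unicode mode (SCU), and that the run contains
    only printable ASCII, tab, CR, or LF, which need no quoting. The
    function names are hypothetical.

```python
def unicode_mode_cost(run_length: int) -> int:
    """Bytes needed to leave an ASCII run in Unicode mode (2 bytes per character)."""
    return 2 * run_length


def switch_cost(run_length: int) -> int:
    """Bytes needed to switch to single-byte mode, emit the run, and switch back.

    Assumption: one tag byte each way (UCn to enter, SCU to return).
    """
    return 1 + run_length + 1


def switch_pays_off(run_length: int) -> bool:
    """True if switching modes is strictly smaller than staying in Unicode mode."""
    return switch_cost(run_length) < unicode_mode_cost(run_length)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, unicode_mode_cost(n), switch_cost(n), switch_pays_off(n))
```

    Under these assumptions the two costs are equal at two characters, and
    the switch starts to save bytes at three.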

    > I'd be happy to do the analysis myself if someone could point me to
    > representative Unicode-encoded texts. (I'd do the SCSU test by
    > transposing the scalar values from A000 onwards to 2200 onwards.) Of
    > course, the quality of a SCSU compressor could make a big difference
    > with a script like the Yi syllabary. For example, a simple tweak to
    > my SCSU encoder improved Inuktitut (Canadian Aboriginal Syllabics)
    > performance from 1.54 to 1.49 bytes per character, and my encoder
    > deliberately keeps its state small - one byte look-ahead and no
    > statistics.
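
    The transposition described in the quoted paragraph could look roughly
    like the following sketch: Yi code points are shifted down so that
    U+A000 lands on U+2200, an area where SCSU windows can be set. The
    upper limit U+A4C6 is taken from the Yi range given in this message;
    the constant and function names are invented for illustration. The
    shifted text collides with real characters at U+2200 and above, so it
    is only useful as input to a compression experiment.

```python
YI_FIRST = 0xA000      # start of the Yi range
YI_LAST = 0xA4C6       # end of the Yi range as given in this thread
TARGET_FIRST = 0x2200  # where the transposed range begins


def transpose_yi(text: str) -> str:
    """Shift Yi code points to U+2200 onwards; leave everything else alone."""
    out = []
    for ch in text:
        cp = ord(ch)
        if YI_FIRST <= cp <= YI_LAST:
            cp = cp - YI_FIRST + TARGET_FIRST
        out.append(chr(cp))
    return "".join(out)


if __name__ == "__main__":
    sample = "\ua000\ua001 abc"
    print([f"U+{ord(c):04X}" for c in transpose_yi(sample)])
```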

    Inuktitut is a different case, because a SCSU window can be set within
    the Canadian Syllabics range. Likewise for Ethiopic.
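
    A small sketch of what that means in practice: for code points between
    U+0080 and U+33FF, the window offset byte in UTS #6 is simply the code
    point divided by 0x80, and each window covers 0x80 code points. The
    block boundaries used below (U+1400..U+167F for Canadian Syllabics,
    U+1200..U+137F for Ethiopic) are the Unicode block ranges; the function
    names are hypothetical.

```python
from typing import Optional


def window_offset_byte(cp: int) -> Optional[int]:
    """Offset byte of the 0x80-aligned window containing `cp`, or None.

    Assumption: only the U+0080..U+33FF part of the offset table is
    handled, which is all this example needs.
    """
    if not 0x0080 <= cp <= 0x33FF:
        return None
    return cp // 0x80


def windows_needed(first: int, last: int) -> int:
    """Number of distinct 0x80-aligned windows touching the range first..last."""
    return len({window_offset_byte(cp) for cp in range(first, last + 1)})


if __name__ == "__main__":
    print(hex(window_offset_byte(0x1400)), windows_needed(0x1400, 0x167F))
    print(hex(window_offset_byte(0x1200)), windows_needed(0x1200, 0x137F))
    print(window_offset_byte(0xA000))
```

    Both scripts span more than one window, which is one reason an encoder
    for them ends up above one byte per character, but unlike Yi they can
    at least be reached from single-byte mode.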

    Doug Ewell
    Fullerton, California
