Re: Nicest UTF

From: Doug Ewell
Date: Sun Dec 05 2004 - 23:26:19 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > Only the encoder may be a bit complex to write (if one wants to
    > generate the optimal smallest result size), but even a moderate
    > programmer could find a simple and working scheme with a still
    > excellent compression rate (around 1 to 1.2 bytes per character on
    > average for any Latin text, and around 1.2 to 1.5 bytes per character
    > for Asian texts which would still be a good application of SCSU face
    > to UTF-32 or even UTF-8).

    If by "Asian texts" you mean CJK ideographs (*), precomposed Hangul, or
    Yi syllables, you have no chance of doing better than 2 bytes per
    character. This is because it is not possible in SCSU to set a dynamic
    window to any range between U+3400 and U+DFFF, where these characters
    reside. Such a window would be of little use anyway, because real-world
    texts using these characters would draw from so many windows that
    single-byte mode would be less efficient than Unicode mode, where 2
    bytes per character is the norm. Of course, this is still better than
    UTF-32 (4 bytes per character) or UTF-8 (3 bytes per character) for
    these characters.
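
    The window restriction can be checked with a short sketch. It assumes
    the dynamic-window offset table as given in UTS #6 and ignores SDX
    extended windows, which address supplementary planes only. The names
    window_start, SPECIAL_OFFSETS and RANGES are illustrative and not
    part of any library.

```python
# Sketch of the SCSU dynamic-window offset table (per UTS #6).
# All names here are illustrative, not part of any library.

WINDOW_SIZE = 0x80  # every dynamic window spans 128 code points

# Offset bytes with fixed, script-specific window positions.
SPECIAL_OFFSETS = {
    0xF9: 0x00C0,  # Latin-1 letters + part of Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}


def window_start(offset_byte):
    """Return the first code point of the window, or None if reserved."""
    if 0x01 <= offset_byte <= 0x67:
        return offset_byte * 0x80            # U+0080 .. U+3380
    if 0x68 <= offset_byte <= 0xA7:
        return offset_byte * 0x80 + 0xAC00   # U+E000 .. U+FF80
    return SPECIAL_OFFSETS.get(offset_byte)  # 0x00 and 0xA8..0xF8 are reserved


# Every (first, last) code point range reachable by some dynamic window.
RANGES = sorted(
    (start, start + WINDOW_SIZE - 1)
    for start in map(window_start, range(256))
    if start is not None
)

# True if no window touches U+3400..U+DFFF.
gap_is_unreachable = all(hi < 0x3400 or lo > 0xDFFF for lo, hi in RANGES)
print(gap_is_unreachable)
```

    The highest window below the gap starts at U+3380 and the next one
    starts at U+E000, so CJK ideographs, Hangul syllables and Yi
    syllables all fall outside what single-byte mode can reach.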

    For Katakana and Hiragana, you can get the same efficiency with SCSU as
    for other small scripts, but very few texts are written in pure kana
    except those intended for young children.
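
    As a rough illustration of the kana case, the sketch below hand-encodes
    a pure Hiragana string in SCSU single-byte mode and compares sizes. It
    is a minimal sketch, not a general SCSU encoder: it assumes the SD0 tag
    (0x18) and the predefined Hiragana offset byte (0xFD) from UTS #6, and
    the function name scsu_encode_hiragana is hypothetical.

```python
# Minimal sketch: encode text that lies entirely inside the SCSU
# Hiragana window (U+3040..U+30BF). Not a general SCSU encoder.

SD0 = 0x18                   # tag: define dynamic window 0 and select it
HIRAGANA_OFFSET_BYTE = 0xFD  # offset byte that places the window at U+3040
HIRAGANA_WINDOW = 0x3040
WINDOW_SIZE = 0x80


def scsu_encode_hiragana(text):
    """Encode text as SD0 + offset byte, then one byte per character."""
    out = bytearray([SD0, HIRAGANA_OFFSET_BYTE])
    for ch in text:
        delta = ord(ch) - HIRAGANA_WINDOW
        if not 0 <= delta < WINDOW_SIZE:
            raise ValueError(f"U+{ord(ch):04X} is outside the Hiragana window")
        out.append(0x80 + delta)  # bytes 0x80..0xFF index the active window
    return bytes(out)


sample = "ひらがな"
sizes = {
    "SCSU": len(scsu_encode_hiragana(sample)),
    "UTF-8": len(sample.encode("utf-8")),
    "UTF-16": len(sample.encode("utf-16-be")),
    "UTF-32": len(sample.encode("utf-32-be")),
}
print(sizes)
```

    The two-byte window setup is paid once, so longer kana runs approach
    1 byte per character. A single kanji in the text would raise
    ValueError here; a real encoder would have to quote it or switch to
    Unicode mode.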

    Sorry for missing this point in my earlier post.

    -Doug Ewell
     Fullerton, California
     (*) No, I'm not interested in arguing over this word.

