From: Doug Ewell (firstname.lastname@example.org)
Date: Sun Dec 05 2004 - 23:26:19 CST
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> Only the encoder may be a bit complex to write (if one wants to
> generate the optimal smallest result size), but even a moderate
> programmer could find a simple and working scheme with a still
> excellent compression rate (around 1 to 1.2 bytes per character on
> average for any Latin text, and around 1.2 to 1.5 bytes per character
> for Asian texts which would still be a good application of SCSU face
> to UTF-32 or even UTF-8).
If by "Asian texts" you mean CJK ideographs (*), precomposed Hangul, or
Yi syllables, you have no chance of doing better than 2 bytes per
character. This is because it is not possible in SCSU to set a dynamic
window to any range between U+3400 and U+DFFF, where these characters
reside. Such a window would be of little use anyway, because real-world
texts using these characters would draw from so many windows that
single-byte mode would be less efficient than Unicode mode, where 2
bytes per character is the norm. Of course, this is still better than
UTF-32 or UTF-8 for these characters.
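
To make the window restriction concrete, here is a small Python sketch that enumerates every position a dynamic window can be set to and checks whether any of them touches U+3400..U+DFFF. The offset table is my reading of UTS #6, not something quoted in this thread, and the names `FIXED_OFFSETS` and `window_offset` are made up for illustration.

```python
# Offset-byte table as I understand it from UTS #6 (assumption: check the
# spec before relying on these values).
FIXED_OFFSETS = {
    0xF9: 0x00C0,  # Latin-1 letters + half of Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}

def window_offset(x):
    """Start of the 128-character window selected by offset byte x,
    or None if x is reserved."""
    if 0x01 <= x <= 0x67:
        return x * 0x80            # U+0080 .. U+3380
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00   # U+E000 .. U+FF80
    return FIXED_OFFSETS.get(x)

# Every window start reachable with a one-byte offset.
starts = sorted(o for o in map(window_offset, range(0x100)) if o is not None)

# Windows whose 128-character span overlaps U+3400..U+DFFF.
overlapping = [o for o in starts if o <= 0xDFFF and o + 0x7F >= 0x3400]

print(overlapping)
print(hex(max(o for o in starts if o < 0x3400)),
      hex(min(o for o in starts if o > 0xDFFF)))
```

If the table is right, the list comes out empty: the highest window below the gap starts at U+3380 and ends at U+33FF, and the next one starts at U+E000.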
For Katakana and Hiragana, you can get the same efficiency with SCSU as
for other small scripts, but very few texts, other than those written
for young children, are in pure kana.
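
As a rough illustration of the kana case, the sketch below hand-encodes a pure-Hiragana string the way I believe a simple SCSU encoder would: one tag byte to select the predefined Hiragana window, then one byte per character. The tag value `SC5 = 0x15` and the default position of window 5 at U+3040 are my reading of UTS #6, and `scsu_hiragana` is a hypothetical helper, not a complete encoder.

```python
SC5 = 0x15                 # assumed: "select dynamic window 5" tag in single-byte mode
HIRAGANA_WINDOW = 0x3040   # assumed: default start of dynamic window 5

def scsu_hiragana(text):
    """Encode text that lies entirely inside U+3040..U+30BF."""
    out = bytearray([SC5])
    for ch in text:
        delta = ord(ch) - HIRAGANA_WINDOW
        if not 0 <= delta < 0x80:
            raise ValueError(f"U+{ord(ch):04X} is outside the Hiragana window")
        out.append(0x80 + delta)   # window characters map to bytes 0x80..0xFF
    return bytes(out)

sample = "ひらがな"
encoded = scsu_hiragana(sample)
print(len(encoded), len(sample.encode("utf-8")), len(sample.encode("utf-32-be")))
# prints: 5 12 16
```

The one tag byte is paid only once, so a longer run of kana gets closer to exactly 1 byte per character.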
Sorry for missing this point in my earlier post.
(*) No, I'm not interested in arguing over this word.