From: Doug Ewell (email@example.com)
Date: Fri Jul 15 2005 - 02:50:49 CDT
Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
>> If by "Asian texts" you mean CJK ideographs (*), precomposed Hangul,
>> or Yi syllables, you have no chance of doing better than 2 bytes per
>> character. This is because it is not possible in SCSU to set a
>> dynamic window to any range between U+3400 and U+DFFF, where these
>> characters reside. Such a window would be of little use anyway,
>> because real-world texts using these characters would draw from so
>> many windows that single-byte mode would be less efficient than
>> Unicode mode, where 2 bytes per character is the norm. Of course,
>> this is still better than UTF-32 or UTF-8 for these characters.
> Has there been any investigation of how badly the Yi syllabary would
> compress under SCSU if dynamic windows were available for it? Actual
> BOCU-1 results might give a good indication. With only 0x4C7
> syllables, Yi might perform better than one might expect. Possible
> reasons for improvement might be:
> 1) Both syllables of alliterative compounds would often be in the same
> SCSU (or BOCU-1) window.
SCSU does not allow the setting of a dynamic window anywhere within the
Yi range (U+A000 through U+A4C6). The only way to encode Yi text in
SCSU is to use "Unicode mode," encoding each character in 2 bytes (MSB,
LSB). This is stated in the text you quoted.
It's possible that some sequences of Yi might benefit from being
encodable in a dynamic window, but since it is not possible to do so,
the point is moot.
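
To illustrate why, here is a minimal Python sketch of the dynamic-window offset rule as I understand it from UTS #6 (the SCSU specification). The names window_offset and offset_bytes_covering are my own, chosen for illustration, and the fixed-offset table should be checked against the specification before anyone relies on it.

```python
# Sketch of the SCSU dynamic-window offset rule (per UTS #6, as I read it).
# Function names are illustrative, not part of any library.

# Offset bytes 0xF9..0xFF select fixed, script-specific window positions.
FIXED_OFFSETS = {
    0xF9: 0x00C0,  # Latin-1 letters and part of Latin Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}

WINDOW_SIZE = 0x80  # each dynamic window covers 128 code points


def window_offset(x):
    """Return the BMP window start for offset byte x, or None if reserved."""
    if 0x01 <= x <= 0x67:
        return x * 0x80            # 0x0080 .. 0x3380
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00   # 0xE000 .. 0xFF80
    return FIXED_OFFSETS.get(x)    # 0x00 and 0xA8..0xF8 are reserved


def offset_bytes_covering(cp):
    """List every offset byte whose window would contain code point cp."""
    result = []
    for x in range(0x100):
        start = window_offset(x)
        if start is not None and start <= cp < start + WINDOW_SIZE:
            result.append(x)
    return result


if __name__ == "__main__":
    for cp in (0x33FF, 0x3400, 0xA000, 0xA4C6, 0xDFFF, 0xE000):
        print(f"U+{cp:04X}: {[hex(x) for x in offset_bytes_covering(cp)]}")
```

Under these assumptions, no offset byte yields a window containing any code point from U+3400 through U+DFFF, which includes the whole Yi range.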
> 2) Any leakage of ASCII into Yi in single-byte mode would result in
> the ASCII being encoded at one byte per character, rather than two
> bytes per character.
Sufficiently long sequences of ASCII characters might justify a switch
out of Unicode mode into single-byte mode, since the bytes saved would
outweigh the cost of the mode-switching tags.
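
As a rough illustration of "sufficiently long," the following Python sketch compares the two costs. It assumes printable ASCII, a one-byte tag to leave Unicode mode (UCn) and a one-byte tag to return to it (SCU), and that window n already covers ASCII. The function names are hypothetical.

```python
# Back-of-the-envelope byte counts for an ASCII run embedded in text
# that is otherwise encoded in SCSU Unicode mode. Names are illustrative.

def unicode_mode_cost(run_length):
    """Each ASCII character costs 2 bytes (MSB, LSB) in Unicode mode."""
    return 2 * run_length


def switch_cost(run_length, return_to_unicode=True):
    """One tag byte to enter single-byte mode, 1 byte per ASCII character,
    and optionally one tag byte to go back to Unicode mode afterwards."""
    return 1 + run_length + (1 if return_to_unicode else 0)


def min_profitable_run(return_to_unicode=True):
    """Smallest run length for which switching modes is strictly smaller."""
    n = 1
    while switch_cost(n, return_to_unicode) >= unicode_mode_cost(n):
        n += 1
    return n


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, unicode_mode_cost(n), switch_cost(n))
```

Under these assumptions a run of two ASCII characters breaks even (4 bytes either way) and a run of three or more saves space. If the ASCII run ends the text, so that no return tag is needed, two characters are already enough.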
> I'd be happy to do the analysis myself if someone could point me to
> representative Unicode-encoded texts. (I'd do the SCSU test by
> transposing the scalar values from A000 onwards to 2200 onwards.) Of
> course, the quality of a SCSU compressor could make a big difference
> with a script like the Yi syllabary. For example, a simple tweak to
> my SCSU encoder improved Inuktitut (Canadian Aboriginal Syllabics)
> performance from 1.54 to 1.49 bytes per character, and my encoder
> deliberately keeps its state small - one byte look-ahead and no
This is different, because a SCSU window can be set to the Canadian
Syllabics range. Likewise for Ethiopic.
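
For a sense of scale, here is a small Python sketch that counts how many 128-character windows are needed to cover each block. It assumes the block ranges U+1400 through U+167F for Canadian Syllabics and U+1200 through U+137F for Ethiopic, and that windows below U+3400 start on multiples of 0x80. The function name is hypothetical.

```python
# Count the 128-code-point SCSU windows needed to cover a block that lies
# below U+3400, where dynamic windows start on multiples of 0x80.
# The block ranges below are assumptions stated in the lead-in.

WINDOW_SIZE = 0x80
DYNAMIC_WINDOW_SLOTS = 8  # SCSU keeps eight dynamic windows at a time


def windows_needed(first, last):
    """Number of aligned 0x80-sized windows touching first..last inclusive."""
    return last // WINDOW_SIZE - first // WINDOW_SIZE + 1


BLOCKS = {
    "Canadian Syllabics": (0x1400, 0x167F),
    "Ethiopic": (0x1200, 0x137F),
}

if __name__ == "__main__":
    for name, (first, last) in BLOCKS.items():
        n = windows_needed(first, last)
        fits = n <= DYNAMIC_WINDOW_SLOTS
        print(f"{name}: {n} windows, all definable at once: {fits}")
```

With those ranges, Canadian Syllabics spans five windows and Ethiopic three, so either block fits within the eight dynamic window slots, although an encoder still spends tag bytes whenever the text moves from one window to another.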
-- Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/