SCSU/BOCU-1 Compressibility of the Yi syllabary

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Thu Jul 14 2005 - 14:04:52 CDT

Next message: Donald Z. Osborn: "Questions re ISO-639-1,2,3"

Previous message: Peter Constable: "RE: Questions re ISO-639-1,2,3"
Next in thread: Doug Ewell: "Re: SCSU/BOCU-1 Compressibility of the Yi syllabary"
Reply: Doug Ewell: "Re: SCSU/BOCU-1 Compressibility of the Yi syllabary"

On 5 December 2004, under the title 'Nicest UTF', Doug Ewell wrote in reply
to Philippe Verdy, as is archived at
http://www.unicode.org/mail-arch/unicode-ml/y2004-m12/0104.html :

>> Only the encoder may be a bit complex to write (if one wants to
>> generate the optimal smallest result size), but even a moderate
>> programmer could find a simple and working scheme with a still
>> excellent compression rate (around 1 to 1.2 bytes per character on
>> average for any Latin text, and around 1.2 to 1.5 bytes per character
>> for Asian texts which would still be a good application of SCSU face
>> to UTF-32 or even UTF-8).

> If by "Asian texts" you mean CJK ideographs (*), precomposed Hangul, or Yi
> syllables, you have no chance of doing better than 2 bytes per character.
> This is because it is not possible in SCSU to set a dynamic window to any
> range between U+3400 and U+DFFF, where these characters reside. Such a
> window would be of little use anyway, because real-world texts using these
> characters would draw from so many windows that single-byte mode would be
> less efficient than Unicode mode, where 2 bytes per character is the norm.
> Of course, this is still better than UTF-32 or UTF-8 for these characters.
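
The window restriction described in the quoted text can be checked with a
short sketch. It assumes the offset-byte mapping as I understand it from the
SCSU specification (UTS #6); the function names are made up for illustration,
and only BMP windows are considered.

```python
# Sketch of the SCSU dynamic-window offset table for the BMP.
# The mapping is assumed from UTS #6; the names here are illustrative only.
WINDOW_SIZE = 0x80

# Offset bytes that select fixed, special-cased window positions.
SPECIAL_OFFSETS = {
    0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
    0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60,
}

def window_start(offset_byte):
    """Return the window start for an offset byte, or None if it is reserved."""
    if 0x01 <= offset_byte <= 0x67:
        return offset_byte * WINDOW_SIZE           # U+0080 .. U+3380
    if 0x68 <= offset_byte <= 0xA7:
        return offset_byte * WINDOW_SIZE + 0xAC00  # U+E000 .. U+FF80
    return SPECIAL_OFFSETS.get(offset_byte)

def bmp_window_reachable(cp):
    """True if some dynamic window position covers code point cp."""
    for offset_byte in range(0x100):
        start = window_start(offset_byte)
        if start is not None and start <= cp < start + WINDOW_SIZE:
            return True
    return False

if __name__ == "__main__":
    for cp in (0x1400, 0x33FF, 0x3400, 0xA000, 0xDFFF, 0xE000):
        print(f"U+{cp:04X}", bmp_window_reachable(cp))
```

Under this table no window position covers U+A000, which is why the Yi
syllabary falls back to Unicode mode in a real SCSU encoder.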

Has there been any investigation of how badly the Yi syllabary would
compress under SCSU if dynamic windows were available for it? Actual BOCU-1
results might give a good indication. With only 0x4C7 code points to cover
(U+A000 to U+A4C6), Yi might perform better than one might expect. Possible
reasons for improvement might be:

1) Both syllables of alliterative compounds would often be in the same SCSU
(or BOCU-1) window.

2) Any leakage of ASCII into Yi in single-byte mode would result in the
ASCII being encoded at one byte per character, rather than two bytes per
character.

3) Initial consonants have different frequencies, and so some windows would
be needed less frequently than others. This would reduce the number of
dynamic window redefinitions required.
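
The three effects above could be measured with a rough cost model along the
following lines. This is only a sketch under stated assumptions: windows of
128 code points may sit anywhere (which real SCSU does not allow for Yi), the
encoder has no look-ahead and never quotes single characters, the eight
window slots are recycled least-recently-used first, and every ASCII
character is counted as one byte. The function name and the sample string are
hypothetical; the sample is arbitrary code points, not real Yi text.

```python
from collections import OrderedDict

WINDOW_SIZE = 0x80
SLOTS = 8  # SCSU provides eight dynamic window slots

def estimate_bytes(text, slots=SLOTS):
    """Estimate single-byte-mode size, assuming windows could be placed anywhere."""
    windows = OrderedDict()  # window start -> None, least recently used first
    active = None
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            total += 1  # ASCII passes through (controls simplified to one byte)
            continue
        start = cp - cp % WINDOW_SIZE
        if start != active:
            if start in windows:
                total += 1  # SCn: select a window that is already defined
            else:
                total += 2  # SDn plus offset byte: define and select a window
                if len(windows) >= slots:
                    windows.popitem(last=False)  # recycle the least recently used slot
                windows[start] = None
            active = start
        windows.move_to_end(start)  # mark this window as most recently used
        total += 1  # the character itself, as one byte within the window
    return total

if __name__ == "__main__":
    sample = "\ua000\ua001 ab\ua080\ua000"  # arbitrary code points, not real text
    print(estimate_bytes(sample), "bytes in single-byte mode vs",
          2 * len(sample), "bytes in Unicode mode")
```

Dividing the result by the number of characters gives a bytes-per-character
figure to set against the 2 bytes per character of Unicode mode.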

I'd be happy to do the analysis myself if someone could point me to
representative Unicode-encoded texts. (I'd do the SCSU test by transposing
the scalar values from U+A000 onwards to U+2200 onwards.) Of course, the
quality of an SCSU compressor could make a big difference with a script like
the Yi syllabary. For example, a simple tweak to my SCSU encoder improved
Inuktitut (Canadian Aboriginal Syllabics) performance from 1.54 to 1.49
bytes per character, and my encoder deliberately keeps its state small - one
byte of look-ahead and no statistics.
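
The transposition mentioned above might look like the sketch below. The upper
bound U+A4CF (the end of the Yi Radicals block) is an assumption, since the
text only says "from A000 onwards", and the function name is made up. Because
the shift is a multiple of 0x80, code points that share a 128-code-point
window before the shift still share one afterwards.

```python
YI_START = 0xA000      # first Yi syllable
YI_END = 0xA4CF        # assumed upper bound: end of the Yi Radicals block
TARGET_START = 0x2200  # a region where SCSU dynamic windows can be placed

def transpose_yi(text):
    """Shift Yi code points down so an unmodified SCSU encoder can window them."""
    shift = TARGET_START - YI_START  # -0x7E00, a multiple of the window size 0x80
    return "".join(
        chr(ord(ch) + shift) if YI_START <= ord(ch) <= YI_END else ch
        for ch in text
    )

if __name__ == "__main__":
    shifted = transpose_yi("\ua000\ua48c")
    print([f"U+{ord(ch):04X}" for ch in shifted])
```

A test text that already uses U+2200 to U+26CF (mathematical operators and
various symbols) would collide with the shifted Yi characters, so such texts
would need separate handling.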

Richard.

This archive was generated by hypermail 2.1.5 : Thu Jul 14 2005 - 14:07:05 CDT