Re: Compressing Vietnamese with SCSU

From: Doug Ewell (dewell@compuserve.com)
Date: Wed Apr 19 2000 - 10:21:01 EDT


Kenneth Whistler <kenw@sybase.com> wrote:

> I think it would be appropriate also to check Windows Code Page 1258,
> which uses a hybrid strategy of precomposed characters for the base
> vowels, plus combining marks for the tones. If you convert Windows
> 1258 data directly to Unicode, you avoid all the 1EXX code points, and
> might get better behavior with SCSU. (Although you also start out with
> more voluminous data, since the tones are separately represented.)

A lot of Vietnamese text on the Web appears to be encoded in VISCII,
but I will hunt around for some text in CP1258. I would definitely
expect better compression if the U+1E80 to U+1EFF block is excluded.

Part of the reason I chose to test Vietnamese was exactly that its far-
flung Unicode code points for commonly used characters might provide an
interesting edge case for SCSU compression.

Frankly, I didn't know CP1258 used combining marks in this way, since I
wasn't aware that a Windows display engine would be able to put them
together properly. Things you learn!

I actually expected someone to ask why I didn't convert the Vietnamese
to canonical decompositions (which would certainly result in better
SCSU performance), and the answer is that (a) decomposition isn't a part
of the SCSU spec and (b) the source data would indeed be quite a bit
more voluminous.
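
To make the size trade-off concrete, here is a minimal Python sketch (not part of my original test) comparing the three representations discussed above: fully precomposed, the CP1258-style hybrid, and canonical decomposition. The helper names hybrid, count_1exx, and summarize are hypothetical, and hybrid only approximates what a CP1258-to-Unicode conversion would yield by pulling the five tone marks out of each character; it does not use a real CP1258 converter, and it measures code points rather than SCSU output bytes.

```python
import unicodedata

# Assumption: the five Vietnamese tone marks as combining characters
# (grave, acute, tilde, hook above, dot below).
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}

def hybrid(text):
    """Approximate the CP1258-style form: precomposed base vowel,
    followed by a separate combining tone mark."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        parts = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in parts if c not in TONE_MARKS)
        tones = "".join(c for c in parts if c in TONE_MARKS)
        # Recompose everything except the tone (e.g. e + circumflex).
        out.append(unicodedata.normalize("NFC", base) + tones)
    return "".join(out)

def count_1exx(text):
    """Count code points that fall in U+1E00..U+1EFF."""
    return sum(0x1E00 <= ord(c) <= 0x1EFF for c in text)

def summarize(text):
    """Return {form: (code point count, 1Exx count)} for each form."""
    forms = {
        "precomposed": unicodedata.normalize("NFC", text),
        "hybrid": hybrid(text),
        "decomposed": unicodedata.normalize("NFD", text),
    }
    return {name: (len(s), count_1exx(s)) for name, s in forms.items()}

if __name__ == "__main__":
    # "Viet" with U+1EC7 (e with circumflex and dot below).
    for name, counts in summarize("Vi\u1ec7t").items():
        print(name, counts)
```

For this one word, the precomposed form needs 4 code points (one of them in the 1Exx range), the hybrid form 5, and the decomposed form 6, with no 1Exx code points in either of the latter two. Whether the longer input still compresses to fewer bytes under SCSU is exactly what would need to be measured.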

-Doug Ewell
 Fullerton, California


