Kenneth Whistler <firstname.lastname@example.org> wrote:
> I think it would be appropriate also to check Windows Code Page 1258,
> which uses a hybrid strategy of precomposed characters for the base
> vowels, plus combining marks for the tones. If you convert Windows
> 1258 data directly to Unicode, you avoid all the 1EXX code points, and
> might get better behavior with SCSU. (Although you also start out with
> more voluminous data, since the tones are separately represented.)
A lot of Vietnamese text on the Web appears to be encoded in VISCII,
but I will hunt around for some text in CP1258. I would definitely
expect better compression if the U+1E80 to U+1EFF range is excluded.
Part of the reason I chose to test Vietnamese was exactly that its far-
flung Unicode code points for commonly used characters might provide an
interesting edge case for SCSU compression.
Frankly, I didn't know CP1258 used combining marks in this way, since I
wasn't aware that a Windows display engine would be able to put them
together properly. Things you learn!
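
For anyone who wants to see the difference concretely, here is a rough Python sketch, not the test I actually ran. It compares one Vietnamese word in three forms: fully precomposed, the CP1258-style hybrid (precomposed base vowel plus combining tone mark), and full canonical decomposition. The sample word, the variable names and the summarize() helper are my own illustrative choices, and I am assuming Python's standard cp1258 codec and Ken's "1EXX" range of U+1E00 to U+1EFF.

```python
import unicodedata

def summarize(text):
    """Return (total code points, code points in U+1E00..U+1EFF)."""
    in_1exx = sum(1 for ch in text if 0x1E00 <= ord(ch) <= 0x1EFF)
    return len(text), in_1exx

# Sample word (illustrative): "Viet" with circumflex and dot below on the e.
precomposed = "Vi\u1ec7t"        # single precomposed letter U+1EC7
hybrid = "Vi\u00ea\u0323t"       # U+00EA (e circumflex) + U+0323 (dot below)
decomposed = unicodedata.normalize("NFD", precomposed)

# The hybrid form is what survives a trip through CP1258 unchanged,
# because the codec maps bytes to code points one for one.
cp1258_bytes = hybrid.encode("cp1258")
roundtrip = cp1258_bytes.decode("cp1258")

for label, text in [("precomposed", precomposed),
                    ("hybrid", hybrid),
                    ("decomposed", decomposed)]:
    total, in_1exx = summarize(text)
    print(f"{label}: {total} code points, {in_1exx} in U+1E00..U+1EFF")
```

The hybrid and decomposed forms avoid the 1EXX code points entirely but cost one or two extra code points per toned vowel, which is the trade-off Ken describes. How that nets out under SCSU depends on the encoder and on the text, so this only illustrates the input side.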
I actually expected someone to ask why I didn't convert the Vietnamese
to canonical decompositions (which would certainly result in better
SCSU performance), and the answer is that (a) decomposition isn't a part
of the SCSU spec and (b) the source data would indeed be quite a bit