Re: All-kana documents

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Mar 04 2002 - 20:02:37 EST


> If I have some all-kana documents ..., is there an
> extension of UTF-8 that will allow me to strip off the redundant "this is
> kana" byte from most of the kana?

No.

> After the first few thousand kana, it
> might be like, "Yeah, we get it already! It's kana! It's KANA!! You can
> stop reminding us now!!"

If I decide to emulate the Buddha and fill text files with a million
DEVANAGARI OM symbols in a row, each instance is still U+0950, whether
represented in UTF-16 or UTF-8 (or UTF-32, for that matter).

Stop thinking in terms of bytes and start thinking in terms of
characters.
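
To make that concrete, here is a minimal Python sketch. The variable names
are illustrative only. It encodes U+0950 in three encoding forms and checks
that each decodes back to the same character. It also shows that the two
hiragana in the sample string each start with the same lead byte in UTF-8;
that byte is part of how the character is encoded, not a separable label.

```python
# One character, three encoding forms.
om = "\u0950"  # DEVANAGARI OM

utf8 = om.encode("utf-8")       # three bytes
utf16 = om.encode("utf-16-be")  # one 16-bit code unit
utf32 = om.encode("utf-32-be")  # one 32-bit code unit

print(utf8.hex(), utf16.hex(), utf32.hex())

# Whatever the byte sequence, decoding yields the same single character.
decoded = {
    utf8.decode("utf-8"),
    utf16.decode("utf-16-be"),
    utf32.decode("utf-32-be"),
}
assert decoded == {om}

# Two hiragana, U+304B and U+306A: each is three bytes in UTF-8,
# and each byte sequence begins with 0xE3.
kana = "\u304b\u306a"
kana_bytes = [ch.encode("utf-8") for ch in kana]
lead_bytes = {b[0] for b in kana_bytes}
print([b.hex() for b in kana_bytes], [hex(x) for x in lead_bytes])
```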

For that matter, say you were reading the genetic code:
ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine;
ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine;
ATG, Methionine; ATG, Methionine; ATG, Methionine; ATG, Methionine;...

Yeah, we get it already! It's methionine! It's METHIONINE!! You can
stop reminding us now!!

A code is what it is.

>
> This goes for Hebrew, Greek, etc., too.

What you are looking for are text compression algorithms. See UTS #6,
A Standard Compression Scheme for Unicode.
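
As a rough illustration of compression as a separate layer, here is a short
Python sketch. Note that it uses zlib, a general-purpose compressor from the
Python standard library, as a stand-in; it is not SCSU, and the sample text
is a made-up kana string repeated to mimic a long all-kana document. The
characters stay what they are; only the stored byte count shrinks.

```python
import zlib

# Hypothetical sample: a short kana phrase repeated many times.
text = "ひらがなとカタカナ" * 1000

raw = text.encode("utf-8")    # plain UTF-8, three bytes per kana
packed = zlib.compress(raw)   # compressed form for storage or transport

print(len(text), "characters")
print(len(raw), "bytes as UTF-8")
print(len(packed), "bytes after zlib")

# Decompressing and decoding gives back exactly the same characters.
restored = zlib.decompress(packed).decode("utf-8")
assert restored == text
```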

--Ken


