Re: Unicode, SMS and year 2012

From: Doug Ewell <doug_at_ewellic.org>
Date: Sat, 28 Apr 2012 22:28:50 -0600

Richard Wordingham wrote:

> With SCSU that avoids Unicode mode and UQU whenever possible, most
> alphabetic languages work fairly well. However, extra windows are
> needed to cover the half-blocks from A480 to ABFF, 15 new codes. If I
> were being miserly, I wouldn't cover A500-A5FF.

In November 2010 I proposed updating the SCSU spec to do just that.
(There were a couple of other suggestions in the proposal, all
severable.) Reaction to the proposal was not encouraging:

http://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0005.html
http://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0008.html
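
For readers who want to check the "15 new codes" figure in the quoted text, here is a minimal sketch in Python. The offset table in `window_offset` is my recollection of the dynamic-window offset bytes in UTS #6 and should be verified against the spec; the helper names `window_offset` and `half_blocks` are made up for this illustration.

```python
# Sketch of the SCSU dynamic-window offset table (as recalled from UTS #6;
# an assumption to verify against the spec), used to see which half-blocks
# a single offset byte can point at.

HALF_BLOCK = 0x80

def window_offset(x):
    """Return the window start for offset byte x, or None if x is reserved."""
    if 0x01 <= x <= 0x67:
        return x * HALF_BLOCK              # U+0080 .. U+3380
    if 0x68 <= x <= 0xA7:
        return x * HALF_BLOCK + 0xAC00     # U+E000 .. U+FF80
    fixed = {0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
             0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60}
    return fixed.get(x)                    # 0x00 and 0xA8..0xF8 are reserved

def half_blocks(first, last):
    """Start addresses of every half-block overlapping first..last inclusive."""
    start = first - first % HALF_BLOCK
    return list(range(start, last + 1, HALF_BLOCK))

covered = {window_offset(x) for x in range(0x100)} - {None}
gap = half_blocks(0xA480, 0xABFF)
print(len(gap))                                # 15 half-blocks, one code each
print(any(start in covered for start in gap))  # False: none is reachable today
```

Under that table, nothing between U+3400 and U+DFFF can be the start of a dynamic window, which is why the range from A480 to ABFF needs new offset codes at all.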

> SCSU doesn't work well with large syllabaries, especially if they
> include a lot of unused characters within the half-blocks used. Inuit
> suffers badly from this, but still achieves noticeable compression.
> I experimented with compressing Yi transposed to a covered range, and
> found that it achieved something like 10% compression. Yi suffers
> from needing the 8 dynamic windows to be switched between 10
> half-blocks (with occasional excursions to an 11th). If the Yi
> characters had been arranged by tone first and initial consonant
> second, 2 of the half-blocks would never have been used in my sample!

Medium-sized writing systems such as syllabaries, which span more than
one or two 128-character blocks and move among them constantly (not
just for isolated characters), have always been the Achilles' heel of
SCSU. You can't realistically encode something like Canadian Syllabics
on its own using 7 bits per character, or even 8. The best one can do is
use windows and hope that window switching can be kept to a minimum. As
you noted with Yi, how successful that is depends on character frequency
and on whether the "common" characters are concentrated in one or two
half-blocks or scattered across many.
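
To make the cost of window switching concrete, here is a rough Python sketch. It is a deliberately simplified model, not a real SCSU encoder: it assumes aligned 128-character windows, one byte per ASCII character, one byte per character in the active window, one extra byte to select an already-defined window, two extra bytes to define a new one, and least-recently-used replacement among 8 windows. It ignores static windows, quoting and Unicode mode entirely. The function name `estimate_bytes` is made up for this illustration.

```python
from collections import OrderedDict

def estimate_bytes(text, windows=8):
    """Rough byte count for `text` under the simplified window model
    described above (an assumption, not the actual SCSU algorithm)."""
    defined = OrderedDict()   # half-block start -> None, kept in LRU order
    active = None             # half-block of the currently selected window
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            total += 1                        # ASCII passes through
            continue
        block = cp & ~0x7F                    # start of the char's half-block
        if block != active:
            if block in defined:
                total += 1                    # select an existing window
            else:
                total += 2                    # define a window: tag + offset
                if len(defined) >= windows:
                    defined.popitem(last=False)   # drop least recently used
            active = block
        defined[block] = None
        defined.move_to_end(block)            # mark as most recently used
        total += 1                            # the character itself
    return total

# Ten characters from one half-block versus ten alternating between two.
print(estimate_bytes("\uA500" * 10))          # 12
print(estimate_bytes("\uA500\uA580" * 5))     # 22
```

In this model, text that cycles through 10 half-blocks with only 8 windows redefines a window on every character and costs 3 bytes per character, worse than plain UTF-16. A real encoder has other options at that point, so the numbers are only meant to show why the distribution of common characters matters so much.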

The design goal of SCSU was to encode text about as efficiently as in
legacy encodings. For small alphabetic scripts, the examples were the
numerous 8-bit encodings for Latin and Cyrillic and Greek, as well as
things like ARMSCII and ISCII. Unicode mode was meant for really large
scripts like Han and precomposed Hangul, where 16 bits per character was
considered acceptable (and better than UTF-8). The design goal was met,
but medium-sized scripts (with no legacy encodings to compete against)
didn't fare so well. There is no mechanism in SCSU to encode a character
in a non-integral number of bytes, and that's probably good; such a
mechanism would have made SCSU, already criticized for its complexity,
much more complex.

Note that most of the above applies to BOCU-1 as well, for what it's
worth.

> Vai A500-A63F fits in 3 half-blocks, and I would expect non-Vai
> characters in it to be in static blocks. Given how well Yi performed,
> I expect Vai to benefit from SCSU.

It does benefit by comparison to UTF-8. Addition of window offset bytes
to point to this area would help further, but see "not encouraging"
above.
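
The arithmetic behind the Vai comparison can be sketched in a few lines of Python. This only counts half-blocks and bytes per character in the standard encodings; it does not run SCSU. My assumption is that, with no offset byte pointing into this area, an encoder falls back on Unicode mode at roughly two bytes per character.

```python
HALF_BLOCK = 0x80
VAI_FIRST, VAI_LAST = 0xA500, 0xA63F   # range quoted above

# Distinct half-blocks touched by the Vai range.
vai_half_blocks = sorted({cp // HALF_BLOCK * HALF_BLOCK
                          for cp in range(VAI_FIRST, VAI_LAST + 1)})

# Bytes per Vai character in UTF-8 and UTF-16.
utf8_len = len(chr(VAI_FIRST).encode("utf-8"))
utf16_len = len(chr(VAI_FIRST).encode("utf-16-be"))

print([hex(b) for b in vai_half_blocks])   # ['0xa500', '0xa580', '0xa600']
print(utf8_len, utf16_len)                 # 3 2
```

With windows available for those three half-blocks, most Vai characters could drop to about one byte each plus switching overhead, which is the further gain referred to above.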

> Has anyone investigated the performance of SCSU with Cuneiform or
> Egyptian Hieroglyphics? It might achieve better than 50% compression!
> A fair comparison of Egyptian Hieroglyphics depends on the mark-up
> used, for Unicode on its own does not enable one to write reasonable
> Middle Egyptian.

If you have realistic samples of text in these scripts that you could
send (privately), I could experiment. Most of my samples for
experimentation in compression have lately come from the UDHR in Unicode
project.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Received on Sat Apr 28 2012 - 23:32:33 CDT
