Re: SCSU/BOCU-1 Compressibility of the Yi syllabary

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Jul 15 2005 - 14:44:01 CDT

Next message: Peter Constable: "RE: [indic] Gurmukhi Bindi/Tippi Positioning"

Previous message: Michael Everson: "Happy Rosetta Stone Day"
In reply to: Doug Ewell: "Re: SCSU/BOCU-1 Compressibility of the Yi syllabary"
Next in thread: Doug Ewell: "Re: SCSU/BOCU-1 Compressibility of the Yi syllabary"
Reply: Doug Ewell: "Re: SCSU/BOCU-1 Compressibility of the Yi syllabary"

Doug Ewell wrote:

> Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:

>> Has there been any investigation of how badly the Yi syllabary would
>> compress under SCSU if dynamic windows were available for it? Actual
>> BOCU-1 results might give a good indication. With only 0x4C7

Self-correction: 0x48D *syllables*.

>> syllables, Yi might perform better than one might expect.

> SCSU does not allow the setting of a dynamic window anywhere within the
> Yi range (U+A000 through U+A4C6). The only way to encode Yi text in
> SCSU is to use "Unicode mode," encoding each character in 2 bytes (MSB,
> LSB).

> It's possible that some sequences of Yi might benefit from being
> encodable in a dynamic window, but since it is not possible to do so,
> the point is moot.

Not entirely; see further below.

I actually got an answer to my question through a straightforward
mathematical analysis. The model I used assumes that the text is covered by
N SCSU windows (all in the BMP), that characters from each of the windows
are equally likely to occur, and that the sequences are totally random. (I
think this gives a pessimistic estimate, though not the worst case. The
results are better than I expected.) I then applied an SCSU compressor design
with single-character look-ahead which, where possible, uses only single-byte
mode. It achieves the following asymptotic bytes-per-character ratios:

N = 2 => 1.33
N = 3 => 1.50
N = 4 => 1.60
...
N = 7 => 1.75
N = 8 => 1.78
N = 9 => 1.90 (Window definitions are needed throughout the text for N>8)
N = 10 => 2.00
N = 11 => 2.08
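
The figures for N <= 8 can be checked with a rough simulation. The sketch
below is a reconstruction, not the analysis actually used: it assumes that
all N windows are predefined, that a character outside the active window
costs a tag byte plus a data byte, and that the compressor switches windows
(SCn) only when the following character lies in the same window, quoting
(SQn) otherwise. The closed form 2N/(N+1) is inferred from the listed
values, not stated in the message, and the names closed_form and simulate
are made up for illustration. The N > 8 cases, where windows must be
redefined, are not modelled.

```python
import random


def closed_form(n_windows: int) -> float:
    """Ratio 2N/(N+1); inferred from the listed values, N <= 8 only."""
    return 2 * n_windows / (n_windows + 1)


def simulate(n_windows: int, n_chars: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of bytes per character for N predefined windows."""
    rng = random.Random(seed)
    # Only the window index of each character matters in this model.
    windows = [rng.randrange(n_windows) for _ in range(n_chars)]
    active = windows[0]  # assumption: the first window is already selected
    total = 0
    for i, w in enumerate(windows):
        if w == active:
            total += 1  # one data byte in the active window
            continue
        total += 2  # one tag byte plus one data byte
        if i + 1 < n_chars and windows[i + 1] == w:
            active = w  # SCn: the next character shares this window, so switch
        # otherwise SQn: quote this one character, active window unchanged
    return total / n_chars


if __name__ == "__main__":
    for n in (2, 3, 4, 7, 8):
        print(n, round(closed_form(n), 2), round(simulate(n), 2))
```

Under these assumptions the simulated ratios should settle close to the
closed form as the text length grows.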

For the Yi syllabary, N = 10, which on this model only breaks even with the
2 bytes per character of Unicode mode. However, the various windows are not
equally common, so the 'effective value' of N will be less than 10, and I
believe single-byte mode windows would work better than Unicode mode.

>> 2) Any leakage of ASCII into Yi in single-byte mode would result in
>> the ASCII being encoded at one byte per character, rather than two
>> bytes per character.

> Sufficiently long sequences of ASCII characters might justify a switch
> out of Unicode mode into single-byte mode, where the compression thus
> gained would be justified.

I was thinking of very short leakages, such as a single SPACE (U+0020) or
<CR><LF>. Such leakages offer no savings if 'Unicode' (should really be
'UTF-16[BE]') mode is appropriate on either side.
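
A small illustration of the cost of such leakages, treating Unicode mode as
plain UTF-16BE. This ignores the tag byte needed to enter Unicode mode, and
the helper name utf16be_len is made up for the example.

```python
def utf16be_len(text: str) -> int:
    """Bytes needed when every character is written as UTF-16BE code units."""
    return len(text.encode("utf-16-be"))


# Two Yi syllables (U+A000, U+A001) around a single SPACE.
yi_with_space = "\ua000 \ua001"
print(utf16be_len(yi_with_space))  # 6: the SPACE costs 2 bytes, like its neighbours
print(utf16be_len("\r\n"))         # 4: <CR><LF> costs 2 bytes each
```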

What made me interested in the issue is that Syloti Nagri, a Brahmi script,
is encoded at A800-A82F, slap in the middle of the ideographic range. Now
there are three interesting quotes from the SCSU technical report (
http://www.unicode.org/reports/tr6/tr6-4.html Version 3.6(?)):

1. "The first part of the Window Offset Table defines half blocks covering
the alphabetic scripts, symbols and the private use area."

This is only a major issue for the BMP; all half-blocks in the supplementary
planes are covered by the SDX and UDX window-defining tags. With this minor
cavil, the statement was probably once true - however, the gap from Yi to
Hangul syllables (U+A4C7 to U+AC00) is now mostly road-mapped (
http://www.unicode.org/roadmaps/bmp/ ) for small scripts.

2. Of the codes used with the SDn and UDn tags, "A8..F8 [are] reserved for
future use".

Perhaps this future is approaching. As single-byte mode does compress the
Yi syllabary, it seems reasonable to suggest using:

Codes A8..B7 for half-blocks with starts from U+A400 to U+AB80.
Code B8 for the half-block starting at U+D780. (The road map shows unused
space here.)
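
As a sketch of how this suggestion would sit beside the existing window
offset table: the ranges 01..67, 68..A7 and F9..FF below follow the SCSU
report, while the branch guarded by the proposed flag encodes only the
suggestion above and is not part of any published SCSU version. The
function name window_offset and the flag are made up for illustration.

```python
# Fixed offsets assigned to codes F9..FF in the SCSU window offset table.
FIXED_OFFSETS = {
    0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
    0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60,
}


def window_offset(code: int, proposed: bool = False) -> int:
    """Start of the 128-character window selected by an SDn/UDn code byte."""
    if 0x01 <= code <= 0x67:
        return code * 0x80              # half-blocks U+0080..U+3380
    if 0x68 <= code <= 0xA7:
        return code * 0x80 + 0xAC00     # half-blocks U+E000..U+FF80
    if code in FIXED_OFFSETS:
        return FIXED_OFFSETS[code]
    if proposed:                        # hypothetical extension from this message
        if 0xA8 <= code <= 0xB7:
            return 0xA400 + (code - 0xA8) * 0x80   # U+A400..U+AB80
        if code == 0xB8:
            return 0xD780
    raise ValueError(f"reserved window code: {code:#04x}")


# Syloti Nagri (U+A800) would fall in the window selected by code B0.
print(hex(window_offset(0xB0, proposed=True)))
```

With proposed left at False, codes A8..F8 still raise an error, which is how
an existing decoder would be expected to treat them.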

The biggest objection will, of course, be that existing SCSU decoders will
not recognise these codes. The SCSU is not something to be updated
frequently - which means any changes may have to be *big*.

If one rejects the idea of having SCSU windows for Yi syllables, one might
want to have the first of these new windows start at U+04C0.

3. 1. "The Standard Compression Scheme for Unicode will:
a. ...
b. approximate the storage size of traditional character sets"

How does this apply to Egyptian Hieroglyphs (
http://www.unicode.org/roadmaps/smp/ )? I believe the analogy should be
with 2-byte character encodings such as Shift-JIS. However, for the
supplementary planes, if SCSU cannot compress well in single-byte mode,
which it cannot for large scripts, the usage will approach 4 bytes per
character. (This is a slight over-estimate - I'd make a stab at 3.7 for
Egyptian Hieroglyphs using single-byte mode.) One method would be to have
windows that are half a plane wide. I'd suggest the same semantics as the
present windows, but with values encoded by two bytes, not one byte. The
'extended window tags' (SDX/UDX) would not be used to define half-plane
windows. The leading byte would have its high bit set.

This leads to:

Codes B9..DA for half-planes starting at 0000, 8000, 10000,... 108000.
Code F8 for half-plane starting 16000 (basically Tangut ideographs, but
allowing a lot of slop in placement.)
Code F7 for half-plane starting 2E00 (basically CJK, but not perfect).
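
The code assignment above can be sketched as follows. The two-byte form in
encode_in_half_plane is only one possible reading of "values encoded by two
bytes" with "the leading byte [having] its high bit set": a 15-bit offset
into the window, by analogy with the 7-bit offset used for the present
windows. Both function names are made up, and none of this exists in
published SCSU.

```python
def half_plane_start(code: int) -> int:
    """Start of the hypothetical half-plane window selected by a code byte."""
    if 0xB9 <= code <= 0xDA:
        return (code - 0xB9) * 0x8000   # 0000, 8000, 10000, ... 108000
    if code == 0xF7:
        return 0x2E00                   # roughly CJK
    if code == 0xF8:
        return 0x16000                  # roughly Tangut
    raise ValueError(f"not a half-plane code: {code:#04x}")


def encode_in_half_plane(cp: int, start: int) -> bytes:
    """Assumed two-byte form: high bit set, then a 15-bit offset into the window."""
    delta = cp - start
    if not 0 <= delta < 0x8000:
        raise ValueError("code point lies outside this half-plane window")
    return bytes((0x80 | (delta >> 8), delta & 0xFF))


# U+13000, the roadmapped start of Egyptian Hieroglyphs, in the window at 10000.
start = half_plane_start(0xBB)
print(len(encode_in_half_plane(0x13000, start)))   # 2 bytes in the window
print(len("\U00013000".encode("utf-16-be")))       # 4 bytes as a surrogate pair
```

On this reading a large supplementary-plane script would cost about 2 bytes
per character once its window is selected, in line with the analogy to
2-byte character encodings.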

The idea of the special code for CJK is that it then allows 3-byte encodings
(SQn x y) for occasional characters in the supplementary plane and free
mixing of ASCII and CJK. (Unicode mode does not have quote tags, perhaps
because they did not make sense for UCS-2.) The codes F7 and F8 may be
regarded as luxuries. Having a half-plane window at 8000 makes it possible
to avoid the UQU tag that Unicode mode needs for many PUA characters; that
tag is bad news for any SCSU user making heavy use of them.

My *suggestion* leaves DB to F6 free for future expansion. If Unicode
becomes universal rather than merely global in our grandchildren's time,
they may need these expansion slots!

Note that there are two separate ideas here - mundane extension of windows
to the non-ideographic Yi-Hangul gap, and half-plane windows.

Richard.

