Re: SCSU/BOCU-1 Compressibility of the Yi syllabary

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Jul 15 2005 - 14:44:01 CDT

    Doug Ewell wrote:

    > Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:

    >> Has there been any investigation of how badly the Yi syllabary would
    >> compress under SCSU if dynamic windows were available for it? Actual
    >> BOCU-1 results might give a good indication. With only 0x4C7

    Self-correction: 0x48D *syllables*.

    >> syllables, Yi might perform better than one might expect.

    > SCSU does not allow the setting of a dynamic window anywhere within the
    > Yi range (U+A000 through U+A4C6). The only way to encode Yi text in
    > SCSU is to use "Unicode mode," encoding each character in 2 bytes (MSB,
    > LSB).

    > It's possible that some sequences of Yi might benefit from being
    > encodable in a dynamic window, but since it is not possible to do so,
    > the point is moot.

    Not entirely; see further below.

    I actually got an answer to my question through a straightforward
    mathematical analysis. The model I used was that characters can be covered
    by N SCSU windows (all in the BMP), and characters from the windows are
    equally likely to occur and the sequences are totally random. (I think this
    gives a pessimistic estimate, though not worst case. The results are better
    than I expected.) I then applied a single-character look-ahead SCSU
    compressor design which, where possible, uses only single-byte mode and
    achieves the following asymptotic bytes-per-character ratios:

    N = 2 => 1.33
    N = 3 => 1.50
    N = 4 => 1.60
    ...
    N = 7 => 1.75
    N = 8 => 1.78
    N = 9 => 1.90 (Window definitions are needed throughout the text for N>8)
    N = 10 => 2.00
    N = 11 => 2.08
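
    The figures for N <= 8 can be reproduced with a short sketch. This is a
    reconstruction, not necessarily the derivation used in the post: it
    assumes a character in the active window costs 1 byte, any other
    character costs 2 bytes (quote or window switch), and the one-character
    look-ahead picks whichever of the two leaves the right window active.
    That gives a two-state Markov chain whose long-run cost is 2N/(N+1).
    The names expected_ratio and simulate_ratio are illustrative, and the
    N > 8 rows (which need window redefinitions) are not modelled.

```python
import random

def expected_ratio(n):
    """Closed form 2N/(N+1); a reconstruction, valid only for N <= 8."""
    return 2 * n / (n + 1)

def simulate_ratio(n, length=200_000, seed=1):
    """Monte Carlo check: random window indices, greedy 1-char look-ahead."""
    rng = random.Random(seed)
    wins = [rng.randrange(n) for _ in range(length)]
    active = wins[0]          # assume the first window is already selected
    total = 0
    for i, w in enumerate(wins):
        if w == active:
            total += 1        # character is in the active window: 1 byte
        else:
            total += 2        # quote (SQn) or switch (SCn): 2 bytes either way
            nxt = wins[i + 1] if i + 1 < length else None
            if nxt != active:
                active = w    # switch; otherwise quote and keep the old window
    return total / length

# Values quoted in the post for N <= 8.
POSTED = {2: 1.33, 3: 1.50, 4: 1.60, 7: 1.75, 8: 1.78}

if __name__ == "__main__":
    for n, posted in POSTED.items():
        print(n, posted, round(expected_ratio(n), 2), round(simulate_ratio(n), 2))
```

    Under these assumptions the closed form agrees with the posted values
    to two decimal places, and the simulation lands close to it.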

    For the Yi syllabary, N = 10, so I believe single-byte mode windows would
    work better than Unicode mode. The various windows are not equally common,
    and so the 'effective value' of N will be less than 10.

    >> 2) Any leakage of ASCII into Yi in single-byte mode would result in
    >> the ASCII being encoded at one byte per character, rather than two
    >> bytes per character.

    > Sufficiently long sequences of ASCII characters might justify a switch
    > out of Unicode mode into single-byte mode, where the compression thus
    > gained would be justified.

    I was thinking of very short leakages, such as a single SPACE (U+0020) or
    <CR><LF>. Such leakages offer no savings if 'Unicode' (should really be
    'UTF-16[BE]') mode is appropriate on either side.

    What made me interested in the issue is that Syloti Nagri, a Brahmi script,
    is encoded at A800-A82F, slap in the middle of the ideographic range. Now
    there are three interesting quotes from the SCSU technical report (
    http://www.unicode.org/reports/tr6/tr6-4.html Version 3.6(?)):

    1. "The first part of the Window Offset Table defines half blocks covering
    the alphabetic scripts, symbols and the private use area."

    This is only a major issue for the BMP; all half-blocks in the supplementary
    planes are covered by the SDX and UDX window-defining tags. With this minor
    cavil, the statement was probably once true - however, the gap from Yi to
    Hangul syllables (U+A4C7 to AC00) is now mostly road-mapped (
    http://www.unicode.org/roadmaps/bmp/ ) for small scripts.

    2. Of the codes used with the SDn and UDn tags, "A8..F8 [are] reserved for
    future use".

    Perhaps this future is approaching. As single-byte mode does compress the
    Yi syllabary, it seems reasonable to suggest using:

    Codes A8..B7 for Half-blocks with starts from U+A400 to U+AB80
    Code B8 for Half-block starting D780. (The road map shows unused space
    here.)
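
    As a rough illustration of the suggested assignment, the sketch below
    maps each proposed code to its half-block start, assuming the sixteen
    codes A8..B7 step through consecutive 0x80-wide half-blocks. The
    function name proposed_half_block_start is hypothetical; none of this
    is part of SCSU as specified.

```python
HALF_BLOCK = 0x80  # size of an SCSU dynamic window (128 code points)

def proposed_half_block_start(code):
    """Start of the half-block for a proposed code (illustration only)."""
    if 0xA8 <= code <= 0xB7:
        # Sixteen consecutive half-blocks, U+A400 through U+AB80.
        return 0xA400 + (code - 0xA8) * HALF_BLOCK
    if code == 0xB8:
        # Single extra half-block in the unused space after Hangul.
        return 0xD780
    raise ValueError(f"code {code:#04x} is not in the proposed range")

if __name__ == "__main__":
    for code in (0xA8, 0xB7, 0xB8):
        print(f"{code:02X} -> U+{proposed_half_block_start(code):04X}")
```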

    The biggest objection will, of course, be that existing SCSU decoders will
    not recognise these codes. The SCSU is not something to be updated
    frequently - which means any changes may have to be *big*.

    If one rejects the idea of having SCSU windows for Yi syllables, one might
    want to have the first of these new windows start at U+04C0.

    3. "The Standard Compression Scheme for Unicode will:
       a. ...
       b. approximate the storage size of traditional character sets"

    How does this apply to Egyptian Hieroglyphs (
    http://www.unicode.org/roadmaps/smp/ )? I believe the analogy should be
    with 2-byte character encodings such as Shift-JIS. However, for the
    supplementary planes, if SCSU cannot compress well in single-byte mode,
    which it cannot for large scripts, the usage will approach 4 bytes per
    character. (This is a slight over-estimate - I'd make a stab at 3.7 for
    Egyptian Hieroglyphs using single-byte mode.) One method would be to have
    windows that are half a plane wide. I'd suggest the same semantics as the
    present windows, but with values encoded by two bytes, not one byte. The
    'extended window tags' (SDX/UDX) would not be used to define half-plane
    windows. The leading byte would have its high bit set.

    This leads to:

    Codes B9..DA for half-planes starting at 0000, 8000, 10000,... 108000.
    Code F8 for half-plane starting 16000 (basically Tangut ideographs, but
    allowing a lot of slop in placement.)
    Code F7 for half-plane starting 2E00 (basically CJK, but not perfect).
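
    The half-plane arithmetic can be sketched as follows. The code-to-start
    mapping follows the list above (34 codes for the 34 half-planes of the
    17 planes, plus F7 and F8). The two-byte layout in
    encode_in_half_plane is an assumption: the post says only that values
    take two bytes with the high bit of the leading byte set, so the
    big-endian 15-bit offset used here is one possible reading. Both
    function names are hypothetical.

```python
HALF_PLANE = 0x8000  # 32768 code points, i.e. a 15-bit offset

def proposed_half_plane_start(code):
    """Start of the half-plane window for a proposed code (illustration only)."""
    if 0xB9 <= code <= 0xDA:
        return (code - 0xB9) * HALF_PLANE   # 0000, 8000, ..., 108000
    if code == 0xF7:
        return 0x2E00                       # roughly CJK
    if code == 0xF8:
        return 0x16000                      # roughly Tangut
    raise ValueError(f"code {code:#04x} is not in the proposed range")

def encode_in_half_plane(cp, start):
    """Assumed layout: 15-bit offset, big-endian, lead byte has high bit set."""
    off = cp - start
    if not 0 <= off < HALF_PLANE:
        raise ValueError("code point is outside this half-plane window")
    return bytes([0x80 | (off >> 8), off & 0xFF])

if __name__ == "__main__":
    start = proposed_half_plane_start(0xBB)        # half-plane at 10000
    print(hex(start), encode_in_half_plane(0x13000, start).hex())
```

    With a window like this active, a character such as U+13000 would cost
    two bytes rather than the three or four otherwise needed, which is the
    saving the proposal is after.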

    The idea of the special code for CJK is that it then allows 3-byte encodings
    (SQn x y) for occasional characters in the supplementary plane and free
    mixing of ASCII and CJK. (Unicode mode does not have quote tags, perhaps
    because they did not make sense for UCS-2.) The codes F7 and F8 may be
    regarded as luxuries. Having a half-plane window at 8000 makes it possible
    to avoid the UQU tag needed in Unicode mode for many PUA characters, which
    is bad news for any SCSU-user making heavy use of them.

    My *suggestion* leaves DB to F6 free for future expansion. If Unicode
    becomes universal rather than merely global in our grandchildren's time,
    they may need these expansion slots!

    Note that there are two separate ideas here - mundane extension of windows
    to the non-ideographic Yi-Hangul gap, and half-plane windows.

    Richard.


