Re: UTN #31 and direct compression of code points

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Wed May 09 2007 - 17:38:16 CDT

    Philippe Verdy wrote on Wednesday, May 09, 2007 9:01 PM
    Subject: RE: UTN #31 and direct compression of code points

    > Doug Ewell wrote:
    >> Shavian can be encoded extremely efficiently in SCSU: only one byte per
    >> character, plus three bytes of overhead (0B 60 08) at the start of the
    >> stream to set up a dynamic window, and another (01) to quote each U+00B7
    >> "namer dot." I doubt the simplified LZ method presented in UTN #31 can
    >> top this, but of course there's nothing like experimentation.

    > No, I doubt that SCSU will outperform an algorithm based on
    > Lempel-Ziv, even a simplified version. Remember the effect of detecting
    > previous matches: matches encode sequences of "arbitrary" length as
    > very few bytes (complete words, expressions, or common word prefixes and
    > suffixes and common punctuation around them), so you'll get less than one
    > byte per character in typical texts, as for Latin.

    I chose this example carefully. SCSU will use only slightly over one byte
    per character for Shavian. The UTN #31 scheme compresses UTF-16, so the
    possible matches will effectively be keyed by:

    1) Shavian letter (high surrogate, low surrogate)
    2) Inverted Shavian letter (low surrogate, high surrogate)
    3) ASCII punctuation plus space
    4) ASCII space plus the high surrogate
    5) Shavian letter plus ASCII punctuation (low surrogate, ASCII character)
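    To see the five kinds of adjacent code-unit pair listed above, here is a
    minimal Python sketch. It is not the UTN #31 matcher; it only tallies
    which kinds of UTF-16 code unit sit next to each other in a piece of
    text. The function names (utf16_units, kind, pair_kinds) are mine,
    chosen for illustration.

```python
from collections import Counter


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def kind(unit):
    """Classify one UTF-16 code unit (labels are illustrative only)."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low"
    if unit < 0x80:
        return "ascii"
    return "other"


def pair_kinds(text):
    """Count each (kind, kind) pair of adjacent code units in text."""
    units = utf16_units(text)
    return Counter((kind(a), kind(b)) for a, b in zip(units, units[1:]))
```

    For two Shavian letters, a comma, a space and a third letter, the only
    pairs that turn up are (high, low), (low, high), (low, ascii),
    (ascii, ascii) and (ascii, high), which are the five cases above. Every
    Shavian letter has the same high surrogate, D801, so only the low
    surrogate tells one letter from another.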

    Each match will take at least two bytes. A typical match will cover a low
    surrogate followed by a high surrogate, and so be only two code units long.
    Even for a highly redundant phrase such as 'high surrogate, low surrogate',
    I estimate 24 bytes for the UTN #31 algorithm and 24 bytes for SCSU.
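    For anyone who wants to experiment with the SCSU side of the comparison,
    here is a rough Python sketch of the encoding Doug describes. It is not
    a general SCSU encoder. It assumes that 0B 60 08 defines and selects a
    dynamic window at U+10400, and that 01 B7 quotes U+00B7. It handles only
    Shavian letters, printable ASCII and the namer dot. The name
    scsu_shavian is hypothetical.

```python
# Assumption: the SCSU prefix 0B 60 08 sets up a 128-code-point dynamic
# window starting at U+10400, which covers Shavian (U+10450..U+1047F).
WINDOW_START = 0x10400
PREFIX = bytes([0x0B, 0x60, 0x08])


def scsu_shavian(text):
    """Encode Shavian text as in the scheme quoted above (sketch only)."""
    out = bytearray(PREFIX)
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7E:
            # Printable ASCII passes through as a single byte.
            out.append(cp)
        elif WINDOW_START <= cp < WINDOW_START + 0x80:
            # One byte per character inside the active window.
            out.append(0x80 + (cp - WINDOW_START))
        elif cp == 0x00B7:
            # Namer dot: quote byte 01, then B7 (assumed quoting form).
            out += bytes([0x01, 0xB7])
        else:
            raise ValueError(f"U+{cp:04X} is outside the scope of this sketch")
    return bytes(out)


def utf16_size(text):
    """Size in bytes of the uncompressed UTF-16 that UTN #31 starts from."""
    return len(text.encode("utf-16-be"))
```

    The output length is 3 + (number of characters) + (number of namer
    dots), while utf16_size gives four bytes per Shavian letter before any
    matching is done.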

    A UTF-32 version of the algorithm would outperform SCSU on Shavian.

    Richard.


