Re: Data compression

From: Doug Ewell (dewell@adelphia.net)
Date: Sat May 07 2005 - 12:36:30 CDT

    N. Ganesan <naa dot ganesan at gmail dot com> wrote:

    > Can you tell a little more on SCSU. Any pointers, URLs to how it works
    > on texts, say Tamil unicode text? Tamil letters are not conjuncts,
    > something similar in this sense to Latin script of Europe.

    As Philippe said, SCSU is just a specification for encoding sequences of
    Unicode code points. It does not make any difference whether a given
    code point represents a conjunct, or even whether there is a character
    assigned to that code point at all. SCSU does not change the encoding
    model used for Tamil; it simply specifies a different, usually more
    efficient, way of converting Unicode code points into bytes for storage
    and transmission.

    The quick answer is that a text consisting only of characters in the
    Tamil and Basic Latin blocks can be encoded in just one byte per
    character. Tamil characters in the range U+0B80 through U+0BFF are
    encoded using the bytes 80 through FF, while Basic Latin characters
    are encoded using the bytes 00 through 7F. There is an initial
    two-byte sequence to specify that the high range is to be used for
    Tamil, and not, say, Cyrillic or Devanagari.
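
    Here is a minimal sketch in Python of an encoder for exactly this
    restricted case (Basic Latin plus the Tamil block, one dynamic
    window). It is an illustration of the mechanics, not a conforming
    general-purpose SCSU encoder; the meaning of the two tag bytes it
    emits is explained below.

        SD3 = 0x1B          # tag: define dynamic window 3 from the next byte
        TAMIL_INDEX = 0x17  # window index 17 hex: offset U+0B80, the Tamil block
        PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D}  # C0 controls sent as themselves

        def encode_tamil_scsu(text: str) -> bytes:
            out = bytearray([SD3, TAMIL_INDEX])  # define and select the window
            for ch in text:
                cp = ord(ch)
                if cp in PASS_THROUGH or 0x20 <= cp <= 0x7F:
                    out.append(cp)                    # ASCII value, one byte
                elif 0x0B80 <= cp <= 0x0BFF:
                    out.append(0x80 + (cp - 0x0B80))  # byte in the Tamil window
                else:
                    raise ValueError("U+%04X is outside this sketch" % cp)
            return bytes(out)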

    The Tamil word

    வணக்கம்

    is encoded using the bytes

    1B 17 B5 A3 95 CD 95 AE CD

    The initial 1B 17 switches into "Tamil mode," so to speak: 1B is the
    tag that defines a dynamic window from the next byte, and the index
    17 places that window at U+0B80 (17 times 80, in hex), the start of
    the Tamil block. The remaining bytes are simply the low-order byte
    of each Unicode code point. Basic Latin characters (including CR and
    LF, but not all control characters) would be encoded using their
    ASCII values; for example, an ordinary space would be 20.
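
    Decoding this restricted form is just as simple. A matching sketch,
    assuming the stream begins with the same 1B 17 window definition
    (a full SCSU decoder must also treat most bytes below 20 as tags,
    which this sketch ignores):

        def decode_tamil_scsu(data: bytes) -> str:
            if data[:2] != bytes([0x1B, 0x17]):
                raise ValueError("expected the 1B 17 Tamil window definition")
            offset = 0x17 * 0x80  # = 0B80 hex, the base of the window
            chars = []
            for b in data[2:]:
                if b < 0x80:
                    chars.append(chr(b))                    # Basic Latin
                else:
                    chars.append(chr(offset + (b - 0x80)))  # window byte
            return "".join(chars)

        print(decode_tamil_scsu(bytes.fromhex("1B17B5A395CD95AECD")))
        # prints the word above, வணக்கம்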

    If other blocks are used besides Tamil and Basic Latin, there is
    additional overhead to switch between the blocks.

    Philippe Verdy <verdy underscore p at wanadoo dot fr> replied:

    > It could be a valid UTF because it preserves all codepoints in an
    > original string, without even altering its normalization form (so no
    > code point are reordered, even if the original string is not in any
    > normalized form), and also because it still allows encoding invalid
    > code points.

    All text compression schemes must be lossless.

    > But, unlike UTF-8, UTF-16, UTF-32 standard encoding schemes...
    > SCSU does NOT guarantee a unique encoding for the same represented
    > codepoints: there are several alternatives, which allow SCSU
    > compressors to be implemented with simple algorithms, or with more
    > complex algorithms with better compression level;

    This is described in UTN #14. There is a mirror of the relevant
    section at http://users.adelphia.net/~dewell/compression.html#scsu.

    It is possible to build an "OK" compressor or a "really good" compressor
    within the same spec. This is also true for some types of non-text
    compression.
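
    As a hand-worked illustration of those alternatives (mine, not an
    example from the spec): even a one-letter text such as வ (U+0BB5)
    has more than one legal SCSU encoding.

        # Two valid SCSU streams for the same one-letter text, U+0BB5:
        quoted   = bytes([0x0E, 0x0B, 0xB5])  # SQU tag quotes the next
                                              # UTF-16 code unit literally
        windowed = bytes([0x1B, 0x17, 0xB5])  # define the Tamil window,
                                              # then one window byte
        assert quoted != windowed             # different bytes, same text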

    > however the SCSU decompressor is fully predictive and can be parsed
    > into only one valid sequence of codepoints from a valid SCSU
    > compressed stream.

    It would have to be, if compression is to be lossless.

    > This means that you can't check the "equality" of two encoded SCSU
    > streams, without first decompressing them to streams of code points.
    > (You can safely check encoded strings for equality with UTF-8, UTF-16,
    > UTF-32, UTF-EBCDIC, and CESU-8).

    I guess it depends on whether this is a common and desirable thing
    to do. In some cases it might make more sense to decompress the
    streams before comparing anything. The tradeoff is that, in exchange
    for giving up byte-level comparison, a relatively simple SCSU
    encoder can be implemented. The world's greatest SCSU encoder would
    probably have to implement full-text lookahead and language-specific
    predictive algorithms, but thankfully implementers don't have to go
    that far.
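
    If streams do need to be compared, the safe pattern is to compare
    the decoded text. A sketch, with the decoder left as a parameter
    (scsu_decode stands in for any conformant decoder; none ships with
    Python itself):

        def scsu_equal(a: bytes, b: bytes, scsu_decode) -> bool:
            # Decoding is deterministic, so any conformant decoder
            # yields the same text. Equal bytes imply equal text, but
            # unequal bytes prove nothing, as the example above shows.
            return scsu_decode(a) == scsu_decode(b)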

    --
    Doug Ewell
    Fullerton, California
    http://users.adelphia.net/~dewell/
    

