Re: Compression through normalization

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Dec 04 2003 - 11:39:46 EST

    Just to clear up some possible misconceptions that I think may have
    developed:

    This thread started when Philippe Verdy mentioned the possibility of
    converting certain sequences of Unicode characters to a *canonically
    equivalent sequence* to improve compression. An example was converting
    Korean text, encoded with individual jamos, to a precomposed syllable or
    a combination of LV syllables plus T jamos.
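Both compositions described above can be observed directly with Python's standard `unicodedata` module as an illustrative sketch (the module is used here only to demonstrate canonical equivalence, not as a proposed compressor):

```python
import unicodedata

# Decomposed Korean syllable: L jamo + V jamo + T jamo
jamos = "\u1100\u1161\u11A8"   # CHOSEONG KIYEOK + JUNGSEONG A + JONGSEONG KIYEOK

# NFC composes the three jamos into one precomposed syllable, U+AC01
precomposed = unicodedata.normalize("NFC", jamos)
print(len(jamos), len(precomposed))   # 3 code points become 1

# An LV syllable followed by a T jamo composes the same way under NFC
lv_plus_t = "\uAC00\u11A8"            # U+AC00 (LV syllable GA) + JONGSEONG KIYEOK
print(unicodedata.normalize("NFC", lv_plus_t) == precomposed)
```

The composition is algorithmic (Hangul syllables decompose and compose by formula), which is what makes this class of text an attractive target for a normalizing compressor.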

    This type of conversion seems to be permissible under conformance clause
    C10, which states (paraphrasing here) that a process may replace a given
    character sequence by a canonical-equivalent sequence, and still claim
    not to have changed the interpretation of that sequence.

    My question was whether a Unicode text compressor could legitimately
    convert text to a different canonical-equivalent sequence for purposes
    of improving compression, without violating users' expectations of
    so-called "lossless" compression. Some list members pointed out that
    the checksum of the compressed-and-decompressed text would not match the
    original, and wondered about possible security concerns.
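The checksum objection is easy to demonstrate: two canonically equivalent strings generally encode to different byte sequences, so any byte-level digest of the round-tripped text will differ from the original's. A minimal illustration:

```python
import hashlib
import unicodedata

original = "cafe\u0301"   # "café" with a combining acute accent (decomposed form)
normalized = unicodedata.normalize("NFC", original)   # "café" with precomposed U+00E9

# Canonically equivalent text, but different UTF-8 bytes, hence different digests
h1 = hashlib.sha256(original.encode("utf-8")).hexdigest()
h2 = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
print(h1 == h2)
```

A user who verifies "lossless" compression by comparing checksums would see a mismatch even though, per C10, the interpretation of the text is unchanged.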

    I've been waiting to see if any UTC members, or other experts in
    conformance or compression issues, had anything to say about this. So
    far, the only such response has been from Mark Davis, who said that "a
    compressor can normalize, if (a) when decompressing it produces NFC, and
    (b) it advertises that it normalizes."
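Mark Davis's two conditions can be sketched as follows, with zlib standing in for whichever compressor is actually used (the function names and the choice of zlib are illustrative assumptions, not anything from the thread):

```python
import unicodedata
import zlib

def compress(text: str) -> bytes:
    # Condition (a): normalize to NFC before compressing, so that
    # decompression necessarily produces NFC output.
    return zlib.compress(unicodedata.normalize("NFC", text).encode("utf-8"))

def decompress(data: bytes) -> str:
    # Output is NFC by construction; condition (b) is that the tool
    # documents ("advertises") this behavior to its users.
    return zlib.decompress(data).decode("utf-8")

text = "\u1100\u1161\u11A8"   # decomposed jamos
round_tripped = decompress(compress(text))
print(round_tripped == text)                                # bytes changed
print(round_tripped == unicodedata.normalize("NFC", text))  # canonically equivalent
```

Note that this sketch normalizes on the *compression* side, so an unmodified decompressor suffices; the round trip is lossless only up to canonical equivalence.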

To clarify what I am, and am not, looking for:

    (1) I am interested in the applicability of C10 to EXISTING compression
    techniques, such as for SCSU or BOCU-1, or for general-purpose
    algorithms like Huffman and LZ. Any approach that requires existing
    *decompressors* to be modified in order to undo the new transformation
    is NOT of interest. That amounts to inventing a new compression scheme.

    (2) I am NOT interested in inventing a new normalization form, or any
    variants on existing forms. Any approach that involves compatibility
    equivalences, ignores the Composition Exclusions table, or creates
    equivalences that do not exist in the Unicode Character Database (such
    as "U+1109 + U+1109 = U+110A") is NOT of interest. That amounts to
    unilaterally extending C10, which may already be too liberal to be
    applied to compression.
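The "U+1109 + U+1109 = U+110A" example is a good litmus test: U+110A (SSANGSIOS) has no canonical decomposition in the UCD, so no normalization form equates the two sequences, and a compressor that substituted one for the other would be going beyond C10. This can be checked directly:

```python
import unicodedata

double_sios = "\u1109\u1109"   # CHOSEONG SIOS + CHOSEONG SIOS
ssangsios = "\u110A"           # CHOSEONG SSANGSIOS

# U+110A has no decomposition mapping in the UCD, so none of the four
# normalization forms maps either sequence onto the other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, double_sios) == ssangsios)
```

Every comparison fails, confirming that the equivalence exists only by visual analogy, not in the Unicode Character Database.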

    Note that (1) and (2) are closely related.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/



    This archive was generated by hypermail 2.1.5 : Thu Dec 04 2003 - 12:37:32 EST