Re: Compression through normalization

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Dec 04 2003 - 12:18:33 EST

    I was much too brief in my note. I should have said:

    The operations of compression followed by decompression can conformantly produce
    any text that is canonically equivalent to the original, without purporting to
    modify the text. (How the internal compressed format is determined is completely
    arbitrary: it could apply NFD, compress, then decompress and apply NFC; swap
    alternate bits; remap modern jamo and LV syllables to a contiguous range and run
    it through BOCU-1; whatever.) In practice, if a compressor does not produce
    codepoint-identical text, it should produce NFC (not just any canonically
    equivalent text), and it should document that it does so.
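
    To make that concrete, here is a minimal sketch of such a compressor, assuming
    Python's standard unicodedata and zlib modules (the function names and the
    choice of zlib are illustrative only, not part of any Unicode compression
    scheme):

        # Sketch only: normalize to NFD before compressing; deliver NFC on
        # decompression.  The internal compressed form is arbitrary.
        import unicodedata
        import zlib

        def compress(text: str) -> bytes:
            nfd = unicodedata.normalize("NFD", text)
            return zlib.compress(nfd.encode("utf-8"))

        def decompress(data: bytes) -> str:
            text = zlib.decompress(data).decode("utf-8")
            # Produce NFC on output, and document that the round trip normalizes.
            return unicodedata.normalize("NFC", text)

    The round trip is not guaranteed to be codepoint-identical to the input, only
    canonically equivalent to it.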

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄ (Sanskrit: "One should wish to be defeated by one's own pupil.")

    ----- Original Message -----
    From: "Doug Ewell" <dewell@adelphia.net>
    To: "Unicode Mailing List" <unicode@unicode.org>
    Sent: Thu, 2003 Dec 04 08:39
    Subject: Re: Compression through normalization

    > Just to clear up some possible misconceptions that I think may have
    > developed:
    >
    > This thread started when Philippe Verdy mentioned the possibility of
    > converting certain sequences of Unicode characters to *canonically
    > equivalent sequences* to improve compression. An example was converting
    > Korean text encoded with individual jamos to precomposed syllables, or
    > to combinations of LV syllables plus T jamos.
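
    For concreteness, that equivalence can be checked with Python's unicodedata
    (the specific code points are illustrative):

        import unicodedata

        jamo = "\u1112\u1161\u11AB"     # CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN
        composed = unicodedata.normalize("NFC", jamo)
        assert composed == "\uD55C"     # U+D55C HANGUL SYLLABLE HAN, precomposed

        # An LV syllable plus a T jamo composes to the same LVT syllable.
        lv_plus_t = "\uD558\u11AB"      # U+D558 HANGUL SYLLABLE HA + JONGSEONG NIEUN
        assert unicodedata.normalize("NFC", lv_plus_t) == "\uD55C"

        # All three spellings are canonically equivalent: they share one NFD form.
        assert unicodedata.normalize("NFD", jamo) == unicodedata.normalize("NFD", composed)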
    >
    > This type of conversion seems to be permissible under conformance clause
    > C10, which states (paraphrasing here) that a process may replace a given
    > character sequence by a canonical-equivalent sequence, and still claim
    > not to have changed the interpretation of that sequence.
    >
    > My question was whether a Unicode text compressor could legitimately
    > convert text to a different canonical-equivalent sequence for purposes
    > of improving compression, without violating users' expectations of
    > so-called "lossless" compression. Some list members pointed out that
    > the checksum of the compressed-and-decompressed text would not match
    > that of the original, and wondered about possible security concerns.
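
    A minimal demonstration of that checksum concern, assuming Python's hashlib,
    unicodedata, and zlib (MD5 stands in for whatever checksum an application
    might use):

        import hashlib, unicodedata, zlib

        original = "\u1112\u1161\u11AB"    # decomposed jamo spelling of U+D55C
        compressed = zlib.compress(unicodedata.normalize("NFD", original).encode("utf-8"))
        roundtrip = unicodedata.normalize(
            "NFC", zlib.decompress(compressed).decode("utf-8"))

        # Canonically equivalent to the original ...
        assert unicodedata.normalize("NFD", original) == unicodedata.normalize("NFD", roundtrip)
        # ... but not codepoint-identical, so a byte-level checksum differs.
        assert hashlib.md5(original.encode("utf-8")).digest() != \
               hashlib.md5(roundtrip.encode("utf-8")).digest()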
    >
    > I've been waiting to see if any UTC members, or other experts in
    > conformance or compression issues, had anything to say about this. So
    > far, the only such response has been from Mark Davis, who said that "a
    > compressor can normalize, if (a) when decompressing it produces NFC, and
    > (b) it advertises that it normalizes."
    >
    > To clarify what I am, and am NOT, looking for:
    >
    > (1) I am interested in the applicability of C10 to EXISTING compression
    > techniques, such as SCSU or BOCU-1, or general-purpose
    > algorithms like Huffman and LZ. Any approach that requires existing
    > *decompressors* to be modified in order to undo the new transformation
    > is NOT of interest. That amounts to inventing a new compression scheme.
    >
    > (2) I am NOT interested in inventing a new normalization form, or any
    > variants on existing forms. Any approach that involves compatibility
    > equivalences, ignores the Composition Exclusions table, or creates
    > equivalences that do not exist in the Unicode Character Database (such
    > as "U+1109 + U+1109 = U+110A") is NOT of interest. That amounts to
    > unilaterally extending C10, which may already be too liberal to be
    > applied to compression.
    >
    > Note that (1) and (2) are closely related.
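
    On point (2), the absence of such an equivalence is easy to check with
    Python's unicodedata: the Unicode Character Database defines no decomposition
    relating U+1109 + U+1109 to U+110A, so no normalization form maps one to the
    other.

        import unicodedata

        two_sios = "\u1109\u1109"    # two HANGUL CHOSEONG SIOS
        ssangsios = "\u110A"         # HANGUL CHOSEONG SSANGSIOS

        # No normalization form treats the two spellings as equivalent;
        # each is left unchanged by all four forms.
        for form in ("NFC", "NFD", "NFKC", "NFKD"):
            assert unicodedata.normalize(form, two_sios) == two_sios
            assert unicodedata.normalize(form, ssangsios) == ssangsios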
    >
    > -Doug Ewell
    > Fullerton, California
    > http://users.adelphia.net/~dewell/


