Re: Compression through normalization

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Dec 04 2003 - 17:49:05 EST

    Mark said:

    > The operations of compression followed by decompression can conformantly produce
    > any text that is canonically equivalent to the original without purporting to
    > modify the text. (How the internal compressed format is determined is completely
    > arbitrary - it could NFD, compress, decompress, NFC; swap alternate bits; remap
    > modern jamo and LV's to a contiguous range, BOCU-1 it; whatever). In practice,
    > if a compressor does not produce codepoint-identical text, it should produce NFC
    > (not just any canonically equivalent text), and should document that it does so.

    Perhaps, to clear everyone's thinking here, it might help
    to paraphrase what Mark said a bit further:

    The operations of XXX followed by YYY can conformantly produce
    any text that is canonically equivalent to the original while purporting
    not to modify the interpretation of the text.

    [For "operations of XXX followed by YYY" feel free to substitute anything
    you like. This is not about compression per se, but is the fundamental
    meaning of canonical equivalence. If the resultant output text is
    *canonically equivalent* to the original text, then the process has
    not modified the *interpretation* of the text. Note that I expanded Mark's
    formulation slightly -- his was still a bit too telegraphic.

    It may, on the other hand, have *changed* the text, of course. Canonical
    equivalents may be shorter or longer, and consist of different code
    point sequences.]
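
    To make that concrete, here is a minimal sketch in Python (using
    the standard unicodedata module; the particular character chosen
    is just an illustration). U+00C5 and the sequence <U+0041, U+030A>
    are different code point sequences, yet canonically equivalent:

        import unicodedata

        composed   = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
        decomposed = "A\u030A"  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

        # The text differs: the code point sequences are not identical...
        assert composed != decomposed

        # ...but the interpretation does not: both normalize identically.
        assert (unicodedata.normalize("NFD", composed)
                == unicodedata.normalize("NFD", decomposed))
        assert unicodedata.normalize("NFC", decomposed) == composed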

    How the data format following operation XXX and preceding YYY is determined
    is completely arbitrary - it could be blargle or bleep or flassiwary;
    swap alternate bits; remap fleebert to whazzit; compress it; whatever.

    In *practice* [note this is a recommendation, and not a conformance
    requirement], if a text operation produces canonically equivalent
    text which is not codepoint-identical, it *should* produce a
    normalized form of the text and should document that it does so.
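
    As a purely hypothetical illustration of that recommendation, here
    is a Python sketch of a compressor whose (arbitrary) internal
    format is deflated NFD, and whose documented output form is NFC:

        import unicodedata
        import zlib

        def compress(text: str) -> bytes:
            # The internal format is arbitrary: here, NFD then deflate.
            return zlib.compress(
                unicodedata.normalize("NFD", text).encode("utf-8"))

        def decompress(blob: bytes) -> str:
            # The output is NFC: canonically equivalent to the original,
            # though not necessarily codepoint-identical to it.
            return unicodedata.normalize(
                "NFC", zlib.decompress(blob).decode("utf-8"))

        # Round trip: the text may change; its interpretation does not.
        assert decompress(compress("A\u030A")) == "\u00C5"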

    Does that help any? This really is not about compression at all --
    it is about understanding what the conformance requirements of
    the standard are.

    Canonical equivalence is about not modifying the interpretation
    of the text. That is different from considerations about not
    changing the text, period.

    If some process using text is sensitive to *any* change in the
    text whatsoever (CRC-checking, any form of digital signing,
    memory allocation), then, of course, *any* change to the text,
    including any normalization, will make a difference.

    If some process using text is sensitive to the *interpretation* of
    the text, i.e. it is concerned about the content and meaning of
    the letters involved, then normalization, to forms NFC or NFD,
    which only involve canonical equivalences, will *not* make a difference.
    Or to be more subtle about it, it might make a difference, but it
    is nonconformant to claim that a process which claims it does not
    make a difference is nonconformant.

    If you can parse that last sentence, then you are well on the
    way to understanding the Tao of Unicode.
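
    The contrast is easy to demonstrate (again a Python sketch, with
    SHA-256 standing in for any byte-sensitive process):

        import hashlib
        import unicodedata

        nfc = unicodedata.normalize("NFC", "\u00C5")  # one code point
        nfd = unicodedata.normalize("NFD", "\u00C5")  # two code points

        # A byte-sensitive process sees every change, normalization included.
        assert (hashlib.sha256(nfc.encode("utf-8")).digest()
                != hashlib.sha256(nfd.encode("utf-8")).digest())

        # An interpretation-sensitive process, comparing normalized forms,
        # sees no difference at all.
        assert unicodedata.normalize("NFC", nfd) == nfc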

    --Ken


