Re: Compression through normalization

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Dec 05 2003 - 13:07:30 EST


    I think you are missing a negative; see below.

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄

    ----- Original Message -----
    From: "Doug Ewell" <dewell@adelphia.net>
    To: "Unicode Mailing List" <unicode@unicode.org>
    Cc: "Kenneth Whistler" <kenw@sybase.com>; <mark.davis@jtcsv.com>
    Sent: Fri, 2003 Dec 05 08:43
    Subject: Re: Compression through normalization

    > Kenneth Whistler <kenw at sybase dot com> wrote:
    >
    > > Canonical equivalence is about not modifying the interpretation of the
    > > text. That is different from considerations about not changing the
    > > text, period.
    > >
    > > If some process using text is sensitive to *any* change in the text
    > > whatsoever (CRC-checking or any form of digital signaturing, memory
    > > allocation), then, of course, *any* change to the text, including any
    > > normalization, will make a difference.
    > >
    > > If some process using text is sensitive to the *interpretation* of the
    > > text, i.e. it is concerned about the content and meaning of the
    > > letters involved, then normalization, to forms NFC or NFD, which only
    > > involve canonical equivalences, will *not* make a difference.
    >
    > All right. I think that is the missing piece I needed.
    >
    > How's this:
    >
    > Compression techniques may optionally replace certain sequences with
    > canonically equivalent sequences to improve efficiency, but *only* if
    > the output of the decompressed text is expected to be
    is not required to be
    > codepoint-for-codepoint equivalent to the original. Whether this is
    > true or not depends on the user and the intended use of the text.
    >
    > Text compression techniques are generally assumed to be "lossless,"
    > meaning that no information -- including meta-information -- is altered
    > by compressing and decompressing the text. However, this is not always
    > the case for other types of data. In particular, video and audio
    > formats often incorporate some form of "lossy" compression where the
    > benefit of reduced size outweighs the potential degradation of the
    > original image or sample.
    >
    > Because Unicode incorporates the notion of canonical equivalence, the
    > line between "lossless" and "lossy" is not as clear as with other
    > character encoding standards. Conformance clause C10 says (roughly)
    > that a process may choose any canonical-equivalent sequence for a run of
    > text without altering the interpretation of the text. Compression of
    > Unicode text may be assumed either to (a) retain only the
    > interpretation, in which case this is acceptable, or (b) retain the
    > exact code points, in which case it is not.
    >
    > Mark indicated that a compression-decompression cycle should not only
    > stick to canonical-equivalent sequences, which is what C10 requires, but
    > should convert text only to NFC (if at all). Ken mentioned
    > normalization "to forms NFC or NFD," but I'm not sure this was in the
    > same context. (Can we find a consensus on this?)
    >
    > No substitution of compatibility equivalents or other privately defined
    > equivalents is acceptable. A compressor can obviously convert its input
    > to whatever representation it likes, but it must be able to recover the
    > original input exactly, or "equivalently" as described above.
    >
    > > Or to be more subtle about it, it might make a difference, but it is
    > > nonconformant to claim that a process which claims it does not make a
    > > difference is nonconformant.
    > >
    > > If you can parse that last sentence, then you are well on the way to
    > > understanding the Tao of Unicode.
    >
    > I had to read it a few times, but such things are necessary along the
    > Path of Enlightenment.
    >
    > -Doug Ewell
    > Fullerton, California
    > http://users.adelphia.net/~dewell/
    >
    >
    >
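
To make the distinction in the quoted text concrete, here is a minimal sketch in Python. It is only an illustration and not anything proposed in the thread: the helper names (canonically_equivalent, compress_nfc, decompress) are hypothetical, zlib merely stands in for whatever compressor is in use, and the choice of NFC as the target form is an assumption, following the suggestion attributed to Mark above. It uses only the standard unicodedata, zlib, and hashlib modules.

```python
import hashlib
import unicodedata
import zlib


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compress_nfc(text: str) -> bytes:
    """Hypothetical compressor that normalizes to NFC before compressing."""
    return zlib.compress(unicodedata.normalize("NFC", text).encode("utf-8"))


def decompress(blob: bytes) -> str:
    """Inverse of the zlib step only; the normalization is not undone."""
    return zlib.decompress(blob).decode("utf-8")


# Input in decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
original = "re\u0301sume\u0301"
round_tripped = decompress(compress_nfc(original))

# Case (b) in Doug's text: exact code points are NOT retained.
same_code_points = original == round_tripped

# Case (a): the interpretation IS retained (canonical equivalence holds).
same_interpretation = canonically_equivalent(original, round_tripped)

# A process sensitive to any change (checksums, signatures) sees a difference.
same_digest = (
    hashlib.sha256(original.encode("utf-8")).digest()
    == hashlib.sha256(round_tripped.encode("utf-8")).digest()
)

# Compatibility equivalents are a different matter: U+FB01 LATIN SMALL
# LIGATURE FI is not canonically equivalent to the two letters "fi".
ligature_ok = canonically_equivalent("\ufb01", "fi")

print(same_code_points, same_interpretation, same_digest, ligature_ok)
```

Under these assumptions the round trip preserves the interpretation but not the code points, which is why such a compressor is only acceptable when the decompressed text is not required to be codepoint-for-codepoint identical to the original.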


