From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Dec 05 2003 - 13:07:30 EST
I think you are missing a negative; see below.
Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄ ("One should wish to be defeated by one's own disciple")
----- Original Message -----
From: "Doug Ewell" <dewell@adelphia.net>
To: "Unicode Mailing List" <unicode@unicode.org>
Cc: "Kenneth Whistler" <kenw@sybase.com>; <mark.davis@jtcsv.com>
Sent: Fri, 2003 Dec 05 08:43
Subject: Re: Compression through normalization
> Kenneth Whistler <kenw at sybase dot com> wrote:
>
> > Canonical equivalence is about not modifying the interpretation of the
> > text. That is different from considerations about not changing the
> > text, period.
> >
> > If some process using text is sensitive to *any* change in the text
> > whatsoever (CRC checking, any form of digital signing, memory
> > allocation), then, of course, *any* change to the text, including any
> > normalization, will make a difference.
> >
> > If some process using text is sensitive to the *interpretation* of the
> > text, i.e. it is concerned with the content and meaning of the
> > letters involved, then normalization to forms NFC or NFD, which
> > involve only canonical equivalences, will *not* make a difference.
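To make Ken's distinction concrete, here is a small Python sketch (it uses
only unicodedata and zlib from the standard library; the strings are
illustrative):

    import unicodedata
    import zlib

    nfc = "\u00E9"                            # "é" as one precomposed code point
    nfd = unicodedata.normalize("NFD", nfc)   # "é" as "e" + U+0301 COMBINING ACUTE

    # A process sensitive to *any* change in the text, such as a CRC
    # check, sees the two forms as different:
    assert zlib.crc32(nfc.encode("utf-8")) != zlib.crc32(nfd.encode("utf-8"))

    # A process sensitive only to the *interpretation* does not, because
    # the two forms are canonically equivalent:
    assert unicodedata.normalize("NFC", nfc) == unicodedata.normalize("NFC", nfd)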
>
> All right. I think that is the missing piece I needed.
>
> How's this:
>
> Compression techniques may optionally replace certain sequences with
> canonically equivalent sequences to improve efficiency, but *only* if
> the decompressed output is expected to be
is not required to be
> codepoint-for-codepoint equivalent to the original. Whether this is
> true or not depends on the user and the intended use of the text.
>
> Text compression techniques are generally assumed to be "lossless,"
> meaning that no information -- including meta-information -- is altered
> by compressing and decompressing the text. However, this is not always
> the case for other types of data. In particular, video and audio
> formats often incorporate some form of "lossy" compression where the
> benefit of reduced size outweighs the potential degradation of the
> original image or sample.
>
> Because Unicode incorporates the notion of canonical equivalence, the
> line between "lossless" and "lossy" is not as clear as with other
> character encoding standards. Conformance clause C10 says (roughly)
> that a process may choose any canonically equivalent sequence for a run
> of text without altering the interpretation of the text. Compression of
> Unicode text may be assumed either to (a) retain only the
> interpretation, in which case substituting canonical equivalents is
> acceptable, or (b) retain the exact code points, in which case it is
> not.
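For example, in Python (U+212B is chosen only because it has a singleton
canonical decomposition):

    import unicodedata

    original  = "\u212B"                                 # ANGSTROM SIGN
    roundtrip = unicodedata.normalize("NFC", original)   # U+00C5, its NFC form

    # (a) The interpretation is retained: the two sequences are
    # canonically equivalent, so they normalize identically.
    assert (unicodedata.normalize("NFD", original)
            == unicodedata.normalize("NFD", roundtrip))

    # (b) The exact code points are not retained.
    assert original != roundtrip

A decompressor that returned the NFC form for this input would be
acceptable under reading (a) but not under reading (b).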
>
> Mark indicated that a compression-decompression cycle should not only
> stick to canonically equivalent sequences, which is what C10 requires, but
> should convert text only to NFC (if at all). Ken mentioned
> normalization "to forms NFC or NFD," but I'm not sure this was in the
> same context. (Can we find a consensus on this?)
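A sketch of what Mark's stricter reading implies, again in Python
(compress_text and decompress_text are hypothetical names, and zlib merely
stands in for whatever compression scheme is actually used):

    import unicodedata
    import zlib

    def compress_text(s: str) -> bytes:
        # Convert only to NFC (if at all) before compressing.
        return zlib.compress(unicodedata.normalize("NFC", s).encode("utf-8"))

    def decompress_text(data: bytes) -> str:
        return zlib.decompress(data).decode("utf-8")

The round trip recovers a canonically equivalent (NFC) form of the input,
not necessarily the original code points, so it is suitable only when
codepoint-for-codepoint identity is not required.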
>
> No substitution of compatibility equivalents or other privately defined
> equivalents is acceptable. A compressor can obviously convert its input
> to whatever representation it likes, but it must be able to recover the
> original input exactly, or "equivalently" as described above.
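By contrast, compatibility normalization visibly changes the text; a short
Python check:

    import unicodedata

    s = "\uFB01le\u00B2"   # LATIN SMALL LIGATURE FI + "le" + SUPERSCRIPT TWO
    assert unicodedata.normalize("NFKC", s) == "file2"

"ﬁle²" and "file2" are not canonically equivalent, so a compressor that
substituted the NFKC form would have changed the text itself, not just its
representation.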
>
> > Or to be more subtle about it, it might make a difference, but it is
> > nonconformant to claim that a process which claims it does not make a
> > difference is nonconformant.
> >
> > If you can parse that last sentence, then you are well on the way to
> > understanding the Tao of Unicode.
>
> I had to read it a few times, but such things are necessary along the
> Path of Enlightenment.
>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/