RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Dec 06 2003 - 06:48:23 EST

    Doug Ewell
    > OK, then I suppose I should play devil's advocate and ask Peter's and
    > Philippe's question again: If C10 only restricts the modifications to
    > "canonically equivalent sequences," why should there be an additional
    > restriction that further limits them to NFC or NFD? Or, put another
    > way, shouldn't such a restriction be part of C10, if it is important?

    First, C10 restricts modifications only in order to preserve all the
    semantics of the encoded text in any context. There are situations where
    this restriction does not apply: when performing text transformations such
    as folding, or even substringing, which may or may not respect canonical
    equivalence. Case folding applied to substrings does not: concatenating the
    foldings of the substrings does not always return canonically equivalent
    results, even if canonical cluster boundaries are preserved when
    substringing.
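
    To make the point about folding concrete, here is a small Python sketch.
    It does not reproduce the substring case described above; it only shows a
    related effect I am sure of: two canonically equivalent inputs can stop
    being canonically equivalent once they are case folded. The sample string
    and the helper name nfd are my own choices for illustration.

```python
import unicodedata

def nfd(s: str) -> str:
    """Canonical decomposition, used here to compare canonical equivalence."""
    return unicodedata.normalize("NFD", s)

# GREEK SMALL LETTER ALPHA, COMBINING GREEK YPOGEGRAMMENI (ccc 240),
# COMBINING ACUTE ACCENT (ccc 230).
a = "\u03b1\u0345\u0301"
# NFD reorders the two combining marks by combining class.
b = nfd(a)

# The two inputs are canonically equivalent by construction.
equivalent_before_folding = nfd(a) == nfd(b)

# Case folding maps U+0345 to the base letter iota (U+03B9), a starter,
# so the relative order of the marks now matters.
fa, fb = a.casefold(), b.casefold()
equivalent_after_folding = nfd(fa) == nfd(fb)

print(equivalent_before_folding, equivalent_after_folding)
```

    A process that folds text therefore cannot claim to preserve canonical
    equivalence unless it normalizes before (and possibly after) folding.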

    Compression of an existing text is not viewed as being a text
    transformation, so the intent of C10 should be observed, but only if the
    compressor claims that _it preserves_ canonical equivalence.

    But as C10 does not mandate any normalization form (just canonical
    equivalence of the results), I don't think it requires a compressor to
    produce its result in either NFC or NFD.

    Instead, I think it is up to the next process to determine which
    normalization form best fits its needs: if the compressor was designed to
    recompose to NFC and the next process prefers NFD, the final
    renormalization in the compressor will be superfluous.

    So for me, a compressor can choose its own normalization form and apply it
    to its input before compressing, and the decompressor needs to do nothing
    more than decompress, keeping the string in the form that was accepted or
    forced on input by the compressor and encoded in the compressed stream.
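
    A minimal sketch of that arrangement, in Python. The use of zlib over
    UTF-8 and the function names compress and decompress are assumptions made
    only for illustration; any compressor that claims to preserve canonical
    equivalence could be substituted.

```python
import unicodedata
import zlib

def compress(text: str, form: str = "NFC") -> bytes:
    """Normalize on input (the compressor's own choice), then compress."""
    normalized = unicodedata.normalize(form, text)
    return zlib.compress(normalized.encode("utf-8"))

def decompress(blob: bytes) -> str:
    """Only decompress; no renormalization is done on output."""
    return zlib.decompress(blob).decode("utf-8")

original = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
restored = decompress(compress(original))

# The round trip may change the code points, but not the canonical identity.
same_code_points = restored == original
canonically_equivalent = (
    unicodedata.normalize("NFD", restored)
    == unicodedata.normalize("NFD", original)
)
print(same_code_points, canonically_equivalent)
```

    The decompressed string stays in the form the compressor chose; a later
    process that wants NFD instead converts it on its own input.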

    With this view, normalization of strings should not be done on output from a
    process but on its input.

    One can save unnecessary normalizations across multiple processes in one
    system, provided that the strings produced on output are reliably marked
    (out of band, with some metadata) to indicate which normalization forms
    they currently satisfy (i.e. whether the string is already in one or more
    of the 4 standardized normalization forms). This requires just 8 bits of
    information on output of any process, in addition to the output string,
    with 2 bits per form to mean YES, NO, or UNKNOWN/UNTESTED (and possibly
    MAYBE, if the input is also MAYBE and the process neither forces nor checks
    any normalization form on input, this test being left for the next process
    if it needs it).
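
    One possible layout for those 8 bits, sketched in Python. The numeric
    values assigned to each status, the order of the forms within the byte,
    and the names Status, pack and unpack are all assumptions of mine; only
    the idea of 2 bits per form comes from the description above.

```python
from enum import IntEnum

# Assumed order of the four forms inside the byte (lowest bits first).
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

class Status(IntEnum):
    """Assumed 2-bit encoding of the per-form status."""
    UNKNOWN = 0
    NO = 1
    YES = 2
    MAYBE = 3

def pack(statuses: dict) -> int:
    """Pack one status per form into a single byte; missing forms are UNKNOWN."""
    byte = 0
    for i, form in enumerate(FORMS):
        byte |= int(statuses.get(form, Status.UNKNOWN)) << (2 * i)
    return byte

def unpack(byte: int) -> dict:
    """Recover the status of each form from the packed byte."""
    return {form: Status((byte >> (2 * i)) & 0b11)
            for i, form in enumerate(FORMS)}

flags = pack({"NFC": Status.YES, "NFKC": Status.NO})
print(flags, unpack(flags))
```

    The byte travels next to the string, not inside it, so a consumer that
    does not understand it can simply treat every form as UNKNOWN.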

    If this is not indicated (in the output from an external and not directly
    supported process), then a fast-check on input may be used if this saves
    work.
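
    A sketch of such an input-side check in Python, assuming Python 3.8 or
    later for unicodedata.is_normalized. The function name ensure_form and
    its known parameter (standing in for the out-of-band flag) are
    hypothetical.

```python
import unicodedata

def ensure_form(text: str, form: str = "NFC", known=None) -> str:
    """Return text in the requested form, normalizing only when needed.

    known is the out-of-band claim from the producer: True means the
    producer asserts the string is already in this form; anything else
    is treated as unknown and checked here.
    """
    if known is True:
        return text  # trust the producer's flag, do no work
    if unicodedata.is_normalized(form, text):
        return text  # already in the requested form
    return unicodedata.normalize(form, text)

print(ensure_form("e\u0301") == "\u00e9")
```

    Whether the check actually saves work depends on the input: for text
    that is mostly already normalized it avoids building a new string.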






    This archive was generated by hypermail 2.1.5 : Sat Dec 06 2003 - 07:45:10 EST