Re: Compression through normalization

From: Peter Kirk (peterkirk@qaya.org)
Date: Sat Dec 06 2003 - 11:22:03 EST

    On 06/12/2003 03:48, Philippe Verdy wrote:

    > ...
    >
    >But as C10 does not mandate any normalized form (just canonical equivalence
    >of the results), I don't think that it requires that a compressor should
    >produce its result in either NFC or NFD form.
    >
    >Instead I think that it's up to the next process to determine which
    >normalization form best fits its need: if the compressor was designed to
    >recompose to NFC, and the next process prefers NFD, the last
    >renormalization in the compressor will be superfluous.
    >
    >
    Bear in mind that according to C9

    > no process can assume that another process will make a distinction
    > between two different, but canonical-equivalent character sequences

    (including a distinction between normalisation forms) and therefore the
    next process is not supposed to rely on the (de)compressor to normalise
    into any particular form.

    >So for me, a compressor can choose its own normalization on input and apply
    >it before compressing, and the decompressor needs nothing else than just
    >decompressing and keeping the string in the form that was accepted or forced
    >on input by the compressor and encoded in the compressed stream.
    >
    >With this view, normalization of strings should not be done on output from a
    >process but on its input.
    >
    >
    This is what C9 seems to require. Normalisation on output is not
    forbidden, of course, but the next process is not supposed to rely on it
    having been done, certainly not to fail if it has not been.
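
    A minimal sketch of that arrangement, assuming Python's standard
    unicodedata and zlib modules. The function names (compress, decompress,
    consume) and the choice of NFC inside the compressor are illustrative
    assumptions, not anything specified in this thread:

```python
import unicodedata
import zlib


def compress(text: str, form: str = "NFC") -> bytes:
    # Normalise on input: the compressor picks its own form
    # (NFC here, purely as an assumption) before compressing.
    normalized = unicodedata.normalize(form, text)
    return zlib.compress(normalized.encode("utf-8"))


def decompress(blob: bytes) -> str:
    # The decompressor does nothing beyond decompressing; the text keeps
    # whatever form the compressor encoded into the stream.
    return zlib.decompress(blob).decode("utf-8")


def consume(text: str) -> str:
    # The next process does not rely on the form it receives. This
    # hypothetical consumer prefers NFD, so it normalises on its own input.
    return unicodedata.normalize("NFD", text)


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
```

    In this sketch the two canonical-equivalent inputs produce the same
    compressed stream, and consume() gets the form it wants regardless of
    what the compressor chose.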

    >One can save unnecessary normalizations across multiple processes in one
    >system, provided that the strings produced on output are reliably marked
    >(out of band with some meta-data) to indicate their current supported
    >normalization forms (i.e. whether the string is already in one or more of the 4
    >standardized normalization forms): this just requires 8 bits of information
    >on output of any process, in addition to the output string, with 2 bits per
    >form to mean: YES, NO, UNKNOWN/UNTESTED (and possibly MAYBE, if the input is
    >also MAYBE and the process does not force or check any normalization form
    >on input, this test being left for the next process if it needs it).
    >
    >
    I'm not sure how well this one agrees with C9. These 8 bits have to be
    communicated between the processes in question by some protocol separate
    from the Unicode text. I am not sure if a process is permitted to rely
    on such information.
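
    For concreteness, the 8 bits in the quoted proposal might be laid out as
    in the sketch below. The proposal only says "2 bits per form"; the bit
    positions, the numeric values of the four states, and the names Status,
    FORMS, pack and unpack are all assumptions made for illustration:

```python
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical 2-bit encoding of the four states named in the proposal.
    UNKNOWN = 0   # untested
    NO = 1
    YES = 2
    MAYBE = 3


# Hypothetical ordering: form i occupies bits 2*i and 2*i + 1.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def pack(statuses: dict) -> int:
    """Pack per-form statuses into one byte; missing forms are UNKNOWN."""
    byte = 0
    for i, form in enumerate(FORMS):
        byte |= int(statuses.get(form, Status.UNKNOWN)) << (2 * i)
    return byte


def unpack(byte: int) -> dict:
    """Recover the per-form statuses from a packed byte."""
    return {form: Status((byte >> (2 * i)) & 0b11)
            for i, form in enumerate(FORMS)}
```

    Whatever the layout, the byte travels outside the Unicode text itself,
    which is exactly the point in question with respect to C9.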

    >If this is not indicated (in the output from an external and not directly
    >supported process), then a fast-check on input may be used if this saves
    >work.
    >
    >
    >
    A fast check on input is of course sensible if the input is expected to
    be in a particular form e.g. if it is recommended by the higher level
    protocol in use; but if the fast check fails the process should not fail
    but should perform full normalisation.
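
    A minimal sketch of that fallback, assuming Python 3.8 or later, where
    the standard unicodedata module provides is_normalized(). The function
    name ensure_form and the default of NFC are illustrative assumptions:

```python
import unicodedata


def ensure_form(text: str, form: str = "NFC") -> str:
    # Check first: if the input is already in the expected form,
    # pass it through unchanged and skip the normalisation work.
    if unicodedata.is_normalized(form, text):
        return text
    # The check failed: do not reject the input, normalise it fully.
    return unicodedata.normalize(form, text)
```

    Input that is already normalised is returned as-is; anything else is
    normalised rather than refused.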

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/