Re: Compression through normalization

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Dec 06 2003 - 12:49:52 EST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > First, C10 only restricts modifications so as to preserve all the
    > semantics of the encoded text in any context. There are situations
    > where this restriction does not apply: when performing text
    > transformations (such as folding, or even substringing, which may or
    > may not respect canonical equivalence: case folding applied to
    > substrings does not, because concatenating the folded substrings does
    > not always return canonically equivalent results, even if canonical
    > cluster bounds are preserved when substringing).

    Of course. C10 talks about a process that "purports not to modify the
    interpretation" of its input. I'm not sure how explicit transformations
    like folding and substringing came up.

    > Compression of an existing text is not viewed as being a text
    > transformation, so the intent of C10 should be observed, but only if
    > the compressor claims that _it preserves_ canonical equivalence.

    It had better do at least that. Any text compressor that transforms its
    input in a way not allowed by C10 would be a complete failure.

    > But as C10 does not mandate any normalized form (just canonical
    > equivalence of the results), I don't think that it requires that a
    > compressor should produce its result in either NFC or NFD form.

    Right. I know that. But Mark and Ken said it should, and so I'm trying
    to find out if this *SHOULD* be in C10, or alternatively how I would
    expect to find out about this extra recommendation if I weren't on the
    Unicode mailing list.

    Note that this is a devil's advocate argument. I'm not necessarily
    disagreeing with Mark and Ken's recommendations. I'm just trying to
    reconcile the differences between what they say (NFC or NFD only) and
    what C10 says (any canonically equivalent sequence).
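
    To make the gap between the two positions concrete, here is a minimal
    sketch using Python's standard unicodedata module. The helper name
    canonically_equivalent and the sample strings are my own illustrations,
    not anything defined by the standard. It shows a sequence that is in
    neither NFC nor NFD yet is still canonically equivalent to both, which
    is all that C10 asks for.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# "a" + COMBINING ACUTE (ccc 230) + COMBINING DOT BELOW (ccc 220):
# the marks are not in canonical order, so this is neither NFC nor NFD.
unordered = "a\u0301\u0323"

nfc = unicodedata.normalize("NFC", unordered)  # U+1EA1 followed by U+0301
nfd = unicodedata.normalize("NFD", unordered)  # "a", U+0323, U+0301

# All three are different code point sequences ...
all_distinct = len({unordered, nfc, nfd}) == 3
# ... but all three are canonically equivalent to one another.
all_equivalent = (canonically_equivalent(unordered, nfc)
                  and canonically_equivalent(unordered, nfd))
```

    Under C10 alone, a process could legitimately emit any of the three
    sequences; the NFC-or-NFD recommendation narrows that to the last two.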

    > Instead I think that it's up to the next process to determine which
    > normalization form best fits its need: if the compressor was designed
    > to recompose to NFC, and then the next process prefers NFD, the last
    > renormalization in the compressor will be superfluous.
    >
    > So for me, a compressor can choose its own normalization on input and
    > apply it before compressing, and the decompressor needs nothing else
    > than just decompressing and keeping the string in the form that was
    > accepted or forced on input by the compressor and encoded in the
    > compressed stream.
    >
    > With this view, normalization of strings should not be done on output
    > from a process but on its input.

    Yes. Correct. To paraphrase Peter Kirk, according to C9, if my Unicode
    text is going through Process A (which outputs NFD) and on to Process B
    (which wants NFC input), A does not have to convert the text to NFC just
    to appease B. Instead, it is up to B to do the work of converting to
    NFC.

    So, in a compressor/decompressor environment, the compressor is the one
    that has to do any normalization work. The decompressor just
    decompresses. This is consistent with what I wrote a few days ago about
    wanting this normalized-compression bit to work with existing
    decompressors.
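
    A rough sketch of that division of labor, assuming Python and using
    zlib purely as a stand-in for a real Unicode text compressor (the
    function names compress_text and decompress_text are hypothetical).
    The compressor normalizes its input; the decompressor knows nothing
    about normalization and simply reverses the byte-level compression.

```python
import unicodedata
import zlib

def compress_text(text: str, form: str = "NFC") -> bytes:
    """Normalize on input, then compress the UTF-8 bytes."""
    normalized = unicodedata.normalize(form, text)
    return zlib.compress(normalized.encode("utf-8"))

def decompress_text(blob: bytes) -> str:
    """Plain decompression; no normalization step is needed here."""
    return zlib.decompress(blob).decode("utf-8")

original = "re\u0301sume\u0301"   # decomposed accents (e + U+0301)
restored = decompress_text(compress_text(original))

# The round trip yields the NFC form chosen by the compressor, which is
# canonically equivalent to the input but not code-point identical.
is_nfc = restored == unicodedata.normalize("NFC", original)
is_equivalent = (unicodedata.normalize("NFD", restored)
                 == unicodedata.normalize("NFD", original))
```

    Because the normalization happens before the compressed stream is
    written, any existing decompressor for the same format would produce
    the same result.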

    > One can save unnecessary normalizations across multiple processes in
    > one system, provided that the strings produced on output are reliably
    > marked (out of band with some meta-data) to indicate their current
    > supported normalization forms (i.e. if the string is already in one or
    > more of the 4 standardized normalization forms): this just requires 8
    > bits of information on output of any process, in addition to the
    > output string, with 2 bits per form to mean: YES, NO, UNKNOWN/UNTESTED
    > (and possibly MAYBE, if the input is also MAYBE and the process does
    > not force or check any normalization form on input, this test being
    > left for the next process if it needs it).
    >
    > If this is not indicated (in the output from an external and not
    > directly supported process), then a fast-check on input may be used if
    > this saves work.
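
    The 8-bit marker described above could be sketched as follows. The
    bit layout, the numeric values of the four states, and the names
    Status, pack and unpack are all assumptions made for illustration;
    the quoted proposal does not specify any of them.

```python
from enum import IntEnum

class Status(IntEnum):
    """Two-bit state per normalization form (values are an assumption)."""
    UNKNOWN = 0
    YES = 1
    NO = 2
    MAYBE = 3

# Assumed layout: bits 0-1 NFC, 2-3 NFD, 4-5 NFKC, 6-7 NFKD.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def pack(statuses: dict) -> int:
    """Pack per-form states into one byte; missing forms are UNKNOWN."""
    byte = 0
    for i, form in enumerate(FORMS):
        byte |= int(statuses.get(form, Status.UNKNOWN)) << (2 * i)
    return byte

def unpack(byte: int) -> dict:
    """Recover the per-form states from the packed byte."""
    return {form: Status((byte >> (2 * i)) & 0b11)
            for i, form in enumerate(FORMS)}

# A process that emitted NFC (and therefore knows the text may not be NFD
# in general) but never tested the compatibility forms:
flags = pack({"NFC": Status.YES, "NFD": Status.NO})
decoded = unpack(flags)
```

    A consumer that needs NFC would skip normalization when it sees YES,
    normalize on NO, and fall back to a check on UNKNOWN or MAYBE.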

    Subprocesses within a closed system may be able to make certain
    assumptions for efficiency. Process B, for example, may know that its
    only source of input is Process A, which is guaranteed always to produce
    NFC. For any other situation, the algorithm described in Annex 8 of UAX
    #15 should be employed (not re-invented).
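
    As one example of using an existing check rather than writing a new
    one, Python 3.8 and later expose unicodedata.is_normalized, which
    reports whether a string is already in a given form. The wrapper name
    ensure_form below is hypothetical; this is only a sketch of how
    Process B might guard its input when it cannot trust the source.

```python
import unicodedata

def ensure_form(text: str, form: str = "NFC") -> str:
    """Return text in the requested form, normalizing only when needed."""
    if unicodedata.is_normalized(form, text):
        return text                      # already in form: no copy made
    return unicodedata.normalize(form, text)

already_nfc = "r\u00e9sum\u00e9"
not_nfc = "re\u0301sume\u0301"

unchanged = ensure_form(already_nfc)     # returned as-is
converted = ensure_form(not_nfc)         # recomposed to NFC
```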

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/