Re: Compression through normalization

From: Peter Kirk (peterkirk@qaya.org)
Date: Sat Dec 06 2003 - 13:58:26 EST

    On 06/12/2003 09:49, Doug Ewell wrote:

    > ...
    >
    >>But as C10 does not mandate any normalized form (just canonical
    >>equivalence of the results), I don't think that it requires that a
    >>compressor should produce its result in either NFC or NFD form
    >>
    >>
    >
    >Right. I know that. But Mark and Ken said it should, and so I'm trying
    >to find out if this *SHOULD* be in C10, or alternatively how I would
    >expect to find out about this extra recommendation if I weren't on the
    >Unicode mailing list.
    >
    >
    My feeling is that it should NOT be in C10. Apart from C14-C16, the
    conformance clauses do not mention normalisation, which is a convenience
    and a recommendation rather than a conformance issue; and even C14-C16
    do not specify that normalisation should be done, only how it must be
    done if it is done. I would also argue that adding such a rule to C10
    would conflict with C9, or at least with the principle underlying it:
    that for conformance purposes all canonically equivalent forms are equal
    and indistinguishable.
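
    To make the C9 principle concrete, here is a minimal sketch in Python
    (my choice of language; the helper name canonically_equivalent is
    hypothetical, not anything defined by the standard). It treats two
    strings as equivalent when their NFD forms match, which is one common
    way to test canonical equivalence:

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The code point sequences differ, but for conformance purposes the two
# spellings are supposed to be interchangeable.
print(precomposed == decomposed)
print(canonically_equivalent(precomposed, decomposed))
```

    A process that relies only on this kind of comparison has no reason to
    care which of the equivalent spellings it was handed.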

    >Note that this is a devil's advocate argument. I'm not necessarily
    >disagreeing with Mark and Ken's recommendations. I'm just trying to
    >reconcile the differences between what they say (NFC or NFD only) and
    >what C10 says (any canonically equivalent sequence).
    >
    >
    >
    I would agree with these recommendations as long as they remain
    recommendations rather than conformance requirements, and are perhaps
    added to section 5.6 of TUS (4.0), which already includes the sentence
    (in the context of canonically equivalent alternative spellings):

    > Implementations that are “liberal” in what they accept, but
    > “conservative” in what they issue, will have the fewest compatibility
    > problems.

    >>...
    >>With this view, normalization of strings should not be done on output
    >>from a process but on its input.
    >>
    >>
    >
    >Yes. Correct. To paraphrase Peter Kirk, according to C9, if my Unicode
    >text is going through Process A (which outputs NFD) and on to Process B
    >(which wants NFC input), A does not have to convert the text to NFC just
    >to appease B. Instead, it is up to B to do the work of converting to
    >NFC.
    >
    >
    A more accurate paraphrase would be that Process A may output any
    canonically equivalent form, perhaps one which is not a normalisation
    form at all, e.g. because it contains precomposed characters that are
    composition exclusions.
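
    As an illustration of such a form (the choice of character is mine,
    not something raised earlier in the thread): U+0958 DEVANAGARI LETTER
    QA is a composition exclusion, so a string containing it is in neither
    NFC nor NFD, yet it is canonically equivalent to both. A small Python
    sketch:

```python
import unicodedata

qa = "\u0958"  # DEVANAGARI LETTER QA, a composition exclusion

nfc = unicodedata.normalize("NFC", qa)
nfd = unicodedata.normalize("NFD", qa)

# Both normalisation forms decompose QA into KA + NUKTA, because the
# exclusion prevents NFC from recomposing the pair.
print([hex(ord(c)) for c in nfc])
print([hex(ord(c)) for c in nfd])

# The original string is therefore neither NFC nor NFD.
print(qa == nfc, qa == nfd)
```

    Process A is entitled under C9 to emit the single code point, and
    Process B must cope with it if it needs NFC or NFD.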

    >So, in a compressor/decompressor environment, the compressor is the one
    >that has to do any normalization work. The decompressor just
    >decompresses. This is consistent with what I wrote a few days ago about
    >wanting this normalized-compression bit to work with existing
    >decompressors.
    >
    >
    >
    For "any normalization work" I would prefer to say "any canonically
    equivalent transformation work". The compressor performs whatever
    canonically equivalent transformations it chooses, perhaps because they
    produce the best compression. The decompressor is only obliged to
    decompress; C9 implies that it is not obliged to normalise and that no
    other process can rely on it doing so - although it is free to do so and
    may choose to do so if that is expected to improve overall efficiency.
    Mark and Ken's recommendation is indeed that it should do so, and I have
    no quarrel with that.
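
    A rough sketch of that division of labour, in Python. The assumptions
    are all mine: zlib stands in for whatever compression scheme is really
    used, NFC is picked arbitrarily as the form the compressor prefers,
    and the function names are hypothetical.

```python
import unicodedata
import zlib

def compress(text: str, form: str = "NFC") -> bytes:
    # The compressor may substitute any canonically equivalent sequence;
    # here it happens to choose a normalisation form.
    return zlib.compress(unicodedata.normalize(form, text).encode("utf-8"))

def decompress(blob: bytes) -> str:
    # The decompressor only decompresses; it does not normalise.
    return zlib.decompress(blob).decode("utf-8")

original = "re\u0301sume\u0301"          # decomposed spelling
restored = decompress(compress(original))

# The round trip is not code-point-identical, only canonically equivalent.
print(restored == original)
print(unicodedata.normalize("NFD", restored)
      == unicodedata.normalize("NFD", original))
```

    Nothing downstream may assume that the restored text is in any
    particular form unless the decompressor documents that it normalises.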

    > ...
    >
    >Subprocesses within a closed system may be able to make certain
    >assumptions for efficiency. Process B, for example, may know that its
    >only source of input is Process A, which is guaranteed always to produce
    >NFC. ...
    >
    Does C9 actually allow this? Well, perhaps within a closed system, but
    then standardisation, and so Unicode, is irrelevant to data transfer
    between sub-processes within a closed system. Outside a closed system,
    Process B's best assumption for efficiency may be that Process A has
    *probably* normalised, and so it is worth performing a quick check
    first, before a full normalisation.
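
    The "check first, normalise only if needed" idea might look like this
    in Python. This is a sketch under assumptions: it needs Python 3.8 or
    later for unicodedata.is_normalized, the function name to_nfc is
    hypothetical, and I am taking it on trust that the library's check is
    cheaper than a full normalisation pass for text that is already NFC.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Return text in NFC, skipping the conversion when it is already NFC."""
    if unicodedata.is_normalized("NFC", text):
        return text                      # common case: nothing to do
    return unicodedata.normalize("NFC", text)

already_nfc = "caf\u00e9"
not_nfc = "cafe\u0301"

print(to_nfc(already_nfc) == already_nfc)
print(to_nfc(not_nfc) == already_nfc)
```

    Process B pays for the full conversion only when Process A turns out
    not to have normalised after all.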

    >... For any other situation, the algorithm described in Annex 8 of UAX
    >#15 should be employed (not re-invented).
    >
    >
    >

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

