Re: Compression through normalization

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Dec 05 2003 - 05:51:39 EST


    On 05/12/2003 00:34, Doug Ewell wrote:

    >Peter Kirk <peterkirk at qaya dot org> wrote:
    >
    >>Surely ignoring Composition Exclusions is not unilaterally extending
    >>C10. The excluded precomposed characters are still canonically
    >>equivalent to the decomposed (and normalised) forms. And so composing
    >>a text with them, for compression or any other purpose, still conforms
    >>to C10, which explicitly allows "replacement of character sequences by
    >>their canonical-equivalent sequences" - not only when the resulting
    >>sequence is NFC or NFD.
    >
    >Ignoring the composition exclusions does still respect canonical
    >equivalence, but does not preserve a canonical normalization form (using
    >the language of UAX #15). So although it is not a violation of C10, it
    >does seem to run afoul of Mark's recommendation:
    >
    >"In practice, if a compressor does not produce codepoint-identical text,
    >it should produce NFC (not just any canonically equivalent text), and
    >should document that it does so."
    >
    OK. So it's Mark, not me, who is unilaterally extending C10. Well, Ken
    said much the same, so it's bilateral; and I agree it is a sensible
    extension.

    But, as Ken also pointed out, it is quite permissible to use any
    encoding for the intermediate (e.g. compressed) form of the text, as
    long as the normalised form of the original text can be recovered from
    it. My suggestion of composing the text with the precomposed characters
    listed in the Composition Exclusions meets this test, in a way not met
    by some of the other suggestions, e.g. composing Korean characters into
    precomposed forms which are (sadly) not canonically equivalent.
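
To make the recoverability point concrete, here is a minimal Python sketch using the standard unicodedata module. It is only an illustration under stated assumptions: the function names are hypothetical, and it handles a single excluded character, U+0958 DEVANAGARI LETTER QA, whose canonical decomposition is U+0915 U+093C. A real compressor would presumably work from the whole Composition Exclusions list.

```python
import unicodedata

# U+0958 DEVANAGARI LETTER QA is a composition exclusion: it has a canonical
# decomposition (U+0915 U+093C), but NFC never produces it.
EXCLUDED = "\u0958"
DECOMPOSED = "\u0915\u093C"


def compose_ignoring_exclusions(text: str) -> str:
    """Hypothetical compressor step: normalise to NFC, then substitute the
    excluded precomposed character for its canonical decomposition."""
    nfc = unicodedata.normalize("NFC", text)
    return nfc.replace(DECOMPOSED, EXCLUDED)


def recover(intermediate: str) -> str:
    """Hypothetical decompressor step: renormalising to NFC undoes the
    substitution, because NFC decomposes excluded characters and does not
    recompose them."""
    return unicodedata.normalize("NFC", intermediate)


# QA followed by the vowel sign AA, already in NFC.
original = unicodedata.normalize("NFC", "\u0915\u093C\u093E")
intermediate = compose_ignoring_exclusions(original)
restored = recover(intermediate)

# The intermediate form is shorter, yet canonically equivalent to the original.
print(len(original), len(intermediate), restored == original)
```

The intermediate text is not in NFC, but it stays canonically equivalent to the original, so normalising it again yields the original NFC text.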

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

