Re: Compression through normalization

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Dec 05 2003 - 16:13:19 EST

    Doug asked:

    > Mark indicated that a compression-decompression cycle should not only
    > stick to canonical-equivalent sequences, which is what C10 requires, but
    > should convert text only to NFC (if at all). Ken mentioned
    > normalization "to forms NFC or NFD," but I'm not sure this was in the
    > same context. (Can we find a consensus on this?)

    I don't think either of our recommendations here is specific
    to compression issues.

    Basically, if a process tinkers around with changing sequences
    to their canonical equivalents, then it is advisable that
    the end result actually *be* in one of the normalization
    forms, either NFD or NFC, and that this be explicitly documented
    as what the process does. Otherwise, you are just tinkering
    and leaving the data in an indeterminate (although still
    canonically equivalent) state.
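    As a concrete illustration (my sketch, not anything mandated
    by the standard), here is what "land in a documented form"
    looks like in Python, whose standard unicodedata module
    implements the normalization forms; the helper name
    canonicalize is invented for this example:

        import unicodedata

        def canonicalize(text: str, form: str = "NFC") -> str:
            """Return text in a well-defined, documented normalization form.

            Rather than applying ad hoc canonical replacements, land the
            result in NFC or NFD so downstream processes know exactly
            what state the data is in.
            """
            if form not in ("NFC", "NFD"):
                raise ValueError("pick a canonical form: NFC or NFD")
            return unicodedata.normalize(form, text)

        # U+00E9 (precomposed e-acute) and U+0065 U+0301 (e + combining
        # acute) are canonically equivalent but different sequences.
        assert canonicalize("e\u0301") == "\u00e9"         # NFC composes
        assert canonicalize("\u00e9", "NFD") == "e\u0301"  # NFD decomposes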

    Mark recommended NFC in particular, since that is the "least
    marked" (*hehe*) normalization form, i.e., the one that you
    are most likely to encounter, and the one that most Internet
    or web processes are likely to prefer.
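    (A practical corollary, again just a sketch of mine rather than
    anything from the thread: since most real-world text is already
    in NFC, a process can cheaply check before rewriting anything.
    Python 3.8+, for instance, exposes a quick check for exactly
    this; the helper name ensure_nfc is invented here:)

        import unicodedata

        def ensure_nfc(text: str) -> str:
            # Fast path: most text is already NFC, and the quick check
            # avoids building a new string in that common case.
            if unicodedata.is_normalized("NFC", text):
                return text
            return unicodedata.normalize("NFC", text)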

    --Ken

    P.S. On the other hand, if you asked him nicely, Mark might
    find the more marked form, NFD, to his liking, especially
    since it is likely to contain more combining marks. Mark
    is definitely in favor of markedness. I, on the other hand,
    am definitely in favor of kennings, but we have found little
    practical or architectural use for them in the Unicode
    character-sea.


