Re: Still can't work out whats a "canonical decomp" vs a "compatibility decomp"

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed May 07 2003 - 17:30:41 EDT

    At 03:17 PM 5/7/03 -0400, John Cowan wrote:
    >Q: What's the difference between canonical and compatibility decomposition?
    >
    >A: Replacing a character by its canonical decomposition, which is either
    >one or two characters long, does not destroy information, and makes no
    >practical difference for most purposes.
    >
    >Replacing a character by its compatibility decomposition, which may be
    >of any length, does destroy information, but typically transforms the
    >character into better-known characters that may be easier to process.

    Actually, that describes the ideal - in the historic process of creating
    and maintaining these decompositions, that ideal has been compromised.
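    For reference, the ideal case can be checked with Python's standard
    unicodedata module. This is only a minimal sketch; the two sample
    characters are chosen here for illustration and do not come from the
    thread.

```python
import unicodedata

e_acute = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
fi_lig = "\ufb01"    # LATIN SMALL LIGATURE FI

# A canonical mapping has no <tag> in its decomposition field.
canonical_field = unicodedata.decomposition(e_acute)
# A compatibility mapping starts with a tag such as <compat>.
compat_field = unicodedata.decomposition(fi_lig)

# NFD applies canonical mappings only; NFKD also applies compatibility ones.
nfd_e = unicodedata.normalize("NFD", e_acute)      # "e" + COMBINING ACUTE ACCENT
nfd_fi = unicodedata.normalize("NFD", fi_lig)      # ligature left alone
nfkd_fi = unicodedata.normalize("NFKD", fi_lig)    # plain "f" + "i"

# NFC undoes the canonical decomposition, so no information was lost.
round_trip = unicodedata.normalize("NFC", nfd_e)
```

    Nothing maps "fi" back to the ligature, which is the sense in which the
    compatibility mapping loses information.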

    The canonical decompositions were applied to CJK compatibility characters,
    essentially negating their purpose, and causing big practical problems in
    all environments where they are used. It's arguable that they should have
    been made compatibility decompositions.
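    The effect is easy to reproduce. The sketch below assumes Python's
    standard unicodedata module and uses U+F900 as an arbitrary sample; it
    shows that every normalization form, NFC included, replaces the
    compatibility ideograph with its unified counterpart.

```python
import unicodedata

compat_ideograph = "\uf900"    # CJK COMPATIBILITY IDEOGRAPH-F900
unified_ideograph = "\u8c48"   # the unified ideograph it decomposes to

# No <tag> in the field, so this is a canonical (singleton) decomposition.
field = unicodedata.decomposition(compat_ideograph)

# Apply all four normalization forms to the lone compatibility character.
results = {
    form: unicodedata.normalize(form, compat_ideograph)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# True when no form preserves the original compatibility character.
never_survives = all(r == unified_ideograph for r in results.values())
```

    Because the distinction is gone after any normalization, a round trip
    to a legacy encoding that relied on it can no longer be exact.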

    The compatibility decompositions of positional Arabic forms in principle
    don't destroy any information - applying their compatibility decompositions
    makes little practical difference. As far as compatibility decompositions
    go, they are as close to canonical as they come.
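    As an illustration (a sketch assuming Python's unicodedata module, with
    the letter BEH picked arbitrarily), all four presentation forms fold to
    the same base letter under NFKD, and a shaping engine can select the
    positional form again from context.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH

# The four presentation forms of BEH in Arabic Presentation Forms-B.
positional_forms = {
    "isolated": "\ufe8f",
    "final": "\ufe90",
    "initial": "\ufe91",
    "medial": "\ufe92",
}

# First token of the decomposition field is the compatibility tag.
tags = {
    name: unicodedata.decomposition(ch).split()[0]
    for name, ch in positional_forms.items()
}

# NFKD maps every positional form to the base letter.
folded = {
    name: unicodedata.normalize("NFKD", ch)
    for name, ch in positional_forms.items()
}

# NFD (canonical only) leaves the presentation forms untouched.
untouched_by_nfd = all(
    unicodedata.normalize("NFD", ch) == ch for ch in positional_forms.values()
)
```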

    Finally, there are surprisingly many contexts in which applying
    compatibility decompositions doesn't merely destroy some information about
    the character, but can radically alter or destroy the meaning of the text.
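    Two small cases, sketched with Python's unicodedata module (the sample
    strings are illustrative assumptions, not taken from the thread), show
    how the meaning can change rather than merely lose detail.

```python
import unicodedata

# "2" followed by SUPERSCRIPT FIVE: reads as two to the fifth power.
power = "2\u2075"
folded_power = unicodedata.normalize("NFKC", power)

# "1" followed by VULGAR FRACTION ONE HALF: reads as one and a half.
mixed_number = "1\u00bd"
folded_mixed = unicodedata.normalize("NFKC", mixed_number)
```

    After folding, the first string is the number twenty-five, and the
    second is "1", "1", FRACTION SLASH, "2", which reads as eleven halves.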

    We would be better off with a different classification: (*)
    - informationally equivalent
    - semantically equivalent (or semantically neutral)
    - simplifying (or fuzzy equivalent)

    The first would be limited to a core of current canonical decompositions.
    The second would contain the CJK compatibility (canonical) decompositions, the
    Arabic positional forms (compatibility), etc.
    The third would contain the remainder, but would be augmented by other
    types of fuzzy equivalence not currently in compatibility mappings.
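    A rough way to see how existing mappings might sort into these three
    groups is sketched below. The function name proposed_class, the labels,
    and the sorting rules are hypothetical simplifications based only on the
    examples named in this message; they are not part of any standard, and
    Python's unicodedata module is assumed.

```python
import unicodedata

# Hypothetical groupings, derived only from the examples in this message.
POSITIONAL_TAGS = {"<initial>", "<medial>", "<final>", "<isolated>"}
WIDTH_TAGS = {"<wide>", "<narrow>"}


def is_cjk_compat_ideograph(ch):
    """True for code points in the two CJK Compatibility Ideographs blocks."""
    cp = ord(ch)
    return 0xF900 <= cp <= 0xFAFF or 0x2F800 <= cp <= 0x2FA1F


def proposed_class(ch):
    """Crude guess at the proposed category; None if no decomposition is listed."""
    field = unicodedata.decomposition(ch)
    if not field:
        return None
    first = field.split()[0]
    if not first.startswith("<"):
        # Canonical mapping: the CJK compatibility ideographs are moved out.
        return "semantic" if is_cjk_compat_ideograph(ch) else "informational"
    if first in POSITIONAL_TAGS:
        return "semantic"
    if first in WIDTH_TAGS:
        # Kept apart, since the right category for width folding is debatable.
        return "width"
    return "simplifying"
```

    For example, proposed_class("\u00e9") gives "informational", while
    proposed_class("\uf900") and proposed_class("\ufe8f") both give
    "semantic".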

    Mappings (foldings) like HalfWidth/FullWidth folding either go into the
    semantically neutral category or they may need a category of their own.
    They are fairly semantically neutral, but unlike the other two I gave as
    examples, they are fairly visible.
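    The width mappings carry their own tags in the character database, so
    they are easy to single out. A minimal sketch, assuming Python's
    unicodedata module and two arbitrarily chosen sample characters:

```python
import unicodedata

fullwidth_a = "\uff21"    # FULLWIDTH LATIN CAPITAL LETTER A
halfwidth_ka = "\uff76"   # HALFWIDTH KATAKANA LETTER KA

# The compatibility tag is the first token of the decomposition field.
wide_tag = unicodedata.decomposition(fullwidth_a).split()[0]
narrow_tag = unicodedata.decomposition(halfwidth_ka).split()[0]

# NFKC folds both to their ordinary-width counterparts.
folded = unicodedata.normalize("NFKC", fullwidth_a + halfwidth_ka)
```

    The folded text reads the same, but in a fixed-pitch East Asian layout
    the change in width is immediately visible.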

    (See for example http://www.unicode.org/reports/tr30
    which contains an earlier draft of a discussion of character folding, and
    which I plan to update soon.)

    This all fits, by the way, into the ongoing discussion of making
    Normalization tailorable, primarily in order to remove the deficiencies of
    having included some merely semantically equivalent mappings with the pure
    informational equivalences. (This primarily affects the CJK compatibility
    characters, but nobody, having learned from our previous experience, feels
    inclined to settle this once and for all. Hence the more general concept
    of 'tailoring', which would allow for better adjustments in the future.)

    A./

    (*) as we can't take away the existing decompositions, and their
    definitions, any such proposal would have to be considered as adding



    This archive was generated by hypermail 2.1.5 : Wed May 07 2003 - 18:32:00 EDT