Re: Still can't work out what's a "canonical decomp" vs a "compatibility decomp"

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed May 07 2003 - 20:18:13 EDT

    Asmus wrote:

    > Both our external environment and our practical experience with the use and
    > effect of decompositions has expanded since they were designed nearly 10
    > years ago. It's time to take the consequences. If the existing
    > decompositions are essentially frozen (and I agree that they must be), that
    > means adding additional properties, so implementers can get back a clear
    > set of mappings that are graduated by their effect and suitable context of
    > applicability.

    Amen, brother! Testi-fie!

    Seriously, the existing decompositions, which have a long history,
    and which were originally created (starting in 1993) as a kind
    of set of helpful annotations, before they morphed into the
    basis for the formal normalization framework they now serve, are
    often getting in the way of people understanding the Unicode Standard,
    rather than helping them.
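
    (Since the subject line is asking exactly this, the distinction
    is easiest to see by running the two kinds of mappings through
    a normalizer. Here is a rough sketch using Python's unicodedata
    module -- my illustration, not anything from the standard itself.
    A canonical decomposition is applied by NFD and NFKD alike; a
    compatibility decomposition, the kind tagged with a <keyword> in
    UnicodeData.txt, is applied only by the K forms.)

        import unicodedata

        # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE has a *canonical*
        # decomposition: NFD and NFKD both split it into A + combining ring.
        print(unicodedata.normalize("NFD", "\u00C5"))    # 'A\u030a'
        print(unicodedata.normalize("NFKD", "\u00C5"))   # 'A\u030a'

        # U+FB01 LATIN SMALL LIGATURE FI has only a *compatibility*
        # decomposition: NFD leaves it alone, NFKD turns it into 'fi'.
        print(unicodedata.normalize("NFD", "\uFB01"))    # '\ufb01'
        print(unicodedata.normalize("NFKD", "\uFB01"))   # 'fi'

        # The raw field from UnicodeData.txt is visible too; note the
        # <compat> tag marking the second mapping as compatibility.
        print(unicodedata.decomposition("\u00C5"))       # '0041 030A'
        print(unicodedata.decomposition("\uFB01"))       # '<compat> 0066 0069'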

    Only people who have had a long, continued experience with the
    twists and turns of the last decade, or who make the effort
    to lay out the Unicode 1.0, 1.1, 2.0, and 3.0 documentation
    side-by-side and to fire up the greps and diffs on all the
    versions of UnicodeData.txt over the years can really follow
    what has gone on or why many of these mappings ended up the
    way they are now.

    I agree that it is probably time to start on the process of
    creating a new set of more nuanced (and documented) equivalence
    mappings for the Unicode Standard -- ones that are not
    encumbered by the immutability implied by the Normalization
    algorithm.

    Who knows, it could even become a fun group project, where
    one person gets to track down all the instances of characters
    that are equivalent to a base letter + accent sequence,
    another gets to track down all instances of characters
    that might evaluate to 6 (including, for instance, U+03DD
    GREEK SMALL LETTER DIGAMMA), another gets to track down
    all the glottal stops (including U+02BB, U+02BC, U+02BE,
    U+02C0, and -- trivia question, not yet in Unicode 4.0 --
    U+097D DEVANAGARI GLOTTAL STOP), and another gets to track
    down all the characters whose glyphs look like a dot, ...
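
    (A sketch of what that first hunt might look like, using Python's
    unicodedata module as a stand-in for grepping UnicodeData.txt
    directly -- the range scanned and the output format are arbitrary
    choices of mine, and this reports only the first-level mapping,
    not the fully recursive NFD result:)

        import unicodedata

        # Walk part of the BMP and report characters whose canonical
        # decomposition is exactly one base character plus one combining mark.
        for cp in range(0x0080, 0x3000):
            decomp = unicodedata.decomposition(chr(cp))
            if not decomp or decomp.startswith("<"):
                continue   # no mapping, or a <tagged> compatibility mapping
            parts = decomp.split()
            if len(parts) == 2:
                base, mark = [chr(int(p, 16)) for p in parts]
                if unicodedata.combining(mark):
                    print("U+%04X %s -> %s + U+%04X"
                          % (cp, unicodedata.name(chr(cp), "?"), base, ord(mark)))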

    Another consideration to keep in mind is that the compatibility
    decompositions have always been implicated in an oft-suggested,
    never-completed project for "Cleanicode" -- Unicode
    as she ought to have been, if legacy compatibility hadn't
    been an issue for the encoding. I think there may still be
    some value in someone trying to sift out all the legacy
    compatibility hacks in Unicode to express how the various
    scripts (and symbol sets) could have been encoded right (and
    in some cases, still can be implemented). The Braille symbol
    set is a *good* example: it is a complete, rationalized set,
    and it is hard to imagine, now, doing it any differently.
    Korean represents the opposite extreme, with three different
    legacy representations (precomposed Hangul syllables,
    compatibility jamos, half-width jamos) in addition to the
    recommended conjoining jamos. And even the conjoining jamos
    still have some issues when applied to Old Korean syllables.
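
    (For the record, normalization does tie those Korean
    representations together -- a quick Python check, purely
    illustrative: NFD decomposes a precomposed syllable into
    conjoining jamos, and NFKD folds the compatibility and
    half-width jamos onto them as well.)

        import unicodedata

        # Precomposed syllable U+AC00 HANGUL SYLLABLE GA: canonical
        # decomposition into conjoining jamos U+1100 + U+1161.
        print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\uAC00")])

        # Compatibility jamo U+3131 and half-width jamo U+FFA1 both
        # carry compatibility decompositions down to conjoining U+1100.
        print(hex(ord(unicodedata.normalize("NFKD", "\u3131"))))
        print(hex(ord(unicodedata.normalize("NFKD", "\uFFA1"))))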

    That Cleanicode project is (or ought to be) distinct from the
    kind of project Asmus has in mind: providing more precise,
    graduated equivalence mappings that implementations can use to
    actually produce the results people expect, but which they may
    not get today from the normalization forms alone.
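
    (One concrete taste of the gap, again as a purely illustrative
    Python sketch: a loose-match function built from nothing but the
    standard normalization forms plus case folding copes fine with
    ligatures and precomposed accents, but it can never equate the
    apostrophe-like letters mentioned above, because those carry no
    decomposition mappings at all.)

        import unicodedata

        def loose_match(a, b):
            # "Best effort" matching using only standard machinery:
            # compatibility decomposition (NFKD) followed by case folding.
            fold = lambda s: unicodedata.normalize("NFKD", s).casefold()
            return fold(a) == fold(b)

        # Normalization handles the fi ligature and the precomposed accent:
        print(loose_match("\uFB01le\u0301", "FILE\u0301"))   # True

        # ...but U+02BB MODIFIER LETTER TURNED COMMA (the Hawaiian okina)
        # has no decomposition, so no normalization form will ever
        # unify it with U+0027 APOSTROPHE.
        print(loose_match("Hawai\u02BBi", "Hawai'i"))        # False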

    --Ken


