RE: Normalisation and Greek characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Mar 17 2003 - 18:29:54 EST

    David Perry wrote:

    > Thanks to John for pointing me in the right direction; the normalization
    > charts were not helpful, but after spending some time with UAX#15 and
    > looking at the actual Unicode database, I see what is going on here.
    >
    > It seems strange to me that the Unicode book (where I initially looked)
    > simply gives the decomposition for U+1F71 as U+03B1 followed by U+0301
    > with no indication that anything else is involved (and likewise for
    > other characters with singleton decompositions, such as the angstrom).
    > If U+1F71 decomposes to U+03AC, why is the other decomposition still
    > given?

    I deduce from this that you must be looking at the Unicode *2.0*
    book, which did, indeed, give the canonical decomposition of
    U+1F71 as U+03B1 + U+0301.

    But the Unicode *3.0* book gives the canonical *mapping* of
    U+1F71 as U+03AC.

    The distinction lies in how the data was presented.
    Unicode 2.0 gave *full* decompositions for all characters.
    That was true for UnicodeData-2.0.14.txt, which was used to
    drive the formatting of the names list for the Unicode 2.0 book.

    But starting with the Unicode 2.1.9 update, the decompositions
    were given as decomposition mappings. Those constituted the
    *proximate* decompositions -- decompositions into the next
    closest stage. And the *full* decompositions had to be derived
    by recursive application of all applicable decomposition
    mappings. The reason for doing so is precisely the kind of
    problem you've run into: listing only the full decompositions
    loses information about the intermediate steps in the
    decomposition. Once Unicode 3.0 was published, and
    once Unicode normalization was defined in UAX #15, this
    distinction became vitally important to normalization, because
    of the recomposition rules. And if you look at the Unicode 3.0
    names list (which you can find online, if necessary), then
    only the singleton decomposition mapping for U+1F71 is shown,
    as noted above.
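
    To make the proximate/full distinction concrete, here is a minimal
    sketch in Python. It assumes the standard unicodedata module, whose
    decomposition() function returns the decomposition mapping field of
    the character database. The helper names proximate() and full() are
    made up for this illustration, and canonical reordering of combining
    marks is left out, so this is not a complete NFD implementation.

```python
import unicodedata


def proximate(ch):
    """Return the canonical decomposition mapping of ch (one step only).

    Characters with no mapping, or with only a compatibility mapping
    (tagged like "<compat>"), are returned unchanged.
    """
    field = unicodedata.decomposition(ch)
    if not field or field.startswith("<"):
        return ch
    return "".join(chr(int(cp, 16)) for cp in field.split())


def full(ch):
    """Derive the full canonical decomposition by applying the
    proximate mappings recursively (hypothetical helper; it does not
    perform canonical reordering)."""
    step = proximate(ch)
    if step == ch:
        return ch
    return "".join(full(c) for c in step)


def show(s):
    """Format a string as a list of U+XXXX code points."""
    return " ".join("U+%04X" % ord(c) for c in s)


if __name__ == "__main__":
    for ch in ("\u1f71", "\u212b"):
        print(show(ch), "proximate:", show(proximate(ch)))
        print(show(ch), "full:     ", show(full(ch)))
        print(show(ch), "NFC:      ", show(unicodedata.normalize("NFC", ch)))
```

    For U+1F71 the proximate step is the singleton U+03AC, and only the
    recursive step reaches U+03B1 + U+0301. The NFC line shows why the
    intermediate step matters for recomposition: the sequence recomposes
    to U+03AC, not back to U+1F71.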

    --Ken


