Re: Decomposition vs Full decomposition?

From: Philippe VERDY
Date: Wed Mar 16 2005 - 13:19:40 CST


    "Theodore H. Smith" <> wrote:
    > 3) I remember someone mentioning some special cases for normalisation,
    > that aren't included in UnicodeData.txt. But I don't remember what
    > these special cases were. Where do I read about them?

    I think this refers to "composition exclusions". This is a special rule that matters only for the composed normalization forms, NFC and NFKC. It affects the last step of these two transformations (the full canonical recomposition): some characters are canonically mapped to another single character (a singleton) or to a base+diacritic pair, but must not be "recomposed" the other way after the canonical reordering step.

    * You can find the excluded singletons directly in the main file: they are the characters whose decomposition mapping is canonical and consists of a single character. These characters are compatibility characters, and their canonical mapping gives their preferred form. For example, the Ångström sign (U+212B) is canonically mapped to a singleton, A WITH RING ABOVE (U+00C5), which is itself canonically decomposable to a letter+diacritic pair. When you recompose that pair, it becomes A WITH RING ABOVE, and you MUST NOT "recompose" this singleton back to the Ångström sign.

    * Precomposed characters whose canonical decomposition is a pair that must not be recomposed are listed in a separate file. These exclusions could have been marked in the main UCD file directly, by prefixing the canonical decomposition with a tag such as "<excluded>". However, applications that process the UCD assume that the presence of a tag means the mapping is a compatibility decomposition rather than a canonical one. They would then skip the decomposition, which would break their NFD implementation (it would not break NFKD, though). As the format of the main UCD file is apparently frozen, there was no other choice than to add another file to the UCD to list these characters.
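
    Both cases can be checked with a short sketch. This assumes Python's standard unicodedata module as the lookup tool (not something from the original question), and the helper names are made up for illustration. The Devanagari letter QA (U+0958) is used as an example of a character listed in the separate file, which is CompositionExclusions.txt in the UCD.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the canonical decomposition mapping of ch as a list of code points.

    Returns [] when ch has no mapping, or when the mapping carries a
    <tag>, which marks it as a compatibility mapping rather than a canonical one.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return []
    return [int(cp, 16) for cp in mapping.split()]


def is_singleton(ch):
    """True when ch canonically maps to exactly one other character."""
    return len(canonical_decomposition(ch)) == 1


def survives_nfc(ch):
    """True when NFC leaves ch unchanged, i.e. ch is not excluded from composition."""
    return unicodedata.normalize("NFC", ch) == ch


ANGSTROM_SIGN = "\u212B"   # singleton: canonically maps to U+00C5
A_WITH_RING = "\u00C5"     # decomposes to U+0041 U+030A and recomposes
DEVANAGARI_QA = "\u0958"   # decomposes to U+0915 U+093C, excluded from recomposition

for ch in (ANGSTROM_SIGN, A_WITH_RING, DEVANAGARI_QA):
    nfd = unicodedata.normalize("NFD", ch)
    nfc = unicodedata.normalize("NFC", ch)
    print(
        "U+%04X" % ord(ch),
        "singleton" if is_singleton(ch) else "not singleton",
        "NFD:", " ".join("U+%04X" % ord(c) for c in nfd),
        "NFC:", " ".join("U+%04X" % ord(c) for c in nfc),
    )
```

    With a conforming implementation, the Ångström sign should come back from NFC as A WITH RING ABOVE, and the Devanagari letter should stay decomposed as a pair, since neither may be recomposed to the original character.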
