From: Richard T. Gillam (firstname.lastname@example.org)
Date: Tue Mar 15 2005 - 09:33:18 CST
>1) Is "full decomposition" the same as "normalisation"?
No, and it's not the same as "decomposition" either. (I've heard the term "decomposition mapping" used for "decomposition", and I like it better.) The UnicodeData.txt file gives a decomposition mapping for each character that can decompose. For canonical decompositions, this is always a one- or two-character mapping. For both canonical and compatibility decompositions, the mapping may be to one or more characters that can themselves decompose. To get a character's "full decomposition," you keep replacing characters with their decomposition mappings until you get to a sequence of characters that don't decompose. (For compatibility decompositions, both canonical and compatibility mappings are used; for canonical decompositions, only canonical decomposition mappings are used.)
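The "keep replacing until nothing decomposes" process can be sketched in Python with the standard unicodedata module, whose decomposition() function returns exactly the UnicodeData.txt mapping described above (the function name full_decomposition is my own; this sketch does only decomposition, not the reordering step that normalization adds, and it ignores the algorithmic Hangul details):

```python
import unicodedata

def full_decomposition(s, compat=False):
    """Repeatedly apply decomposition mappings until no character decomposes.

    Canonical-only by default; pass compat=True to also apply
    compatibility mappings (those whose UnicodeData.txt entry
    carries a <tag> such as <compat> or <font>).
    """
    result = []
    for ch in s:
        mapping = unicodedata.decomposition(ch)  # e.g. "0113 0301" or "<compat> 0066 0069"
        if not mapping:
            result.append(ch)
            continue
        parts = mapping.split()
        if parts[0].startswith("<"):
            # Tagged mapping = compatibility decomposition.
            if not compat:
                result.append(ch)  # canonical decomposition leaves it alone
                continue
            parts = parts[1:]
        decomposed = "".join(chr(int(cp, 16)) for cp in parts)
        # The mapped-to characters may themselves decompose, so recurse.
        result.append(full_decomposition(decomposed, compat))
    return "".join(result)
```

For example, U+1E17 (e with macron and acute) maps to U+0113 U+0301, and U+0113 in turn maps to "e" U+0304, so its full canonical decomposition is the three-character sequence e + combining macron + combining acute.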
Normalization involves not just decomposition, but also canonical reordering and (for some normalizations) recomposition.
>2) Is normalisation differing from decomposition, by the ordering of
>combining chars? IE, if I had a "double dots above the letter" (like
>ü), and "Squiggly thing below the letter" (like Ç), both on the same
>letter, then I suppose they should take on a certain ordering, correct?
>One of the combiners should always come before the other one.
>Is that what makes normalisation differ from decomposition?
Yes, this is one of the things. Every combining mark is assigned a "combining class," which is a more-or-less-arbitrary numeric value. After applying all your decomposition mappings, sequences of combining marks (with a combining class greater than 0) are sorted by their combining class (marks with the same combining class keep their relative order, since it's significant).
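A sketch of that reordering step in Python (canonical_reorder is my own name; unicodedata.combining() returns the combining class, and Python's sort is stable, which is what preserves the relative order of marks in the same class):

```python
import unicodedata

def canonical_reorder(s):
    """Stable-sort each run of combining marks (combining class > 0).

    Assumes s is already fully decomposed; marks that share a
    combining class keep their relative order (significant!).
    """
    chars = list(s)
    i = 0
    while i < len(chars):
        if unicodedata.combining(chars[i]) == 0:
            i += 1  # base character (class 0): leave in place
            continue
        j = i
        while j < len(chars) and unicodedata.combining(chars[j]) > 0:
            j += 1  # find the end of this run of combining marks
        chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
        i = j
    return "".join(chars)
```

Using the questioner's example: a cedilla below (U+0327, class 202) sorts before dots above (U+0308, class 230), so "u" + dots + cedilla reorders to "u" + cedilla + dots, matching what NFD produces.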
The other thing that some normalizations do is recombine certain base-plus-combining-mark sequences. The basic idea is that a combining character sequence that forms the full canonical decomposition of some character can be replaced by that character. (It's actually more complicated than this: certain characters are excluded from recomposition, either for backward-compatibility or linguistic reasons, and sometimes the sequences that get replaced don't have to be contiguous.)
So it breaks down like this:
NFD: Canonical decomposition followed by canonical reordering
NFKD: Compatibility decomposition followed by canonical reordering
NFC: Canonical decomposition, followed by canonical reordering, followed by canonical recomposition
NFKC: Compatibility decomposition, followed by canonical reordering, followed by canonical recomposition
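You can see all four pipelines in action with Python's unicodedata.normalize(). Take the ligature ﬁ (U+FB01, which has only a compatibility decomposition to "fi") followed by "e" plus combining acute (which canonically recomposes to é):

```python
import unicodedata

s = "\ufb01e\u0301"  # ﬁ ligature + "e" + combining acute accent

nfd  = unicodedata.normalize("NFD", s)   # ﬁ untouched, é stays decomposed
nfc  = unicodedata.normalize("NFC", s)   # ﬁ untouched, e + acute recomposed to é
nfkd = unicodedata.normalize("NFKD", s)  # ﬁ split to "fi", é stays decomposed
nfkc = unicodedata.normalize("NFKC", s)  # ﬁ split to "fi", é recomposed
```

The canonical forms (NFD, NFC) leave the ligature alone because its only mapping is a compatibility one; only the K forms break it apart.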
>3) I remember someone mentioning some special cases for normalisation,
>that aren't included in UnicodeData.txt. But I don't remember what
>these special cases were. Where do I read about them?
UAX #15 has all the gory details.
>I do find the Unicode information quite heavy going and complex... but
>that is the way with technical standards, XML wasn't much better. The
>writers of technical standards rarely seem to have the talents that a
>writer of books like "C in Plain English" has, or the O'Reilly in a
There's a book out called "Unicode Demystified" that purports to translate a lot of this stuff into "plain English." I've heard it's pretty good. :-)
Language Analysis Systems, Inc.
This archive was generated by hypermail 2.1.5 : Tue Mar 15 2005 - 09:34:32 CST