From: Jon Hanna (firstname.lastname@example.org)
Date: Tue Mar 15 2005 - 08:05:24 CST
> I've been struggling to understand what the UCD.html file means, when
> it talks about decompositions and full decompositions, and combining
Some of the characters produced by a decomposition can themselves be
decomposed. When you take a character and recursively decompose it (i.e.
decompose it, and then decompose the characters produced until there are no
more decomposables) that is a full decomposition. In practice that recursive
process can be optimised (a speed/size trade-off) to a single step.
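
To make the recursion concrete, here is a minimal sketch in Python, assuming the standard unicodedata module (whose decomposition() call returns the single-step mapping from UnicodeData.txt). The function name full_decomposition and its compat parameter are my own, for illustration only. Hangul syllables are decomposed algorithmically rather than through that mapping, so this sketch does not cover them.

```python
import unicodedata

def full_decomposition(ch, compat=False):
    """Recursively expand one character until nothing more decomposes.

    Hypothetical helper: canonical mappings are always followed;
    compatibility mappings (tagged like <compat>) only when compat=True.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch                      # no decomposition: already atomic
    parts = mapping.split()
    if parts[0].startswith("<"):       # a compatibility formatting tag
        if not compat:
            return ch
        parts = parts[1:]
    # Decompose each produced character in turn (the recursive step).
    return "".join(full_decomposition(chr(int(p, 16)), compat) for p in parts)

# U+01D5 maps to U+00DC U+0304, and U+00DC maps in turn to U+0055 U+0308.
print([f"U+{ord(c):04X}" for c in full_decomposition("\u01d5")])
```

The single-step optimisation mentioned above amounts to precomputing the result of this function for every character and storing it in a table.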
> It's quite wordy, and uses ambiguous terms (to the outsider).
They are unambiguous, but yes, you do need to be familiar with the
terminology. This is a frequent complaint about computer standards, though
having been sailing in the past I have to say computer standards are far
from the most confusing area in this regard.
> 1) Is "full decomposition" the same as "normalisation"?
> 2) Is normalisation differing from decomposition, by the ordering of
> combining chars? IE, if I had a "double dots above the letter" (like
> ü), and "Squiggly thing below the letter" (like Ç), both on the same
> letter, then I suppose they should take on a certain
> ordering, correct?
> One of the combiners should always come before the other one.
> Is that what makes normalisation differ from decomposition?
You are close here. There are four different forms of normalisation. Two
operate by performing a full decomposition followed by a re-ordering of
combining marks into canonical order. One of these (Normalisation Form D, or
NFD) applies only canonical decompositions; the other (NFKD) also applies
compatibility decompositions. The result is that NFD produces text that, to
a fully compliant Unicode text process, is identical in meaning to the
original. NFKD loses some semantics, but this degree of "fuzziness" is
useful in some applications.
The two other forms, NFC and NFKC, begin by performing the operations of NFD
and NFKD respectively, and then re-combine some characters, so the resulting
text is generally shorter in terms of codepoints than the decomposed forms.
There are some other important advantages; for example, text in many legacy
character sets, including some significant to some other technologies, will
go through NFC unchanged.
See <http://www.unicode.org/reports/tr15/> for more.
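
The four forms can be compared side by side with a short sketch, again assuming Python's standard unicodedata module. It uses the example from the question (a diaeresis and a cedilla on the same letter) to show the canonical reordering, and a ligature to show where the K forms differ. The variable names are mine, for illustration only.

```python
import unicodedata

# Diaeresis (combining class 230) typed before cedilla (class 202).
typed = "u\u0308\u0327"
print(unicodedata.combining("\u0308"), unicodedata.combining("\u0327"))

# Canonical ordering sorts marks on one base by combining class,
# so the cedilla ends up before the diaeresis.
nfd = unicodedata.normalize("NFD", typed)

# U+FB01 LATIN SMALL LIGATURE FI has only a compatibility decomposition,
# so the canonical forms leave it alone and the K forms replace it.
ligature = "\ufb01"
forms = {
    name: unicodedata.normalize(name, ligature)
    for name in ("NFD", "NFKD", "NFC", "NFKC")
}

# NFC re-combines a base letter and mark into the precomposed character.
composed = unicodedata.normalize("NFC", "u\u0308")

# Text converted from a legacy character set (Latin-1 here) typically
# arrives precomposed and passes through NFC unchanged.
legacy = b"caf\xe9".decode("latin-1")
unchanged = unicodedata.normalize("NFC", legacy) == legacy

print([f"U+{ord(c):04X}" for c in nfd], forms, composed, unchanged)
```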
> 3) I remember someone mentioning some special cases for
> that aren't included in UnicodeData.txt. But I don't remember what
> these special cases were. Where do I read about them?
Exclusions listed at
<http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt> are not
recombined in NFC and NFKC. Again, see
<http://www.unicode.org/reports/tr15/> for more.
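
As an illustration of what an exclusion means in practice, here is a small sketch assuming Python's standard unicodedata module. It relies on U+0958 DEVANAGARI LETTER QA being one of the characters listed in CompositionExclusions.txt; the helper name round_trip is hypothetical.

```python
import unicodedata

excluded = "\u0958"   # DEVANAGARI LETTER QA, a composition exclusion
ordinary = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, not excluded

def round_trip(ch):
    """Return (NFD form, NFC of that NFD form) for one character."""
    nfd = unicodedata.normalize("NFD", ch)
    nfc = unicodedata.normalize("NFC", nfd)
    return nfd, nfc

for ch in (excluded, ordinary):
    nfd, nfc = round_trip(ch)
    # An excluded character decomposes but is never put back together,
    # so its NFC form is the same as its NFD form.
    print(f"U+{ord(ch):04X}",
          [f"U+{ord(c):04X}" for c in nfd],
          [f"U+{ord(c):04X}" for c in nfc])
```

One consequence is that a string containing an excluded character is not itself in NFC, even though it is a single precomposed codepoint.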
> writers of technical standards rarely seem to have the talents that a
> writer of books like "C in Plain English" has, or the O'Reilly in a
> nutshell books.
The writers of technical standards frequently have such talents, and
frequently are also the authors of such books. However the degree of
hand-holding and turning a blind eye to relatively minor technical points is
as unnecessary in such a case as it is for you to insult them for doing so.
As a rule it isn't desirable either.
Even if they did lack such talents, it would be considerably better than the
likes of Schildt's books on C. He has that talent in spades, but it doesn't
make him right. His annotations of the ISO C Standard do serve to
demonstrate how style is of very little value indeed when it comes to
In my opinion the Unicode Standard is one of the most readable texts ever
published with the word "Standard" in its title, but it's not a primer.
<http://www.hackcraft.net/xmlUnicode/> *is* a primer and if you know XML you
may find it to be of some value, or you may not since my own skills there
are admittedly quite modest, but even a "perfect" primer won't be what you
need when you are looking for the precise details. The joelonsoftware site
has one too, IMHO it's not the best for accuracy, but as a primer it may