RE: Decomposition vs Full decomposition?

From: Jon Hanna (jon@hackcraft.net)
Date: Tue Mar 15 2005 - 08:05:24 CST


    > I've been struggling to understand what the UCD.html file means, when
    > it talks about decompositions and full decompositions, and combining
    > chars.

    Some of the characters produced by a decomposition can themselves be
    decomposed. When you take a character and recursively decompose it (i.e.
    decompose it, and then decompose the characters produced until there are no
    more decomposables) that is a full decomposition. In practice that recursive
    process can be optimised (a speed/size trade-off) to a single step.
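
    As a rough illustration of that recursive process (a sketch only: Python's
    standard unicodedata module is my own choice of tool, and the helper name
    full_canonical_decomposition is made up for the example), following the
    canonical mappings until nothing more decomposes could look like this:

```python
import unicodedata


def full_canonical_decomposition(ch: str) -> str:
    """Recursively apply canonical decomposition mappings to one character.

    Hypothetical helper for illustration. It ignores Hangul syllables, which
    are decomposed algorithmically rather than through per-character mappings.
    """
    mapping = unicodedata.decomposition(ch)
    # An empty string means "no mapping"; a leading "<tag>" marks a
    # compatibility mapping, which a canonical decomposition must not follow.
    if not mapping or mapping.startswith("<"):
        return ch
    parts = [chr(int(code_point, 16)) for code_point in mapping.split()]
    # Decompose each produced character in turn until nothing is left to do.
    return "".join(full_canonical_decomposition(part) for part in parts)


# U+01D5 maps to U+00DC U+0304, and U+00DC in turn maps to U+0055 U+0308.
result = full_canonical_decomposition("\u01d5")
```

    Storing the final result for each character in a table up front is one way
    to get the single-step version, at the cost of a larger table.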

    > It's quite wordy, and uses ambiguous terms (to the outsider).

    They are unambiguous, but yes, you do need to be familiar with the
    terminology. This is a frequent complaint about computer standards, though
    having been sailing in the past I have to say computer standards are far
    from the most confusing area in this regard.

    > 1) Is "full decomposition" the same as "normalisation"?
    >
    > 2) Is normalisation differing from decomposition, by the ordering of
    > combining chars? I.e., if I had a "double dots above the letter" (like
    > ü), and "Squiggly thing below the letter" (like Ç), both on the same
    > letter, then I suppose they should take on a certain
    > ordering, correct?
    > One of the combiners should always come before the other one.
    >
    > Is that what makes normalisation differ from decomposition?

    You are close here. There are four normalisation forms. Two of them
    perform a full decomposition and then re-order the combining marks into
    canonical order. Of these, Normalisation Form D (NFD) applies only
    canonical decompositions, while NFKD applies compatibility decompositions
    as well. The result is that NFD produces text that, to a fully compliant
    Unicode text process, is identical in meaning to the original. NFKD loses
    some semantics, but this degree of "fuzziness" is useful in some
    applications.

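    To make the re-ordering and the NFD/NFKD difference concrete, here is a
    small sketch using Python's unicodedata module. The tool and the sample
    characters are my own picks for illustration: a "c" carrying both a
    combining diaeresis and a combining cedilla, and the "fi" ligature U+FB01,
    which has only a compatibility decomposition.

```python
import unicodedata

DIAERESIS = "\u0308"  # canonical combining class 230 (above)
CEDILLA = "\u0327"    # canonical combining class 202 (attached below)

# The same two marks on the same base letter, typed in either order.
marks_above_first = "c" + DIAERESIS + CEDILLA
marks_below_first = "c" + CEDILLA + DIAERESIS

# NFD sorts adjacent combining marks by combining class, so both spellings
# end up as the same sequence (the lower class, the cedilla, comes first).
nfd_a = unicodedata.normalize("NFD", marks_above_first)
nfd_b = unicodedata.normalize("NFD", marks_below_first)

# NFD leaves the ligature alone; NFKD replaces it with plain "f" and "i".
ligature = "\ufb01"
nfd_ligature = unicodedata.normalize("NFD", ligature)
nfkd_ligature = unicodedata.normalize("NFKD", ligature)
```

    Marks with the same combining class are not re-ordered relative to each
    other, since their order there can be significant.
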
    The other two forms, NFC and NFKC, begin by performing the operations of
    NFD and NFKD respectively and then re-combine some characters, so the
    resulting text generally has fewer code points. There are other important
    advantages too: for example, text in many legacy character sets, including
    some significant to other technologies, will go through NFC unchanged.
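
    A quick sketch of the composed forms, again with Python's unicodedata
    module (my assumption as to tooling; ISO 8859-1 is simply the legacy
    character set I picked as an example):

```python
import unicodedata

# "e" followed by a combining acute accent: two code points.
decomposed = "e\u0301"

# NFC recombines the pair into the single precomposed character U+00E9.
nfc = unicodedata.normalize("NFC", decomposed)

# NFKC also applies compatibility mappings, so the "fi" ligature is lost.
nfkc_ligature = unicodedata.normalize("NFKC", "\ufb01")

# Text converted from ISO 8859-1 (bytes 0xA0 to 0xFF here) is already in NFC.
latin1_text = bytes(range(0xA0, 0x100)).decode("latin-1")
latin1_unchanged = unicodedata.normalize("NFC", latin1_text) == latin1_text
```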

    See <http://www.unicode.org/reports/tr15/> for more.

    > 3) I remember someone mentioning some special cases for
    > normalisation,
    > that aren't included in UnicodeData.txt. But I don't remember what
    > these special cases were. Where do I read about them?

    Exclusions listed at
    <http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt> are not
    recombined in NFC and NFKC. Again, see
    <http://www.unicode.org/reports/tr15/> for more.

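    As an illustration of an exclusion in practice (a sketch with Python's
    unicodedata module; the Devanagari letter U+0958 is my own choice of
    example from that file), the character is decomposed on the way into NFC
    and then left that way:

```python
import unicodedata

# U+0958 DEVANAGARI LETTER QA has the canonical decomposition U+0915 U+093C
# and is listed in CompositionExclusions.txt.
excluded = "\u0958"

nfd = unicodedata.normalize("NFD", excluded)
nfc = unicodedata.normalize("NFC", excluded)

# Compare with U+00E9, which is not excluded and so survives NFC intact.
not_excluded_nfc = unicodedata.normalize("NFC", "\u00e9")
```

    So for excluded characters NFC gives the same result as NFD, and the
    precomposed form never appears in normalised text.
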
    > The writers of technical standards rarely seem to have the talents that a
    > writer of books like "C in Plain English" has, or the O'Reilly in a
    > nutshell books.

    The writers of technical standards frequently have such talents, and
    frequently are also the authors of such books. However, the degree of
    hand-holding, and of turning a blind eye to relatively minor technical
    points, that such books involve is as unnecessary in a standard as it is
    for you to insult the writers over it. As a rule it isn't desirable either.

    Even if they did lack such talents, it would be considerably better than the
    likes of Schildt's books on C. He has that talent in spades, but it doesn't
    make him right. His annotations of the ISO C Standard do serve to
    demonstrate how style is of very little value indeed when it comes to
    standards.

    In my opinion the Unicode Standard is one of the most readable texts ever
    published with the word "Standard" in its title, but it's not a primer.

    <http://www.hackcraft.net/xmlUnicode/> *is* a primer and if you know XML you
    may find it to be of some value, or you may not, since my own skills there
    are admittedly quite modest. Even a "perfect" primer, though, won't be what
    you need when you are looking for the precise details. The joelonsoftware
    site has one too; IMHO it's not the best for accuracy, but as a primer it
    may benefit you.

    Regards,
    Jon Hanna
    Work: <http://www.selkieweb.com/>
    Play: <http://www.hackcraft.net/>
    Chat: <irc://irc.freenode.net/selkie>


