Re: Implementing NFC

From: Daniel Ehrenberg (microdan@gmail.com)
Date: Fri Mar 16 2007 - 11:30:47 CST

    Thanks for your explanation. That doesn't sound too difficult to
    implement (though not too much fun either). But judging from the looks
    of it, I don't think the Normalizer demo actually implements this. Am
    I reading what you said correctly, or would this involve repeatedly
    iterating through groups of accent marks until no more combinations
    can be done?

    I'm just wondering, are there any other programming languages that
    handle Unicode by storing strings in a consistently normalized form? I
    made the decision to do that so that all programs could follow the
    Unicode conformance clause that no conformant process should presume
    that two canonically equivalent strings are distinct. Are there any
    other ways of assuring this?

    Daniel Ehrenberg

    On 3/16/07, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
    > Note that for NFD (and NFKD) it's not enough to decompose the characters
    > recursively; you also need to reorder them according to their combining
    > class (characters whose combining class is 0 are never moved, and they
    > delimit the spans of characters that can be reordered). You also have to
    > decompose the Hangul syllables algorithmically (their decompositions are
    > absent from the UCD tables).
    >
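The NFD steps described above can be sketched in Python. This is only a sketch under assumptions: it takes its UCD data from the standard `unicodedata` module, the Hangul constants are the ones from the Unicode Standard's Hangul syllable arithmetic, and the function names (`decompose_hangul`, `decompose_char`, `canonical_order`, `nfd_sketch`) are made up for illustration.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable arithmetic.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT, S_COUNT = 21, 28, 11172


def decompose_hangul(ch):
    """Decompose a precomposed Hangul syllable into its jamo, else return ch."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    lead = L_BASE + s_index // (V_COUNT * T_COUNT)
    vowel = V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    return chr(lead) + chr(vowel) + (chr(trail) if trail != T_BASE else "")


def decompose_char(ch):
    """Apply canonical decomposition mappings recursively to one character."""
    hangul = decompose_hangul(ch)
    if hangul != ch:
        return hangul
    mapping = unicodedata.decomposition(ch)
    # An empty mapping means no decomposition; "<tag> ..." is compatibility-only.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(decompose_char(chr(int(cp, 16))) for cp in mapping.split())


def canonical_order(text):
    """Stable-sort each run of non-starters by combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run and is never moved itself.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


def nfd_sketch(text):
    """Decompose every character, then put combining marks in canonical order."""
    return canonical_order("".join(decompose_char(ch) for ch in text))
```

A quick sanity check is to compare `nfd_sketch(s)` with `unicodedata.normalize("NFD", s)` for strings such as `"q\u0307\u0323"` (marks out of order) or `"\ud55c"` (a Hangul syllable).
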
    > For NFC (and NFKC), it's also not enough to only recompose the pairs,
    > because a decomposable character followed by a combining character may
    > have to be recomposed differently: first decompose it, then reorder the
    > combining characters if they have different non-zero combining classes,
    > and only then recompose the base with the first one.
    >
    > Note also that when composing pairs, the NFC or NFKC forms will sometimes
    > have to compose non-adjacent characters, where there may be an intermediate
    > combining character with a non-zero combining class between the base
    > character and the combining character selected for combination: this occurs
    > when there's no composition possible with the first combining character, but
    > the composition is possible with the next one, provided that the
    > intermediate character has a *lower* non-zero combining class (an
    > intermediate character of the same class blocks the composition).
    >
    > Suppose that the combining classes of characters are like this in a string
    > in NFD form:
    > [0, 220, 230, 0]
    > If composition is not possible with the first two characters, it may be
    > possible with the (0, 230) pair, and the result will be a string whose
    > combining classes are [0, 220, 0], where the first zero results from
    > the composition of the first and third characters. This still preserves
    > canonical equivalence because the composed string decomposes and
    > reorders back to the same NFD form.
    >
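The [0, 220, 230, 0] case can be tried out with a small composition pass in Python. This is a sketch, not a full NFC implementation: it assumes the input is already in NFD, it leaves out algorithmic Hangul composition, and it builds its pair table by scanning the data in the standard `unicodedata` module. The names `PAIRS` and `compose_sketch` are hypothetical.

```python
import sys
import unicodedata

# Pair table: (first, second) -> composite, for every character that has a
# two-part canonical decomposition and is itself stable under NFC (which
# filters out characters excluded from composition).
PAIRS = {}
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        continue
    parts = mapping.split()
    if len(parts) == 2 and unicodedata.normalize("NFC", ch) == ch:
        PAIRS[(chr(int(parts[0], 16)), chr(int(parts[1], 16)))] = ch


def compose_sketch(nfd_text):
    """Single left-to-right composition pass over text already in NFD."""
    out = []
    starter = None  # index in `out` of the most recent starter
    last_ccc = 0    # class of the last character kept after that starter
    for ch in nfd_text:
        ccc = unicodedata.combining(ch)
        if starter is not None:
            composite = PAIRS.get((out[starter], ch))
            # ch is blocked if a kept character after the starter has
            # class 0 or a class greater than or equal to ch's class.
            if composite is not None and (last_ccc == 0 or last_ccc < ccc):
                out[starter] = composite
                continue
        if ccc == 0:
            starter = len(out)
        last_ccc = ccc
        out.append(ch)
    return "".join(out)


# a + COMBINING TILDE BELOW (220) + COMBINING ACUTE ACCENT (230) + b
sample = "a\u0330\u0301b"
composed = compose_sketch(sample)
print([unicodedata.combining(c) for c in sample])
print([unicodedata.combining(c) for c in composed])
```

The acute accent combines with the "a" across the tilde below, and only one pass over each run of marks is needed. With `"a\u0305\u0301"` (both marks of class 230) the acute accent is blocked and nothing composes.
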
    > There are also a few combining characters that are precomposed pairs but
    > must be decomposed first because one of them will recompose with a prior
    > base character. (This occurs, for example, in the Greek character block.)
    >
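One character that appears to fit this description is U+0344 COMBINING GREEK DIALYTIKA TONOS, which decomposes into U+0308 followed by U+0301. The short Python check below uses the standard `unicodedata` module; the choice of U+0344 and iota as the example, and the helper name `code_points`, are my own assumptions rather than something stated in the quoted message.

```python
import unicodedata


def code_points(text):
    """Render a string as a list of U+XXXX labels for easy inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]


dialytika_tonos = "\u0344"  # COMBINING GREEK DIALYTIKA TONOS
iota = "\u03b9"             # GREEK SMALL LETTER IOTA

# On its own, the precomposed mark stays decomposed even in NFC.
alone_nfc = unicodedata.normalize("NFC", dialytika_tonos)

# After a base letter, its two halves recompose with the base one by one.
nfd = unicodedata.normalize("NFD", iota + dialytika_tonos)
nfc = unicodedata.normalize("NFC", iota + dialytika_tonos)

print(code_points(alone_nfc))
print(code_points(nfd))
print(code_points(nfc))
```
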
    > When computing NFC and NFKC, you still need to compute the reordering of
    > combining characters even if there's no way to assemble them in pairs,
    > because not all of them are composable in pairs (or triples for algorithmic
    > Hangul syllables but these do not require reordering).
    >
    > Finally, when composing characters, be warned that some precomposed
    > characters are excluded from recomposition: they must be decomposed in
    > NFD or NFKD but not recomposed in NFC or NFKC. These characters are
    > therefore absent from strings in all four normalized forms. They are
    > present in Unicode only for round-trip compatibility with past encodings,
    > or are excluded from recomposition for stability of the normalized forms
    > across Unicode versions. You could compose them without breaking
    > canonical equivalence, but the resulting string would not be in the
    > stable normalized form. This means that a string in NFC form is not
    > necessarily the shortest one within the set of canonically equivalent
    > strings: NFC is not a compression algorithm.
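
The point about excluded characters can be checked against the data that ships with Python. The helper below is a hypothetical sketch: it treats a character as excluded from composition when it has a canonical decomposition but NFC does not give the character back. U+0958 DEVANAGARI LETTER QA is used as the example.

```python
import unicodedata


def is_excluded_from_composition(ch):
    """True if ch has a canonical decomposition that NFC never recomposes."""
    mapping = unicodedata.decomposition(ch)
    has_canonical = bool(mapping) and not mapping.startswith("<")
    return has_canonical and unicodedata.normalize("NFC", ch) != ch


qa = "\u0958"        # DEVANAGARI LETTER QA, decomposes to U+0915 U+093C
e_acute = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, a normal composite

qa_nfc = unicodedata.normalize("NFC", qa)

print(is_excluded_from_composition(qa), is_excluded_from_composition(e_acute))
# The NFC form is longer than the canonically equivalent single character.
print(len(qa), len(qa_nfc))
```

Here the one-character string and its two-character NFC form are canonically equivalent, yet NFC picks the longer one.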


