RE: Implementing NFC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Mar 15 2007 - 22:42:29 CST

  • Next message: Hans Aberg: "Re: Arabic numbers"

    Note that for NFD (and NFKD) it's not enough to decompose the characters
    recursively; you also need to reorder them according to their combining
    class (except for those whose combining class is 0 that are limiting the
    span of characters that can be reordered). You also have to decompose the
    Hangul syllables algorithmically (they are absent from the UCD tables)

    For NFC (and NFKC), it's also not enough to only recompose the pairs,
    because some decomposable characters followed by a combining characters will
    have to be recomposed in another way, by permuting the combining characters
    if they have different non-zero combining classes, and then recompose the
    first one.

    Note also that when composing pairs, the NFC or NFKC forms will sometime
    have to compose non-adjacent characters, where there may be an intermediate
    combining character with a non-zero combining class between the base
    character and the combining character selected for combination: this occurs
    when there's no composition possible with the first combining character, but
    the composition is possible with the next one, provided that it has a
    *different* non-zero combining class.

    Suppose that the combining classes of characters are like this in a string
    in NFD form:
            [0, 220, 230, 0]
    If composition is not possible with the first two characters, it may be
    possible with the (0, 230) pair and the result will be the string whose
    character combining classes will be [0, 220, 0] where the first zero results
    of the composition of the first and third character. This still preserves
    the canonical equivalence because this composed string will still be
    reordered when decomposed...

    There are also a few combining characters that are precomposed pairs but
    must be decomposed first because one of them will recompose with a prior
    base character. This occurs for example with the Greek character block)

    When computing NFC and NFKC, you still need to compute the reordering of
    combining characters even if there's no way to assemble them in pairs,
    because not all of them are composable in pairs (or triples for algorithmic
    Hangul syllables but these do not require reordering).

    Finally when composing characters, be warned that some precomposed
    characters are excluded from recomposition (so they must be decomposed in
    NFD or NFKD, but not be recomposed in NFC or NFKC; these characters are then
    necessarily absent from strings in all four normalized forms, and they are
    present in Unicode only for round-trip compatibility with past encodings, or
    sometimes excluded of recompositions for stability of the normalized forms
    across Unicode versions). You may eventually compose them without breaking
    the canonical equivalence, but the resulting string will not be in the
    stable normalized form (this means that a string in NFC form is not
    necessarily the shorted one within the set of canonically equivalent
    strings: NFC is not a compression algorithm).

    > -----Message d'origine-----
    > De : unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] De la
    > part de Daniel Ehrenberg
    > Envoyé : vendredi 16 mars 2007 03:49
    > À : unicode@unicode.org
    > Objet : Implementing NFC
    >
    > Hi,
    > I'm working on adding Unicode support (possibly eventually conformace)
    > to an obscure programming language called Factor, which is sort of a
    > cross between Forth and Lisp (see factorcode.org for more
    > information). One thing that I'm doing is that all strings will always
    > be kept in Normalization Form D (as defined in UAX #15: Normalization
    > Forms) for processing. That way all canonically equivalent strings
    > return true when tested for equality. It wasn't difficult to implement
    > NFD (or NFKD); I just needed to read the transformations from
    > UnicodeData.txt and apply them recursively to get a hash table of
    > characters to canonical/compatability-decomposed strings. But for most
    > I/O purposes, I need to use NFC, re-composing all decomposed
    > characters. I have no idea how to do this efficiently. In many cases,
    > it's more complicated than just turning two adjacent characters into
    > one character.
    >
    > I looked at both the Glib source (which defines basic unicode
    > operations) and the Normalizer demo that UAX 15 links to (which, btw,
    > only works properly for the BMP, which is bad). They both appear to
    > use generally the same strategy: perform as many pairwise compositions
    > on adjacent characters as possible. I wonder if I'm reading it wrong,
    > because if that's how it operates, then one of the examples in the UAX
    > wouldn't work properly: NFC(U+017F U+0323 U+0307) = U+1E9B U+0323.
    > This composes two non-adjacent characters. Is there any efficient way
    > to do this composition without messing up canonical ordering while
    > making sure to compose non-adjacent characters like this? It's an edge
    > case, I know, but I want my implementation to be correct.
    >
    > In many places, the Unicode standard provides clues for
    > implementation, but I see none for NFC (or NFKC) and how to compose
    > characters. Can anyone help me?



    This archive was generated by hypermail 2.1.5 : Thu Mar 15 2007 - 22:46:00 CST