Re: Normalization question

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Tue Dec 25 2007 - 06:45:57 CST

    Benjamin M Scarborough wrote:

    > [...] I'm unclear as to whether the NFC form would be
    > <U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW, U+0328
    > COMBINING OGONEK> (which is the shortest form) or <U+0104 LATIN
    > CAPITAL LETTER A WITH OGONEK, U+0323 COMBINING DOT BELOW, U+0302
    > COMBINING CIRCUMFLEX ACCENT>.

    The latter. In the canonical decomposition phase, the nonspacing marks
    are reordered into a fixed order according to the ccc (Canonical
    Combining Class) property: ogonek (ccc 202), dot below (ccc 220),
    circumflex (ccc 230). (In this case, this happens to coincide with
    their original order in the data.) Then, in the canonical composition
    phase, characters are combined starting from a starter character like
    "A", trying the combining marks in order and beginning with the _next_
    one. Here the ogonek gets combined with "A" to give U+0104, and after
    this, no further compositions are possible, since there is no
    precomposed character for U+0104 with a dot below or a circumflex.
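
    As a rough illustration of the two phases, here is a small sketch
    using Python's standard unicodedata module. The scrambled input
    order and the helper name u_notation are my own choices for the
    example, not anything from the question; the results depend on the
    Unicode data shipped with your Python version, though these
    particular characters have been stable for a long time.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT, ccc 230
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, ccc 220
OGONEK = "\u0328"      # COMBINING OGONEK, ccc 202

def u_notation(s):
    """Hypothetical helper: render a string as U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Decomposition phase: marks given in a scrambled order (an assumption
# made for illustration) are sorted by canonical combining class.
scrambled = "A" + CIRCUMFLEX + DOT_BELOW + OGONEK
nfd = unicodedata.normalize("NFD", scrambled)
ccc = [unicodedata.combining(c) for c in nfd]
print(u_notation(nfd))  # U+0041 U+0328 U+0323 U+0302
print(ccc)              # [0, 202, 220, 230]

# Composition phase: the starter combines with the ogonek ...
a_ogonek = unicodedata.normalize("NFC", "A" + OGONEK)
print(u_notation(a_ogonek))  # U+0104

# ... but U+0104 has no precomposed form with either remaining mark,
# so these two-character sequences stay as they are under NFC.
with_dot = unicodedata.normalize("NFC", a_ogonek + DOT_BELOW)
with_circ = unicodedata.normalize("NFC", a_ogonek + CIRCUMFLEX)
print(u_notation(with_dot))   # U+0104 U+0323
print(u_notation(with_circ))  # U+0104 U+0302
```

    The last two lines are the reason the shorter-looking sequence
    starting with U+1EAC cannot be the NFC form: once the ogonek has
    been absorbed into the starter, nothing else can be.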

    One way to check things quickly is to use the BabelPad editor, which
    lets you input character data in different ways. You can then select
    a string and use the Convert command, first to convert the string to
    NFC and then to convert the characters to U+nnnnn notation.
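
    If no such editor is at hand, the same kind of check can be
    scripted. The following is only a sketch using Python's standard
    unicodedata module; the function name nfc_codepoints is made up for
    this example.

```python
import unicodedata

def nfc_codepoints(text):
    """Normalize text to NFC and list its code points in U+ notation."""
    normalized = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in normalized]

# The two candidate sequences from the question.
shortest = "\u1EAC\u0328"        # A with circumflex and dot below, ogonek
latter = "\u0104\u0323\u0302"    # A with ogonek, dot below, circumflex

print(nfc_codepoints(shortest))  # ['U+0104', 'U+0323', 'U+0302']
print(nfc_codepoints(latter))    # ['U+0104', 'U+0323', 'U+0302']
```

    Both inputs come out as the same three code points, so the second
    sequence is already in NFC and the first one is not.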

    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/


