Re: Understanding normalisation

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon May 29 2006 - 14:16:55 CDT

    Theodore H. Smith wrote on Sunday, May 28, 2006 at 3:18 PM

    > I'm wondering, what limitations would it have for being useful for doing
    > decomposition? And for doing composition?

    The limitation of what you have at present is that it may not work for
    letters that carry two or more combining marks.

    > Is it true, that if I perform a proper combining character reordering (As
    > described by UTR15) upon some Unicode text, and then did my "parallel
    > string replacement based composer" upon the text, that I'd generate
    > correct NFC?

    No. Consider <U+006D LATIN SMALL LETTER M, U+0325 COMBINING RING BELOW,
    U+0301 COMBINING ACUTE ACCENT>, which is in NFD. (It is the last letter in
    one spelling of the reconstructed Proto-Indo-European word for 'seven',
    septḿ̥.) The canonically equivalent NFC form is <U+1E3F LATIN SMALL LETTER
    M WITH ACUTE, U+0325 COMBINING RING BELOW>. This happens because there is
    no precomposed 'LATIN SMALL LETTER M WITH RING BELOW', so U+0325 has
    nothing to fuse with. On the other hand, the more traditional spelling,
    septṃ́, ends in what is expressed in NFD as
    <U+006D LATIN SMALL LETTER M, U+0323 COMBINING DOT BELOW, U+0301 COMBINING
    ACUTE ACCENT>. The canonically equivalent NFC form is <U+1E43 LATIN SMALL
    LETTER M WITH DOT BELOW, U+0301 COMBINING ACUTE ACCENT>.
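
    Both results can be checked against an existing normaliser. The following
    is a minimal sketch using the unicodedata module from Python's standard
    library; the helper name code_points is only illustrative.

```python
import unicodedata

def code_points(text):
    """List the code points of a string in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# The two NFD sequences discussed above.
ring_nfd = "\u006D\u0325\u0301"  # m, ring below, acute
dot_nfd = "\u006D\u0323\u0301"   # m, dot below, acute

# Both inputs are already in NFD, so NFD leaves them unchanged.
ring_is_nfd = unicodedata.normalize("NFD", ring_nfd) == ring_nfd
dot_is_nfd = unicodedata.normalize("NFD", dot_nfd) == dot_nfd

# NFC fuses a different mark with the base letter in each case.
ring_nfc = code_points(unicodedata.normalize("NFC", ring_nfd))
dot_nfc = code_points(unicodedata.normalize("NFC", dot_nfd))

print(ring_nfc)
print(dot_nfc)
```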

    The complication in forming NFC is choosing which following character of
    non-zero combining class to combine with. The candidates are all those
    following characters that could come immediately after what has been
    combined so far in some canonically equivalent sequence. Which candidate
    is fused is in a sense arbitrary, but so as to have a *canonical* form,
    the one that comes first in NFD order is chosen.
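
    That selection rule can be sketched for a single base letter followed by
    its combining marks. This is a simplified illustration, not a full
    implementation of UAX #15: it assumes the input is one starter plus marks
    of non-zero combining class, already in NFD order. The names try_compose
    and compose_run are hypothetical, and try_compose borrows Python's own
    normaliser as a stand-in for a real composition table.

```python
import unicodedata

def try_compose(base, mark):
    """Return the single character that base+mark fuses into, or None.

    Hypothetical helper: it asks the library normaliser instead of
    consulting a composition table directly.
    """
    fused = unicodedata.normalize("NFC", base + mark)
    return fused if len(fused) == 1 else None

def compose_run(nfd_run):
    """Compose one starter with its following marks, given in NFD order."""
    starter, marks = nfd_run[0], nfd_run[1:]
    kept = []  # marks that could not be fused, in their original order
    for mark in marks:
        ccc = unicodedata.combining(mark)
        # A mark cannot reach the starter if an unfused mark of the same or
        # a higher combining class already stands between them.
        blocked = bool(kept) and unicodedata.combining(kept[-1]) >= ccc
        fused = None if blocked else try_compose(starter, mark)
        if fused is not None:
            starter = fused   # keep building on the new composite
        else:
            kept.append(mark)
    return starter + "".join(kept)

print([hex(ord(c)) for c in compose_run("\u006D\u0325\u0301")])
print([hex(ord(c)) for c in compose_run("\u006D\u0323\u0301")])
```

    Because the marks are visited in NFD order, the first mark that can fuse
    with the base is the one that gets fused, which is the choice described
    above.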

    Thus, for the first sequence above, <U+006D, U+0325, U+0301> and <U+006D,
    U+0301, U+0325> are equivalent. However, of the two marks only U+0301
    combines with U+006D, so these fuse to yield <U+1E3F, U+0325>. Further
    combination is not possible.

    For the second sequence, <U+006D, U+0323, U+0301> and <U+006D, U+0301,
    U+0323> are equivalent. U+006D could combine with either of the following
    combining characters. U+0323 is of combining class 220 and U+0301 is of
    combining class 230, so for definiteness U+006D is combined with U+0323,
    yielding <U+1E43, U+0301>. Further combination is not possible.
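
    The combining classes, and the equivalence of the two orderings, can be
    confirmed the same way. Again a minimal sketch with Python's unicodedata
    module; the variable names are only illustrative.

```python
import unicodedata

# Canonical combining classes of the three marks discussed above.
classes = {
    f"U+{ord(mark):04X}": unicodedata.combining(mark)
    for mark in ("\u0323", "\u0325", "\u0301")
}

# Marks of different combining classes may be exchanged without breaking
# canonical equivalence, so both orders share one NFD and one NFC form.
dot_first = "\u006D\u0323\u0301"    # m, dot below, acute
acute_first = "\u006D\u0301\u0323"  # m, acute, dot below

same_nfd = (unicodedata.normalize("NFD", dot_first)
            == unicodedata.normalize("NFD", acute_first))
nfc_of_acute_first = unicodedata.normalize("NFC", acute_first)

print(classes)
print(same_nfd, [f"U+{ord(c):04X}" for c in nfc_of_acute_first])
```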

    Richard.
