RE: Decomposed vs Composed accented characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 11 2006 - 18:46:58 CST


    > Please help clarify my understanding on the concept of normalization form.
    >
    > Is it true that every character represented as binary data
                                                     ^^^^^^^^^^^
                                                     plain text
                                                     
    Unicode is a plain text standard, not a standard of "binary data".
    A UTF-8 string is a byte encoding of a sequence of Unicode characters,
    and normalization is defined on that sequence of characters.
                                                    
    > is said to be in a normalization form (NFC, NFKC, NFD, or NFKD)?

    False.

    A Unicode string may not be in *any* normalization form at all.
      
    >
    > If so, please indicate if each of the following statements is true or false.
    > 1. The character é represented as C3 A9 in UTF-8 is in NFC form.

    True.

    > 2. The same character é represented as 65 CC 81 (e + acute accent)
    > in UTF-8 is in NFD form.

    True.
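
    As a quick check of statements 1 and 2, here is a minimal sketch
    using Python's standard unicodedata module. The helper name forms()
    is made up for this illustration; it simply reports which of the
    four normalization forms a string is already in.

```python
import unicodedata

# The two byte sequences from the question, decoded from UTF-8.
composed = bytes.fromhex("C3A9").decode("utf-8")      # <00E9>
decomposed = bytes.fromhex("65CC81").decode("utf-8")  # <0065, 0301>

def forms(s):
    """Return the set of normalization forms that s is already in
    (hypothetical helper, not part of any library)."""
    return {f for f in ("NFC", "NFD", "NFKC", "NFKD")
            if unicodedata.normalize(f, s) == s}

print(sorted(forms(composed)))
print(sorted(forms(decomposed)))
```

    Note that a string can be in more than one form at once: the
    composed string is in both NFC and NFKC, and the decomposed one
    is in both NFD and NFKD.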

    > 3. Normalization is a process that transforms a character
    > representation from one normalization form to another.

    Not quite. Normalization converts a Unicode string (which *might*
    be in a normalization form already) into a particular normalization
    form.

    Example of a Unicode string which is not in any normalization
    form:

    <0041, 0301, 0328> (i.e. <A, acute, ogonek>)

    The NFD (and NFKD) form of that string is:

    <0041, 0328, 0301> (i.e. <A, ogonek, acute>)

    The NFC (and NFKC) form of that string is:

    <0104, 0301> (i.e. <A-ogonek, acute>)
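
    The example above can be reproduced with a short sketch, again
    assuming Python's unicodedata module. The helper code_points() is
    only for display and is not a library function. The input is not
    in NFD because the acute (combining class 230) precedes the ogonek
    (combining class 202), which violates canonical ordering.

```python
import unicodedata

s = "\u0041\u0301\u0328"  # <A, acute, ogonek>

def code_points(text):
    """Format a string as a list of 4-digit hex code points
    (hypothetical display helper)."""
    return [f"{ord(c):04X}" for c in text]

nfd = unicodedata.normalize("NFD", s)
nfc = unicodedata.normalize("NFC", s)

# Canonical combining classes explain the reordering in NFD.
print(unicodedata.combining("\u0301"), unicodedata.combining("\u0328"))
print(code_points(nfd))
print(code_points(nfc))

# The original string matches none of the four normalized results.
in_any_form = any(unicodedata.normalize(f, s) == s
                  for f in ("NFC", "NFD", "NFKC", "NFKD"))
print(in_any_form)
```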

    I've expressed those strings in UTF-16, to make it easier to
    look up the characters, if you want -- but converting them
    all to UTF-8 doesn't change anything about their normalization
    status.
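
    To illustrate that last point, here is a small sketch (assuming
    Python's built-in codecs; the function name is invented for this
    example) that encodes the NFC string in several encoding forms and
    confirms the decoded character sequence, and therefore its
    normalization status, is unchanged.

```python
import unicodedata

nfc = "\u0104\u0301"  # <A-ogonek, acute>, already in NFC

def still_nfc_after_round_trip(text, encoding):
    """Encode and decode text, then check the result is the same
    character sequence and is still in NFC (hypothetical helper)."""
    decoded = text.encode(encoding).decode(encoding)
    return decoded == text and unicodedata.normalize("NFC", decoded) == decoded

# The bytes differ by encoding, but the characters do not.
print(nfc.encode("utf-8").hex())
print(nfc.encode("utf-16-be").hex())
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, still_nfc_after_round_trip(nfc, enc))
```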

    --Ken


