Re: Decomposed vs Composed accented characters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Apr 11 2006 - 19:39:11 CST


    From: "Tay, William" <William.Tay@xerox.com>
    > Please help clarify my understanding on the concept of normalization form.
    > Is it true that every character represented as binary data is said to be in a
    > normalization form (NFC, NFKC, NFD, or NFKD)?

    NO. Valid Unicode-encoded text may be in a form that matches none of the four normalization forms.
    However, Unicode defines the conditions under which two texts are "canonically equivalent" (they have the same NFC form, or equivalently the same NFD form) or "compatibility equivalent" (they have the same NFKC form, or equivalently the same NFKD form).
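    As a quick illustration, here is a minimal sketch using Python's standard unicodedata module:

        import unicodedata

        composed = "\u00E9"      # é as one code point (LATIN SMALL LETTER E WITH ACUTE)
        decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

        # The two strings differ code point by code point...
        assert composed != decomposed
        # ...but they are canonically equivalent: their NFD (and NFC) forms match.
        assert unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)

        # Compatibility equivalence is weaker: the fi ligature U+FB01 is not
        # canonically equivalent to "fi", but it is compatibility equivalent.
        assert unicodedata.normalize("NFD", "\uFB01") != unicodedata.normalize("NFD", "fi")
        assert unicodedata.normalize("NFKD", "\uFB01") == unicodedata.normalize("NFKD", "fi")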

    Unicode equivalences are independent of the encoding scheme used (UTF-8, UTF-16, UTF-32, BOCU-1, SCSU, or any other Unicode-compatible encoding scheme that strictly preserves code point identity).

    > If so, please indicate if each of the following statement is true or false.
    > 1. The character é represented as C3 A9 in UTF-8 is in NFC form.

    YES, but only because this stream of bytes (in the UTF-8 encoding scheme or encoding form)
    represents the one-character string <U+00E9>, and that text is in NFC form.

    > 2. The same character é represented as 65 CC 81 (e + acute accent) in UTF-8
    > is in NFD form.

    YES, but only because this stream of bytes (in the UTF-8 encoding scheme or encoding form)
    represents the two-character string <U+0065 ; U+0301>, and that text is in NFD form.
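    Decoding the two byte sequences above shows this directly (a Python sketch; unicodedata.is_normalized requires Python 3.8 or later):

        import unicodedata

        nfc_text = b"\xC3\xA9".decode("utf-8")      # one code point: U+00E9
        nfd_text = b"\x65\xCC\x81".decode("utf-8")  # two code points: U+0065, U+0301

        assert [hex(ord(c)) for c in nfc_text] == ["0xe9"]
        assert [hex(ord(c)) for c in nfd_text] == ["0x65", "0x301"]

        assert unicodedata.is_normalized("NFC", nfc_text)
        assert unicodedata.is_normalized("NFD", nfd_text)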

    > 3. Normalization is a process that transforms a character representation from
    > one normalization form to another.

    No. It transforms text in an *arbitrary* (possibly non-normalized) form into a normalized form. The result of normalizing a valid Unicode text is a (possibly distinct) text which is "equal", "canonically equivalent", or "compatibility equivalent" to the original text, depending on the normalization type.
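    For example (another Python sketch), a string mixing composed and decomposed forms is in no normalization form, yet normalizing it yields a canonically equivalent text:

        import unicodedata

        # A precomposed é followed by a decomposed one: in no normalized form.
        mixed = "\u00E9" + "e\u0301"
        assert not unicodedata.is_normalized("NFC", mixed)
        assert not unicodedata.is_normalized("NFD", mixed)

        # Normalization yields a distinct but canonically equivalent text.
        nfc = unicodedata.normalize("NFC", mixed)
        assert nfc != mixed
        assert unicodedata.normalize("NFD", nfc) == unicodedata.normalize("NFD", mixed)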

    This normalization is performed independently of the encoding form actually used for in-memory access (streams of fixed-size code units) and of the encoding scheme actually used for serialized data (streams of bytes). Normalization operates at the code point level.
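    To illustrate (a sketch): the same text decoded from UTF-8 and from UTF-16 yields the same code point sequence, so normalization gives the same result regardless of the serialization it came from:

        import unicodedata

        from_utf8 = b"\x65\xCC\x81".decode("utf-8")           # e + combining acute
        from_utf16 = b"\x65\x00\x01\x03".decode("utf-16-le")  # same two code points

        assert from_utf8 == from_utf16  # identical code point sequences

        # Normalization sees only code points, never the bytes they came from.
        assert unicodedata.normalize("NFC", from_utf8) == "\u00E9"
        assert unicodedata.normalize("NFC", from_utf16) == "\u00E9"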

    In Unicode, any stream containing arbitrary sequences of *valid* code points assigned to *characters* can be normalized. All these streams (normalized or not) are *valid*, even if some of them may be "defective".

    If the stream of *code points* contains code points assigned to noncharacters, code points assigned to surrogates, or invalid code points (out of range), the stream does not represent valid Unicode *text*. (Note: streams of code *units* may contain surrogates under conditions defined by the encoding form; since they do not represent *characters* in isolation, surrogate code points can be treated exactly like other "noncharacters" or "invalid characters".)
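    A minimal validity check along these lines (a sketch, not a full conformance test; the noncharacter predicate follows the Unicode definition: U+FDD0..U+FDEF plus the last two code points of every plane):

        def is_valid_unicode_text(code_points):
            """Reject surrogates, noncharacters and out-of-range code points."""
            for cp in code_points:
                if not 0 <= cp <= 0x10FFFF:
                    return False                # outside the Unicode code space
                if 0xD800 <= cp <= 0xDFFF:
                    return False                # surrogate code point
                if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
                    return False                # noncharacter
            return True

        assert is_valid_unicode_text([0x0065, 0x0301])  # "e" + combining acute
        assert not is_valid_unicode_text([0xD800])      # lone surrogate
        assert not is_valid_unicode_text([0xFFFE])      # noncharacter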

    Philippe.


