From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 11 2006 - 18:46:58 CST
> Please help clarify my understanding on the concept of normalization form.
>
> Is it true that every character represented as binary data
                                                 ^^^^^^^^^^^
                                                 plain text
Unicode is a plain text standard, not a standard of "binary data".
A UTF-8 string is just one encoding of a sequence of Unicode characters.
> is said to be in a normalization form (NFC, NFKC, NFD, or NFKD)?
False.
A Unicode string may also not be in *any* normalization form at all.
>
> If so, please indicate if each of the following statements is true or false.
> 1. The character é represented as C3 A9 in UTF-8 is in NFC form.
True.
> 2. The same character é represented as 65 CC 81 (e + acute accent)
> in UTF-8 is in NFD form.
True.
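Statements 1 and 2 can be checked mechanically. The following is a minimal
sketch, assuming Python 3 and its standard unicodedata module; the helper
name forms_of is made up for illustration. It decodes the two UTF-8 byte
sequences from the question and reports which normalization forms each
decoded string is already in, by comparing the string with its own
normalized result.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def forms_of(s):
    """Return the set of normalization forms that s is already in.

    Hypothetical helper: a string is in form F exactly when
    normalizing it to F leaves it unchanged.
    """
    return {f for f in FORMS if unicodedata.normalize(f, s) == s}

# The two byte sequences from the question, decoded from UTF-8.
precomposed = bytes.fromhex("C3A9").decode("utf-8")    # U+00E9
decomposed = bytes.fromhex("65CC81").decode("utf-8")   # U+0065 U+0301

print(sorted(forms_of(precomposed)))
print(sorted(forms_of(decomposed)))
```

The byte sequences only matter for decoding; the normalization check
itself operates on the decoded sequence of characters.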
> 3. Normalization is a process that transforms a character
> representation from one normalization form to another.
Not quite. Normalization converts a Unicode string (which *might*
be in a normalized form already) into a particular normalized form.
Example of a Unicode string which is not in any normalization
form:
<0041, 0301, 0328> (i.e. <A, acute, ogonek>)
The NFD (and NFKD) form of that string is:
<0041, 0328, 0301> (i.e. <A, ogonek, acute>)
The NFC (and NFKC) form of that string is:
<0104, 0301> (i.e. <A-ogonek, acute>)
I've expressed those strings in UTF-16, to make it easier to
look up the characters, if you want, but converting them
all to UTF-8 doesn't change anything about their normalization
status.
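
The <A, acute, ogonek> example can be reproduced with a short script. This
is only a sketch, assuming Python 3 and its standard unicodedata module;
the names s, code_points, and round_trip are chosen for illustration. It
prints the canonical combining class of each character (the ogonek's class
is lower than the acute's, which is why canonical ordering swaps them),
then the result of each normalization form.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# <A, acute, ogonek>: the string that is in no normalization form.
s = "\u0041\u0301\u0328"

def code_points(text):
    """Format a string as a list of 4-digit hex code points."""
    return [f"{ord(c):04X}" for c in text]

# Canonical combining classes: marks must appear in non-decreasing
# class order, so the ogonek has to precede the acute.
ccc = [unicodedata.combining(c) for c in s]
print("combining classes:", ccc)

normalized = {f: unicodedata.normalize(f, s) for f in FORMS}
for f in FORMS:
    print(f, code_points(normalized[f]))

# Forms the original string is already in (normalizing leaves it unchanged).
already_in = [f for f in FORMS if normalized[f] == s]
print("already in:", already_in)

# Encoding to UTF-8 and back yields the same characters, so the
# normalization status does not depend on the encoding form.
round_trip = s.encode("utf-8").decode("utf-8")
print("round trip unchanged:", round_trip == s)
```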
--Ken