RE: Decomposed vs Composed accented characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Apr 11 2006 - 18:46:58 CST

Next message: N. Ganesan: ""markers" codepoints for some combining letter sets in Dravidian scripts"

Previous message: Kenneth Whistler: "Re: Decomposed vs Composed accented characters"
Maybe in reply to: Tay, William: "Decomposed vs Composed accented characters"
Next in thread: Keutgen, Walter: "RE: Decomposed vs Composed accented characters"

> Please help clarify my understanding on the concept of normalization form.
>
> Is it true that every character represented as binary data
                                                 ^^^^^^^^^^^
                                                 plain text

Unicode is a standard for plain text, not for "binary data".
A UTF-8 string is just an encoded sequence of characters.

> is said to be in a normalization form (NFC, NFKC, NFD, or NFKD)?

False.

A Unicode string may also be in none of the normalization forms.

>
> If so, please indicate if each of the following statement is true or false.
> 1. The character é represented as C3 A9 in UTF-8 is in NFC form.

True.

> 2. The same character é represented as 65 CC 81 (e + acute accent)
> in UTF-8 is in NFD form.

True.
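
For anyone who wants to check statements 1 and 2 directly, here is a
minimal sketch using Python's standard unicodedata module. The variable
names are only illustrative; the byte sequences are the ones from the
question.

```python
import unicodedata

# Decode the two UTF-8 byte sequences from the question.
composed = bytes.fromhex("C3A9").decode("utf-8")      # U+00E9
decomposed = bytes.fromhex("65CC81").decode("utf-8")  # U+0065 U+0301

# A string is in a given form if normalizing it to that form is a no-op.
composed_is_nfc = unicodedata.normalize("NFC", composed) == composed
decomposed_is_nfd = unicodedata.normalize("NFD", decomposed) == decomposed

# Normalizing one representation yields the other.
nfd_of_composed = unicodedata.normalize("NFD", composed)
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)
```

Comparing a string against its own normalized copy is the simplest
portable test; it does not rely on any newer convenience function.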

> 3. Normalization is a process that transforms a character
> representation from one normalization form to another.

Not quite. Normalization converts a Unicode string (which *might*
already be in a normalization form) into a particular normalization form.

Example of a Unicode string which is not in any normalization
form:

<0041, 0301, 0328> (i.e. <A, acute, ogonek>)

The NFD (and NFKD) form of that string is:

<0041, 0328, 0301> (i.e. <A, ogonek, acute>)

The NFC (and NFKC) form of that string is:

<0104, 0301> (i.e. <A-ogonek, acute>)

I've expressed those strings in UTF-16 (for these characters the
code units are the same as the code point values), to make it easier
to look up the characters, if you want -- but converting them
all to UTF-8 doesn't change anything about their normalization
status.
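
The example can be reproduced with a short Python sketch, again assuming
the standard unicodedata module. The helper code_points() and the
variable names are hypothetical, introduced only for this illustration.

```python
import unicodedata

s = "\u0041\u0301\u0328"  # <A, acute, ogonek>

def code_points(text):
    """Return the code points of text as 4-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]

# Canonical combining classes: acute is 230, ogonek is 202, so canonical
# ordering places the ogonek before the acute.
classes = [unicodedata.combining(ch) for ch in s]

# Normalize to each of the four forms.
forms = {name: unicodedata.normalize(name, s)
         for name in ("NFC", "NFD", "NFKC", "NFKD")}

# The original string matches none of its normalized forms.
in_any_form = any(s == normalized for normalized in forms.values())

# Encoding to UTF-8 and back leaves the character sequence unchanged,
# so the normalization status is unchanged as well.
survives_utf8 = s.encode("utf-8").decode("utf-8") == s

for name, normalized in forms.items():
    print(name, code_points(normalized))
```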

--Ken


This archive was generated by hypermail 2.1.5 : Tue Apr 11 2006 - 18:49:42 CST