RE: Question about normalization tests from Whistler, Ken on 2012-12-10 (Unicode Mail List Archive)

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Mon, 10 Dec 2012 20:57:36 +0000

Your misunderstanding is at the highlighted statement below. Actually, 0300 *is* blocked from 0061 in this sequence, because it is preceded by a character with the same canonical combining class (namely U+0305, ccc=230). The blocking context is an intervening character, just before the one being checked, that has either ccc=0 or a ccc greater than or equal to that of the character being checked.
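
As a rough illustration of that rule, here is a minimal Python sketch. It assumes the standard `unicodedata` module for the combining classes, and the function name `is_blocked` is made up for this example. It checks every character between the starter and the character in question, which for a canonically ordered string comes to the same thing as looking only at the immediately preceding one.

```python
import unicodedata

def is_blocked(text, starter_index, index):
    """Return True if text[index] is blocked from the starter at text[starter_index].

    A character C is blocked when some character B strictly between the
    starter and C has ccc=0 or ccc(B) >= ccc(C).
    """
    ccc_c = unicodedata.combining(text[index])
    for between in text[starter_index + 1:index]:
        ccc_b = unicodedata.combining(between)
        if ccc_b == 0 or ccc_b >= ccc_c:
            return True
    return False

# The NFD string from the question: 0061 05AE 0305 0300 0315 0062
nfd = "\u0061\u05AE\u0305\u0300\u0315\u0062"
print(is_blocked(nfd, 0, 2))  # U+0305: only U+05AE (228) intervenes -> False
print(is_blocked(nfd, 0, 3))  # U+0300: U+0305 (230) intervenes -> True
```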

--Ken

Starting with the NFD decomposition string, we retrieve the combining classes for each character from the UnicodeData.txt file:

0061 - 0
05AE - 228
0305 - 230
0300 - 230
0315 - 232
0062 - 0
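
The same lookup can be sketched in Python, assuming the standard `unicodedata` module (its tables are built from the Unicode Character Database, so the values should match UnicodeData.txt for the Unicode version the interpreter ships with). The variable names are only for illustration.

```python
import unicodedata

# The NFD string under discussion: 0061 05AE 0305 0300 0315 0062
nfd = "\u0061\u05AE\u0305\u0300\u0315\u0062"

# unicodedata.combining() returns the canonical combining class as an int.
classes = [(f"{ord(ch):04X}", unicodedata.combining(ch)) for ch in nfd]

for code_point, ccc in classes:
    print(f"{code_point} - {ccc}")
```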

You start at the first character after the starter (0061, with ccc=0), which is 05AE. There is no primary composition for the sequence 0061 05AE, so you move on.

Looking at 0305, it is not blocked from 0061, so check the primary composition for 0061 0305. There is none for that either, so move on.

Looking at 0300, it is also not blocked from 0061, so check the primary composition for 0061 0300. There is a primary composition for that sequence, 00E0, so replace the starter with that, delete the 0300, and continue. The string looks like this now:

00E0 - 0
05AE - 228
0305 - 230
0315 - 232
0062 - 0

Checking 0315 and 0062, they are not blocked, but there is no composition with 00E0, so the algorithm ends with the result:
00E0 05AE 0305 0315 0062

This disagrees with what it says in the normalization tests file as listed above. The question is, did I misunderstand the algorithm, or is this perhaps a bug in the data file?
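
One way to cross-check a hand trace like this is to run the composition step in code. The following is only a sketch, not a reference implementation: it assumes the input is already in NFD, the names `primary_composite` and `compose` are made up here, and it looks up primary composites by the shortcut of asking Python's `unicodedata.normalize` whether a starter plus one mark collapses to a single character. The key detail is that a mark is only tried against the starter if the last mark left in place has a strictly smaller combining class.

```python
import unicodedata

def primary_composite(starter, ch):
    """Shortcut lookup (an assumption of this sketch): the pair has a primary
    composite if NFC turns the two characters into exactly one."""
    composed = unicodedata.normalize("NFC", starter + ch)
    return composed if len(composed) == 1 else None

def compose(nfd):
    """Canonical composition pass over an NFD string (illustrative only)."""
    out = []
    starter_pos = None   # index in `out` of the most recent starter
    last_ccc = -1        # ccc of the last character kept after that starter
    for ch in nfd:
        ccc = unicodedata.combining(ch)
        # Not blocked only if the last kept character has a smaller ccc
        # (or nothing has been kept since the starter).
        if starter_pos is not None and last_ccc < ccc:
            composite = primary_composite(out[starter_pos], ch)
            if composite is not None:
                out[starter_pos] = composite
                continue  # ch was absorbed into the starter
        if ccc == 0:
            starter_pos = len(out)
            last_ccc = -1
        else:
            last_ccc = ccc
        out.append(ch)
    return "".join(out)

nfd = "\u0061\u05AE\u0305\u0300\u0315\u0062"
result = compose(nfd)
print(" ".join(f"{ord(c):04X}" for c in result))
print(result == unicodedata.normalize("NFC", nfd))
```

With this check in place, 0300 is skipped because 0305 (also ccc=230) sits between it and 0061, so the string comes out unchanged rather than starting with 00E0.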

Thanks,

Edwin

Received on Mon Dec 10 2012 - 14:59:48 CST
