From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Sep 22 2009 - 16:44:32 CDT
Jeff Senn asked:
> Can someone sort out an ambiguity for me in composition during
> normalization?
> Either I misunderstand, or a couple of widely deployed implementations
> have bugs, and/or the standards docs imply an inconsistency.
>
> Here are 2 test cases and the question is, can the characters be
> *canonically* combined during normalization?
>
> case 1: 1B11, 1B35
> case 2: 0CCA, 0CD5
>
> There are (non-compatibility) decompositions for both of these
> sequences:
>
> 1B12 --> 1B11, 1B35
> 0CCB --> 0CCA, 0CD5
>
> All of these characters have combining class 0. Can they be canonically
> combined? Even though the 2nd characters are NOT "combining"?
There's the first mistake. Both of the 2nd characters in
these sequences *ARE* combining:
1B35;BALINESE VOWEL SIGN TEDUNG;Mc;0;L;;;;;N;;;;;
0CD5;KANNADA LENGTH MARK;Mc;0;L;;;;;N;;;;;
gc=Mc (Spacing_Mark) indicates a combining mark -- they are just not
*non-spacing* combining marks (gc=Mn).
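
The two UnicodeData.txt fields cited above (general category and canonical combining class) can be checked from Python's standard `unicodedata` module. This is only a minimal sketch: the helper name `describe` is made up for illustration, and the result depends on the Unicode version bundled with the Python build in use.

```python
import unicodedata

# Second characters of the two test sequences discussed above.
MARKS = {
    "\u1B35": "BALINESE VOWEL SIGN TEDUNG",
    "\u0CD5": "KANNADA LENGTH MARK",
}

def describe(ch):
    """Return (general category, canonical combining class) for one character.

    `describe` is a hypothetical helper name, not part of any library.
    """
    return unicodedata.category(ch), unicodedata.combining(ch)

if __name__ == "__main__":
    for ch, name in MARKS.items():
        gc, ccc = describe(ch)
        print(f"U+{ord(ch):04X} {name}: gc={gc}, ccc={ccc}")
```

A category of `Mc` together with a combining class of 0 is the "combining, but spacing" situation described above.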
> My read of the current UAX15 implies "yes".
Correct. The entries from NormalizationTest.txt spell out
what the various normalized forms are:
     NFC  NFD       NFKC NFKD
1B12;1B12;1B11 1B35;1B12;1B11 1B35; # ... BALINESE LETTER OKARA TEDUNG
     NFC  NFD            NFKC NFKD
0CCB;0CCB;0CC6 0CC2 0CD5;0CCB;0CC6 0CC2 0CD5; # ... KANNADA VOWEL SIGN OO
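
For comparison, the same four forms can be computed with Python's `unicodedata.normalize`. The sketch below assumes a Python build whose Unicode data includes Balinese (Unicode 5.0 or later); `hexes` and `forms` are illustrative helper names, not library functions.

```python
import unicodedata

def hexes(s):
    """Render a string as space-separated code points, e.g. '1B11 1B35'."""
    return " ".join(f"{ord(c):04X}" for c in s)

def forms(source):
    """Return the four normalization forms of `source` as hex strings."""
    return {
        form: hexes(unicodedata.normalize(form, source))
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }

if __name__ == "__main__":
    # The two precomposed characters and the two test sequences.
    for src in ("\u1B12", "\u0CCB", "\u1B11\u1B35", "\u0CCA\u0CD5"):
        print(hexes(src), "->", forms(src))
```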
> UAX15 currently says: "D2. In any character sequence beginning with a
> starter S, a character C is blocked from S if and only if there is
> some character B between S and C, and either B is a starter or it has
> the same or higher combining class as C."
>
> Since there is no character between S and C, I assume C is not
> "blocked".
Correct.
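
The wording of D2 quoted above can be turned into a small predicate. This is a rough sketch of the quoted definition only, not of a full composition algorithm; `is_blocked` is a hypothetical name, and the sequence is assumed to begin with the starter S.

```python
import unicodedata

def is_blocked(seq, c_index):
    """Apply D2 as quoted: seq[0] is the starter S, seq[c_index] is C.

    C is blocked from S if some character B strictly between them is a
    starter (ccc == 0) or has the same or higher combining class as C.
    """
    ccc_c = unicodedata.combining(seq[c_index])
    for b in seq[1:c_index]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0 or ccc_b >= ccc_c:
            return True
    return False

if __name__ == "__main__":
    # Nothing lies between S and C, so C cannot be blocked.
    print(is_blocked("\u1B11\u1B35", 1))
    print(is_blocked("\u0CCA\u0CD5", 1))
```

With no intervening character the loop body never runs, which is why both sequences come out as not blocked.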
>
> However, a previous draft of UAX15 uses the phrase "A combining
> character C can be canonically combined with a base character B..."
> which implies "No".
>
> http://unicode.org/reports/tr15/
> http://unicode.org/reports/tr15/pdtr15.html
Proposed Draft UTR #15 (from 1998!) has no status. It is not
an approved document, nor was it ever. It should not be referred
to for anything other than historical interest in the development
of the standard.
Please refer only to the *approved* version of UAX #15 for such
a discussion.
> At least 2 implementations do the combination in case 2: Python and
> the ICU library (e.g. http://minaret.info/test/normalize.msp ). However,
> ICU seems inconsistent in that it does NOT combine case 1!
Which *version* of ICU? Balinese was added in Unicode 5.0.
If you are using an older version of ICU that had not yet
been updated to Unicode 5.0, then it would not have knowledge
of Balinese, and the sequence <1B11, 1B35> (for that version)
would be unchanged by normalization.
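
The same version question applies to any implementation. As an illustration only, and for Python rather than ICU, the sketch below reports which Unicode version the `unicodedata` module was built against and whether it has a name for a given character; `unicode_version` and `knows` are made-up helper names.

```python
import unicodedata

def unicode_version():
    """Return the bundled Unicode data version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))

def knows(ch):
    """True if the bundled data has a character name for `ch`."""
    return unicodedata.name(ch, None) is not None

if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    # Balinese was added in Unicode 5.0, so older data would not know U+1B35.
    print("U+1B35 known:", knows("\u1B35"))
```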
>
> So, if the answer is indeed "YES", one might add
>
> case 3: 0CCA, 0300, 0CD5 (admittedly unusual)
>
> which clearly should not compose since ccc(0300) >= ccc(0CD5)
> (http://www.unicode.org/review/pr-29.html)
>
> Python, however, (incorrectly) yields: 0CCB, 0300
That looks like a bug in Python.
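
Case 3 is easy to re-run against whatever Python is at hand. The sketch below assumes a Python whose `unicodedata` follows the corrected (PR-29) composition rule, in which case NFC should leave the sequence unchanged; a build with the behavior reported above would print `0CCB 0300` instead. `nfc_hex` is an illustrative helper name.

```python
import unicodedata

CASE3 = "\u0CCA\u0300\u0CD5"

def nfc_hex(s):
    """Return the NFC form of `s` as space-separated code points."""
    return " ".join(f"{ord(c):04X}" for c in unicodedata.normalize("NFC", s))

if __name__ == "__main__":
    result = nfc_hex(CASE3)
    print("NFC(0CCA 0300 0CD5) =", result)
    if result == "0CCB 0300":
        print("0CD5 was composed across the intervening 0300")
```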
--Ken