More Permanent Faults? - Unicode 5.0 Casefolding

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Thu Jun 08 2006 - 14:00:27 CDT

Next message: Philippe Verdy: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"

Previous message: David Starner: "Re: Glyphs for German quotation marks"
Next in thread: Philippe Verdy: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
Reply: Philippe Verdy: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
Reply: Mark Davis: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
Reply: George W Gerrity: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"

There appear to be bugs in the definition of the case-folding function
toCasefold() as currently defined by
http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt and Section
3.13 of TUS 4.1.0. (I am using the latter as I cannot find a more reliable
draft of Section 3.13 in TUS 5.0.) This matters, for toCasefold() of NFKC
strings valid in Unicode 5.0 is about to be frozen forever. Should these
faults be made permanent?

I have found two groups of NFKC grapheme clusters which fail to match their
default uppercasings after conversion to NFD in one of the important
'case-insensitive' matching methods. I haven't reported these problems
formally yet - I'd like to see what other people think first. It's
conceivable that I'm the only person bothered by them.

*Problem 1*

The first is: <U+0131 LATIN SMALL LETTER DOTLESS I>

The problem with this only occurs when using the default mappings. A
different can of worms opens up for Turkic locales - I don't know whether
the behaviour is fully defined for Turkic locales. This grapheme cluster is
in all four normalisation forms. According to
http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing-5.0.0d13.txt and
http://www.unicode.org/Public/5.0.0/ucd/UnicodeData-5.0.0d11.txt , its
uppercasing (in all locales) is

<U+0049 LATIN CAPITAL LETTER I>

which is also in all four normalisation forms.

To test these two strings for a 'canonical caseless match', one calculates
NFD(toCasefold(NFD(X))) for each string X. By
http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt , their
default casefoldings, whether simple or full, are <U+0131> and <U+0069 LATIN
SMALL LETTER I> respectively. These are not canonically equivalent. QED.
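
As an illustration, the comparison can be sketched in Python. This is only a
sketch under stated assumptions: str.casefold() stands in for the default full
toCasefold(), str.upper() stands in for the default uppercasing, and the
interpreter's bundled Unicode data (a later version than the 5.0 drafts cited
here) is assumed to agree with them for these characters. The helper name
canonical_caseless_key is made up for the example.

```python
import unicodedata


def canonical_caseless_key(s: str) -> str:
    """Return NFD(toCasefold(NFD(s))).

    Assumption: str.casefold() is used as a stand-in for the default
    (full) toCasefold() operation.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


dotless_i = "\u0131"            # LATIN SMALL LETTER DOTLESS I
uppercased = dotless_i.upper()  # default, locale-independent uppercasing

key_original = canonical_caseless_key(dotless_i)
key_uppercased = canonical_caseless_key(uppercased)

print(f"uppercasing: U+{ord(uppercased):04X}")
print(f"keys: U+{ord(key_original):04X} vs U+{ord(key_uppercased):04X}")
print("canonical caseless match:", key_original == key_uppercased)
```

If the bundled data behaves as described above, the two keys differ and the
final line reports that the strings do not match.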

Incidentally, the definition of default casefolding contradicts the
definition of casefolding given in TUS 4.1.0 Section 5.18.

There are two alternative solutions:
(a) Remove the upper- and title-casings for U+0131 from UnicodeData.txt and
uncomment the Turkic data for U+0131 in SpecialCasing.txt, also making
it apply to Azer(baijan)i.
(b) Add two lines to CaseFolding.txt:

0131; C; 0069; # LATIN SMALL LETTER DOTLESS I
0131; T; 0131; # LATIN SMALL LETTER DOTLESS I

*Problem 2*

The second group is probably much less troublesome, but is quite awkward.
There are two plausible NFC and NFKC sequences

<U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0306 COMBINING BREVE>

<U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0304 COMBINING
MACRON>

The former might occur if one were using a breve (or brachy) to explicitly
mark the lack of stress in polytonic Modern Greek, for example in explaining
the meter of poetry. The second might occur if one decided to redundantly
mark Classical Greek vowel length - the macron is redundant, for the
subscript iota implies that the vowel is long. I don't have any examples of
these combinations. I will work with the latter.

Converted to NFD, it yields

<U+03B1 GREEK SMALL LETTER ALPHA, U+0304, U+0345 COMBINING GREEK
YPOGEGRAMMENI>

The default uppercasing (not uncontroversial, but that's a linguistic
matter) is
<U+0391 GREEK CAPITAL LETTER ALPHA, U+0304, U+0399 GREEK CAPITAL LETTER IOTA>,
whose NFC and NFKC form is

<U+1FB9 GREEK CAPITAL LETTER ALPHA WITH MACRON, U+0399 GREEK CAPITAL LETTER
IOTA>

Now, the case-insensitive match whose outcome is guaranteed to be stable
under the case-folding stability policy
(http://www.unicode.org/standard/stability_policy.html) is given by
toCasefold() of NFKC strings.

However, toCasefold() of the starting point, <U+1FB3, U+0304>, is <U+03B1
GREEK SMALL LETTER ALPHA, U+03B9 GREEK SMALL LETTER IOTA, U+0304>, while
toCasefold() of <U+1FB9, U+0399> is <U+1FB1 GREEK SMALL LETTER ALPHA WITH
MACRON, U+03B9>. The casefolded forms are not canonically equivalent!
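
The mismatch can be reproduced with a short Python sketch. It makes the
following assumptions: str.casefold() and str.upper() stand in for the default
full toCasefold() and the default uppercasing, and the interpreter's bundled
Unicode data agrees with the 5.0 drafts for these characters. The helper
code_points is made up for display purposes. The sketch folds the sequences
exactly as written above and does not itself check their normalisation status.

```python
import unicodedata


def code_points(s: str) -> str:
    """Format a string as a list of U+XXXX code points (display helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


start = "\u1fb3\u0304"  # alpha with ypogegrammeni, then combining macron

# Default uppercasing of the NFD form, recomposed to NFC.
start_nfd = unicodedata.normalize("NFD", start)
upper_nfc = unicodedata.normalize("NFC", start_nfd.upper())

# Fold both strings directly, with no prior decomposition.
folded_start = start.casefold()
folded_upper = upper_nfc.casefold()

print("NFD of start:     ", code_points(start_nfd))
print("uppercased (NFC): ", code_points(upper_nfc))
print("folded start:     ", code_points(folded_start))
print("folded uppercase: ", code_points(folded_upper))

# Canonical equivalence is tested by comparing NFD forms.
equivalent = (unicodedata.normalize("NFD", folded_start)
              == unicodedata.normalize("NFD", folded_upper))
print("folded forms canonically equivalent:", equivalent)
```

The iota produced by folding U+1FB3 is a starter, so the macron that follows
it cannot be reordered in front of it by normalisation.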

The problem here is that the definition of toCasefold() offers no hint that
when U+0345 COMBINING GREEK YPOGEGRAMMENI, which may be hidden in a
precomposed form, is detached as U+03B9, it should be moved to after any
immediately following characters of non-zero combining class (and any
characters that decompose solely to such characters, such as U+0F73 and
U+0F75). SpecialCasing.txt at least implies that this should be done when a
U+0399 detaches itself, but I find it hard to read that as normative.

This type of problem does not occur with NF(K)D strings - the combining
class of U+0345 forces it to the end of the cluster. It is for this reason
that the formal definitions of canonical and compatibility caseless matches
use NFD and NFKD respectively.
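
For comparison, a sketch of the NFD-based canonical caseless match on the same
pair of strings. The assumptions are that str.casefold() stands in for the
default full toCasefold() and that the interpreter's bundled Unicode data
agrees with the 5.0 drafts for these characters. The function name
canonical_caseless_match is made up for the example.

```python
import unicodedata


def canonical_caseless_match(x: str, y: str) -> bool:
    """Compare NFD(toCasefold(NFD(x))) with NFD(toCasefold(NFD(y))).

    Assumption: str.casefold() is used as a stand-in for toCasefold().
    """
    def key(s: str) -> str:
        decomposed = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", decomposed.casefold())

    return key(x) == key(y)


lower_seq = "\u1fb3\u0304"  # alpha with ypogegrammeni, then combining macron
upper_seq = "\u1fb9\u0399"  # capital alpha with macron, then capital iota

# Decomposing first exposes U+0345 after the macron, so it folds to U+03B9
# in the same position as the iota in the uppercase string.
matched = canonical_caseless_match(lower_seq, upper_seq)
print("canonical caseless match:", matched)
```

Here the initial NFD step places U+0345 after the macron before folding, so
both strings should reduce to the same sequence.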

Richard.

