Re: More Permanent Faults? - Unicode 5.0 Casefolding

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jun 08 2006 - 17:37:02 CDT


    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    > I have found two groups of NFKC grapheme clusters which fail to match their
    > default uppercasings after conversion to NFD in one of the important
    > 'case-insensitive' matching methods. I haven't reported these problems
    > formally yet - I'd like to see what other people think first. It's
    > conceivable that I'm the only person bothered by them.
    >
    > *Problem 1*
    >
    > The first is: <U+0131 LATIN SMALL LETTER DOTLESS I>
    >
    > The problem with this only occurs when using the default mappings. A
    > different can of worms opens up for Turkic locales - I don't know whether
    > the behaviour is fully defined for Turkic locales. This grapheme cluster is
    > in all four normalised forms. According to
    > http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing-5.0.0d13.txt and
    > http://www.unicode.org/Public/5.0.0/ucd/UnicodeData-5.0.0d11.txt , its
    > uppercasing (in all locales) is
    >
    > <U+0049 LATIN CAPITAL LETTER I>
    >
    > which is in all four normal forms.
    >
    > To compare these strings for 'canonical caseless matches', one calculates
    > NFD(toCasefold(NFD())) of the strings. By
    > http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt , their
    > default casefoldings, whether simple or full, are <U+0131> and <U+0069 LATIN
    > SMALL LETTER I>. These are not canonically equivalent. QED.

    Actually, to compare strings for canonical caseless matches, one must calculate the *closure* of the NFD() and toCasefold() transform functions. This means applying NFD() and toCasefold() alternately until the result no longer changes. So this may mean toCasefold(NFD(toCasefold(NFD(string)))), or even more applications of the functions.
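
    A minimal sketch of that fixed-point computation, assuming Python's unicodedata.normalize() and str.casefold() as stand-ins for NFD() and toCasefold(). They follow whatever Unicode version the interpreter ships with, not the 5.0 beta files cited above, and the names canonical_caseless_key and max_rounds are made up for illustration:

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition (NFD) of a string."""
    return unicodedata.normalize("NFD", s)


def canonical_caseless_key(s: str, max_rounds: int = 10) -> str:
    """Alternate case folding and NFD until the string stops changing.

    max_rounds is an arbitrary safety bound chosen for this sketch.
    """
    current = nfd(s)
    for _ in range(max_rounds):
        folded = nfd(current.casefold())
        if folded == current:
            return current
        current = folded
    raise ValueError("no fixed point reached within max_rounds")


def canonical_caseless_match(a: str, b: str) -> bool:
    """Compare two strings by their fixed-point keys."""
    return canonical_caseless_key(a) == canonical_caseless_key(b)


if __name__ == "__main__":
    # U+0131 (dotless i) against its default uppercasing U+0049.
    print(canonical_caseless_match("\u0131", "I"))
    # Precomposed A-ring against decomposed a + combining ring.
    print(canonical_caseless_match("\u00c5", "a\u030a"))
```

    With Python's default folding data the closure alone does not bring U+0131 and U+0049 together: the first comparison is false, since U+0131 folds to itself while U+0049 folds to U+0069.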

    > Incidentally, the definition of default casefolding contradicts the
    > definition of casefolding given in TUS 4.1.0 Section 5.18.
    >
    > There are two alternative solutions:
    > (a) Remove the upper- and title-casings for U+0131 from UnicodeData.txt and
    > uncomment out the Turkic data for U+0131 in SpecialCasing.dat, also making
    > it apply to Azer(baijan)i.
    > (b) Add two lines to SpecialCasing.dat:
    >
    > 0131; C; 0061; # LATIN SMALL LETTER DOTLESS I
    > 0131; T; 0131; # LATIN SMALL LETTER DOTLESS I

    I would certainly not do that: the common casefolding of a small dotless i is a small dotless i, not a normal (dotted) small i. This means that, outside of Turkic locales, a small dotless i does NOT match a normal (dotted) small i in caseless searches, but it does match a (possibly Turkic... or not) dotted capital I.

    But some caseless searches are implemented by actually comparing the result of:
    toUppercase(closure{NFD,toCaseFold}(string))

    This gives different results, because then it will match all 4 variants of i (small or capital, with or without dot). But this case is known and complicated to handle. There have been lots of discussions regarding this implicit dot (over small i or small j, and the way it is transformed in combination with other diacritics). In practice, the Turkic alternative for case mappings is not complete in SpecialCasing.txt, and I think it is not really normative for most protocols that need caseless compares: to be complete, one would need to make the combining dot above completely ignorable when it is used after a letter with an implicit dot in any of its letter cases.
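
    A rough sketch of the uppercase-based comparison, again assuming Python's built-in str.casefold(), str.upper() and unicodedata.normalize() as stand-ins for the Unicode operations. The set IMPLICIT_DOT_BASES and the helper names are hypothetical and deliberately limited to i and j. With Python's default mappings, i, I and U+0131 all end up as "I" after uppercasing, while U+0130 keeps its combining dot unless that dot is explicitly ignored, which is what the second key does:

```python
import unicodedata

COMBINING_DOT_ABOVE = "\u0307"
# Hypothetical, deliberately small set of base letters with an implicit dot
# in at least one of their cases (i, j, dotless i and their capitals).
IMPLICIT_DOT_BASES = {"i", "j", "I", "J", "\u0131"}


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def fold_to_fixed_point(s: str, max_rounds: int = 10) -> str:
    """Alternate case folding and NFD until nothing changes."""
    current = nfd(s)
    for _ in range(max_rounds):
        folded = nfd(current.casefold())
        if folded == current:
            return current
        current = folded
    raise ValueError("no fixed point reached within max_rounds")


def upper_key(s: str) -> str:
    """toUppercase applied to the folded, decomposed string."""
    return fold_to_fixed_point(s).upper()


def drop_dot_after_i_j(s: str) -> str:
    """Drop U+0307 when it directly follows one of the listed base letters."""
    out = []
    for ch in nfd(s):
        if ch == COMBINING_DOT_ABOVE and out and out[-1] in IMPLICIT_DOT_BASES:
            continue
        out.append(ch)
    return "".join(out)


def dot_insensitive_key(s: str) -> str:
    """Uppercase key with the combining dot above treated as ignorable."""
    return upper_key(drop_dot_after_i_j(s))


if __name__ == "__main__":
    for ch in ["i", "I", "\u0131", "\u0130"]:
        print(repr(upper_key(ch)), repr(dot_insensitive_key(ch)))
```

    This only handles a dot that immediately follows the base letter; a dot separated from it by other combining marks would need a more careful treatment.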

    Then consider the case of the Dutch ij ligated letter: should it match the ij letter pair? Then how do you consider the dots that are written above the ij ligated letter? Couldn't they be perceived as a diaeresis above an ij pair of letters? These are exactly borderline cases.

    For IDN applications, it was chosen to make the combining dot above fully ignorable, thus making all i's equivalent, as well as all j's (notice that IDN uses compatibility decompositions).

    You may also consider "caseless compares" as another way of specifying what is in fact a comparison based on the primary level of collation. In that case, it becomes locale-dependent, and tailorable in each locale. But the default collation table is well known and locale-independent (I can't remember whether the primary-level differences or equivalence classes in the DUCET are normative or informative; someone may verify this, but I think they were just informative).

    So, is the caseFolding() operation really normative? Shouldn't it be reformulated using the standard Unicode collation algorithm, which is much less ambiguous and can handle many more languages than what SpecialCasing.txt currently provides?

    One way to specify what caseFolding() should return is to look for the string with the smallest code point values whose primary-level collation keys are equal to those of the characters in the source string, when using the DUCET (or a tailored collation when computing locale-sensitive caseFolding), and then to assemble the result into a string in NFC form.


