Re: More Permanent Faults? - Unicode 5.0 Casefolding

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Thu Jun 08 2006 - 21:40:16 CDT


    Philippe Verdy wrote on Thursday, June 08, 2006 11:37 PM
    Subject: Re: More Permanent Faults? - Unicode 5.0 Casefolding

    > From: "Richard Wordingham" richard.wordingham@ntlworld.com

    > Actually, to compare strings for canonical caseless matches, one must
    > calculate the *closure* of the NFD() and toCasefold() transform functions.
    > This means applying NFD() and toCasefold() alternately for as long as the
    > result keeps changing. So this may mean:
    > toCasefold(NFD(toCasefold(NFD(string)))) or even more applications of the
    > functions.

    The implication is that the composite map NFD·toCasefold·NFD is the closure;
    do you have a counter-example in mind? I took the definition from TUS 3.13.
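
    As a minimal sketch of the two readings, the following Python compares the
    three-step map NFD·toCasefold·NFD with a fixed-point iteration of the same
    round. It assumes that Python's str.casefold() can stand in for the default
    full toCasefold() and unicodedata.normalize() for NFD; both follow the
    Unicode version bundled with the interpreter, not Unicode 5.0. The function
    names are illustrative only.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def caseless_key(s: str) -> str:
    """Three-step map: NFD, then casefold, then NFD again."""
    return nfd(nfd(s).casefold())


def closure_key(s: str, max_rounds: int = 10) -> str:
    """Repeat the NFD/casefold round until the string stops changing."""
    current = s
    for _ in range(max_rounds):
        following = nfd(nfd(current).casefold())
        if following == current:
            return current
        current = following
    raise RuntimeError("no fixed point within max_rounds")


def canonical_caseless_match(a: str, b: str) -> bool:
    """Compare two strings by their three-step keys."""
    return caseless_key(a) == caseless_key(b)


if __name__ == "__main__":
    # Sharp s, ANGSTROM SIGN, capital I with dot above, alpha + U+0345 + U+0313.
    samples = ["Stra\u00dfe", "\u212b", "\u0130", "\u03b1\u0345\u0313"]
    for s in samples:
        same = caseless_key(s) == closure_key(s)
        print(ascii(s), ascii(caseless_key(s)), "closure agrees:", same)
```

    The last sample shows why the first NFD matters: with the bundled data
    U+0345 folds to U+03B9, so folding the two canonically equivalent orderings
    of U+0345 and U+0313 without normalising first gives inequivalent results.
    The sketch only checks a handful of strings; it is not a proof that the
    three-step map is the closure.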

    > I would certainly not do that; the common casefolding of a small dotless i
    > is a small dotless i, not a normal (dotted) small i. This means that,
    > outside of Turkic locales, a small dotless i does NOT match a normal
    > (dotted) small i in caseless searches, but it does match a (possibly
    > Turkic...or not) dotted capital I.

    Matching small dotless I and dotted capital I has only symmetry to recommend
    it.
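
    For reference, here is a small sketch that groups the four i forms by their
    default caseless key. It again assumes Python's str.casefold() as a
    stand-in for the default full toCasefold(), using the interpreter's bundled
    Unicode data, so the result need not match the Unicode 5.0 tables under
    discussion. The helper names are illustrative.

```python
import unicodedata

I_FORMS = {
    "LATIN SMALL LETTER I": "\u0069",
    "LATIN CAPITAL LETTER I": "\u0049",
    "LATIN SMALL LETTER DOTLESS I": "\u0131",
    "LATIN CAPITAL LETTER I WITH DOT ABOVE": "\u0130",
}


def default_key(s: str) -> str:
    """NFD, then the default (non-Turkic) full case folding, then NFD."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def fold_classes(forms: dict) -> dict:
    """Group character names by their default caseless key."""
    classes = {}
    for name, ch in forms.items():
        classes.setdefault(default_key(ch), []).append(name)
    return classes


if __name__ == "__main__":
    for key, names in fold_classes(I_FORMS).items():
        print(ascii(key), names)
```

    With this data the plain small and capital i share a key, the dotless i
    folds to itself, and the dotted capital I folds to i followed by U+0307, so
    the four forms fall into three classes.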

    > But some caseless searches are implemented by actually comparing the
    > result of:
    > toUppercase(closure{NFD,toCaseFold}(string))

    > This gives different results, because then it will match all 4 variants of
    > i (small or capital, with or without dot). But this case is known and
    > complicated to handle.

    This looks like a way to get round a long-running inconsistency in the
    casefolding of U+0131. I've been thinking a lot over the past few days
    about just what casefolding means. The upper- and lower-casing functions
    can be thought of as relationships on strings ('is the upper case form of'
    and 'is the lower case form of'), and as such they generate an equivalence
    relationship on strings. The casefolding function is then an idempotent
    function such that
    (a) 'is the casefolding form of' generates the above equivalence
    relationship; and
    (b) preserves canonical equivalence.

    > There are lots of discussions regarding this implicit dot (over small i or
    > small j, and the way it is transformed in combination with other
    > diacritics). In practice, the Turkic alternative for case mappings is not
    > complete in SpecialCasing.txt, and I think it is not really normative for
    > most protocols that need caseless compares: to be complete, one would need
    > to make the combining dot above completely ignorable when it is used after
    > a letter with an implicit dot in any of its letter cases.

    I believe only the default case-mappings and case-folding are normative.
    The Turkic mappings have to be incomplete until one can determine what SMALL
    LETTER I WITH ACUTE and the like capitalise to. Lithuanian lower-casing
    does not preserve canonical equivalence, for U+00CF LATIN CAPITAL LETTER I
    WITH DIAERESIS and its decomposition lower-case inequivalently by the rules.
    Nevertheless, one can derive a Lithuanian case folding. In fact, one can
    derive several reasonable-looking equivalent ones that meet the definition
    above. The one that is algorithmically derivable does not preserve
    canonical equivalence. However, though I have not double-checked, there is an
    equivalent Lithuanian case-folding that works by adding the following rules:

    0307; L; After_Soft_Dotted; # COMBINING DOT ABOVE
    00CC; L; 0069 0300; # LATIN CAPITAL LETTER I WITH GRAVE
    00CD; L; 0069 0301; # LATIN CAPITAL LETTER I WITH ACUTE
    00EC; L; 0069 0300; # LATIN SMALL LETTER I WITH GRAVE
    00ED; L; 0069 0301; # LATIN SMALL LETTER I WITH ACUTE
    0128; L; 0069 0303; # LATIN CAPITAL LETTER I WITH TILDE
    0129; L; 0069 0303; # LATIN SMALL LETTER I WITH TILDE

    It preserves canonical equivalence if you fix the issue of U+0131.
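
    The U+00CF inequivalence mentioned above can be checked with a rough
    sketch. It hand-codes one Lithuanian rule as I understand it from
    SpecialCasing.txt (capital I followed by a combining mark of class 230
    lower-cases to i plus U+0307) and uses Python's locale-independent
    str.lower() for everything else. The function names and this simplified
    reading of the More_Above condition are assumptions, not a full Lithuanian
    implementation.

```python
import unicodedata


def more_above(s: str, i: int) -> bool:
    """True if s[i] is followed by a class-230 mark before any class-0 character."""
    for ch in s[i + 1:]:
        ccc = unicodedata.combining(ch)
        if ccc == 230:
            return True
        if ccc == 0:
            return False
    return False


def lower_lt_sketch(s: str) -> str:
    """Lower-case s, applying only the assumed 'I + More_Above' Lithuanian rule."""
    out = []
    for i, ch in enumerate(s):
        if ch == "I" and more_above(s, i):
            out.append("i\u0307")  # keep an explicit dot under the accent
        else:
            out.append(ch.lower())  # default per-character lower-casing
    return "".join(out)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


if __name__ == "__main__":
    composed = "\u00cf"      # LATIN CAPITAL LETTER I WITH DIAERESIS
    decomposed = "I\u0308"   # its canonical decomposition
    print(ascii(nfd(lower_lt_sketch(composed))))
    print(ascii(nfd(lower_lt_sketch(decomposed))))
```

    Under these assumptions the precomposed letter lower-cases to i plus
    diaeresis, while its decomposition picks up the extra U+0307, so the two
    results are not canonically equivalent.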

    > Then consider the case of the Dutch ij ligated letter: should it match the
    > ij letter pair? then how do you consider the dots that are written above
    > the ij ligated letter? couldn't it be perceived as a diaeresis above a ij
    > pair of letters? We are exactly on borderline cases.

    I was just considering the formal requirement. As you can't encode the
    Dutch ligature that way, the issue doesn't arise. I agree that practically
    you should do a compatibility decomposition on it, but that is not the
    Unicode default casefolding.
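
    To make the distinction concrete, here is a short sketch contrasting the
    default folding of the ij ligature with a folding that first applies a
    compatibility decomposition. Python's str.casefold() stands in for the
    default toCasefold(); compat_key() is a simplified, illustrative helper and
    not the full compatibility caseless matching definition from TUS.

```python
import unicodedata

IJ_CAPITAL = "\u0132"  # LATIN CAPITAL LIGATURE IJ
IJ_SMALL = "\u0133"    # LATIN SMALL LIGATURE IJ


def default_key(s: str) -> str:
    """NFD, default case folding, NFD: the ligature stays a single character."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def compat_key(s: str) -> str:
    """NFKD, default case folding, NFKD: the ligature becomes the pair i, j."""
    decomposed = unicodedata.normalize("NFKD", s)
    return unicodedata.normalize("NFKD", decomposed.casefold())


if __name__ == "__main__":
    for text in (IJ_CAPITAL, IJ_SMALL, "IJ", "ij"):
        print(ascii(text), ascii(default_key(text)), ascii(compat_key(text)))
```

    With the default key the two ligatures match each other but not the letter
    pair; with the compatibility key all four inputs collapse to "ij".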

    > So, is the caseFolding() operation really normative?

    Yes, the *default* toCasefold() is normative.

    > Shouldn't it be reformulated using the standard Unicode collation
    > algorithm, which is much less ambiguous and can handle many more languages
    > than what SpecialCasing.txt is currently providing?

    This would probably be more useful. TUS already suggests that approach.

    Richard.


