Re: Case mapping of dotless lowercase letters

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 16 2003 - 20:10:47 EST

    John Cowan noted:

    <quote>
    Here's what happens exactly:

     source      simple case folding   full case folding      tr/az case folding
     dotted i    dotted i              dotted i               dotted i
     dotless i   dotless i             dotless i              dotless i
     dotted I    dotted I              dotted i + comb. dot   dotted i
     dotless I   dotted i              dotted i               dotless i
    </quote>

    Add to that specification of the case *folding* (from
    CaseFolding.txt), the default case *mappings* (from
    UnicodeData.txt):

     source        default lc mapping   default uc mapping
     dotted i      dotted i             (dotless) I
     dotless i     dotless i            (dotless) I
     dotted I      dotted i             dotted I
     (dotless) I   dotted i             (dotless) I
     
    If you are case *folding* you are doing one thing; if you are
    case *mapping* you are doing another.

    Case *folding* creates equivalence classes for different sequences.

    Simple case folding, as defined above, creates the following
    equivalence classes, adding in the sequences involving use of
    the combining dot as well.

       A. { i, I }
       B. { dotless i }
       C. { dotted I }
       D. { <i, dot above>, <I, dot above> }
       E. { <dotless i, dot above> }
       F. { <dotted I, dot above> }
       
    These 6 classes are distinguished. They do not conflate, although
    in class A and in class D, there are two sequences which do fold
    together.
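
    The six classes above can be reproduced with a small sketch. Python
    has no built-in simple case folding, so the table SIMPLE_FOLD below
    is hand-written from the folding table quoted earlier and covers
    only these four letters; all the names (LETTERS, SIMPLE_FOLD,
    simple_fold, classes) are hypothetical, chosen for illustration.

```python
DOT = "\u0307"  # COMBINING DOT ABOVE

# The four letters under discussion, keyed by the names used in this post.
LETTERS = {
    "i": "i",
    "I": "I",
    "dotless i": "\u0131",  # LATIN SMALL LETTER DOTLESS I
    "dotted I": "\u0130",   # LATIN CAPITAL LETTER I WITH DOT ABOVE
}

# Assumption: simple folding restricted to these letters. Only I changes
# (to i); dotless i and dotted I fold to themselves.
SIMPLE_FOLD = {"I": "i"}


def simple_fold(s):
    """Fold one code point at a time, so the string length never changes."""
    return "".join(SIMPLE_FOLD.get(ch, ch) for ch in s)


def classes(fold):
    """Group the 8 sequences (4 letters, bare and with a dot) by folded form."""
    groups = {}
    for name, ch in LETTERS.items():
        groups.setdefault(fold(ch), []).append(name)
        groups.setdefault(fold(ch + DOT), []).append(f"<{name}, dot above>")
    return list(groups.values())


for members in classes(simple_fold):
    print(members)
```

    Passing a different folding function to classes() regroups the same
    8 sequences, which is a convenient way to compare the folding
    variants side by side.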

    Full case folding, as defined above, creates the following
    equivalence classes.

       A. { i, I }
       B. { dotless i }
       G. { dotted I, <i, dot above>, <I, dot above> }
       E. { <dotless i, dot above> }
       F. { <dotted I, dot above> }
       
    In other words, there are now 5, not 6 equivalence classes, as the
    classes C and D from simple case folding have been conflated.
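
    As a rough check, Python's str.casefold() applies full case folding,
    so it can be used to watch classes C and D merge into G. This is a
    sketch, not a conformance test; the variable names are illustrative.

```python
DOT = "\u0307"        # COMBINING DOT ABOVE
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I
DOTTED_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Class G: three spellings that share one folded form, <i, dot above>.
class_g = [DOTTED_I, "i" + DOT, "I" + DOT]
folded_g = {s.casefold() for s in class_g}

# One representative each of classes A, B, E and F; they stay apart
# from G and from each other.
others = ["i", DOTLESS_I, DOTLESS_I + DOT, DOTTED_I + DOT]
folded_others = {s.casefold() for s in others}

print(len(folded_g))                       # number of distinct forms in G
print(len(folded_others))                  # number of distinct forms elsewhere
print(folded_g.isdisjoint(folded_others))  # G overlaps none of the others
```

    Note that folding dotted I changes the string length from 1 to 2,
    which is exactly what simple case folding is not allowed to do.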

    Turkic/Azeri case folding, as defined above, creates the following
    equivalence classes.

       H. { i, dotted I }
       I. { dotless i, I }
       J. { <i, dot above>, <dotted I, dot above> }
       K. { <dotless i, dot above>, <I, dot above> }
       
    And now there are 4 *different* equivalence classes, which group
    together the sequences which make sense for Turkish/Azeri.
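
    Python does not offer a Turkic folding, so the following is only a
    sketch: it applies the two Turkic-specific mappings from the table
    above first and then falls back to str.casefold() for everything
    else. The names TURKIC_FOLD and turkic_casefold are hypothetical.

```python
DOT = "\u0307"  # COMBINING DOT ABOVE

# Assumption: the only Turkic-specific foldings are the two in the table.
TURKIC_FOLD = {
    0x0049: 0x0131,  # I        -> dotless i
    0x0130: 0x0069,  # dotted I -> i
}


def turkic_casefold(s):
    """Apply the Turkic i mappings, then the default full case folding."""
    return s.translate(TURKIC_FOLD).casefold()


class_h = {turkic_casefold(s) for s in ["i", "\u0130"]}
class_i = {turkic_casefold(s) for s in ["\u0131", "I"]}
class_j = {turkic_casefold(s) for s in ["i" + DOT, "\u0130" + DOT]}
class_k = {turkic_casefold(s) for s in ["\u0131" + DOT, "I" + DOT]}

# Each class collapses to a single folded form, and the four forms differ.
print(class_h, class_i, class_j, class_k)
```

    The translation has to run before casefold(); otherwise the default
    folding would already have turned I into i.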

    Note that none of the 3 sets of equivalence classes violates
    *canonical* equivalence, because none of the 8 sequences involved
    is canonically equivalent to any other. In other words, no matter
    which of the 3 approaches you take to case folding, in no instance
    are you claiming that canonically equivalent sequences are to be
    interpreted differently.

    Now let's look at what happens with case *mapping*, using the
    default mappings of UnicodeData.txt.

    Lowercasing first:

       L. { i, I, dotted I } --> i
       B. { dotless i } --> dotless i
       M. { <i, dot above>, <I, dot above>, <dotted I, dot above> }
                             --> <i, dot above>
       E. { <dotless i, dot above> } --> <dotless i, dot above>
       
    Uppercasing next:

       N. { i, I, dotless i } --> I
       C. { dotted I } --> dotted I
       O. { <i, dot above>, <I, dot above>, <dotless i, dot above> }
                             --> <I, dot above>
       F. { <dotted I, dot above> } --> <dotted I, dot above>
       
    The classes of sequences that get conflated are different here. In
    particular, classes L, M, N, O conflate characters that are not
    conflated by the formal definition of case folding.

    So, in particular, one should *not* expect the results of case
    mapping, followed by a binary comparison, to be the same as
    a formal case folding comparison. There will be differences.
    Any implementation that does not take this into account is still
    confused (aren't we all?) in its handling of these letters.
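
    The difference is easy to demonstrate. The sketch below compares
    dotless i with I in two ways, using Python's str.upper() (which
    agrees with the default uppercase mappings for these letters) and
    str.casefold(); the helper names are made up for the example.

```python
def equal_by_upper(a, b):
    """Compare by uppercasing both sides, then comparing code points."""
    return a.upper() == b.upper()


def equal_by_casefold(a, b):
    """Compare by case folding both sides, then comparing code points."""
    return a.casefold() == b.casefold()


DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

print(equal_by_upper(DOTLESS_I, "I"))     # True: both uppercase to I (class N)
print(equal_by_casefold(DOTLESS_I, "I"))  # False: classes B and A are distinct
```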

    Now add to that the problem of which of the elements in the
    equivalence classes *look* the same, and you have the potential
    for even more confusion. In particular, in simple case folding,
    you have the equivalence classes:

       A. { i, I }
       E. { <dotless i, dot above> }
      
    Members of class E are *not* equivalent to members of class A.
    But of course, <dotless i, dot above> *looks like* i and does
    *not* look like I. Add in the others, plus all the potential
    differences in how fonts may implement the soft-dotted
    property, and this entire area can lead to total confusion.

    One moral of the story is: DO NOT USE COMBINING DOTS WITH I's.
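
    One way to act on that moral is to flag such sequences in incoming
    data. The sketch below is a minimal check with a hypothetical name,
    find_dotted_i_sequences; it only looks at a combining dot that
    immediately follows the letter, and it assumes the text has not
    been normalized first.

```python
DOT = "\u0307"  # COMBINING DOT ABOVE

# i, I, dotless i, dotted I
I_LETTERS = {"i", "I", "\u0131", "\u0130"}


def find_dotted_i_sequences(text):
    """Return the indices of i-like letters directly followed by U+0307."""
    return [
        idx
        for idx in range(len(text) - 1)
        if text[idx] in I_LETTERS and text[idx + 1] == DOT
    ]


print(find_dotted_i_sequences("ab\u0131\u0307c"))  # dotless i + dot at index 2
print(find_dotted_i_sequences("Istanbul"))         # no combining dots at all
```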

    If you subtract out all the superfluous combinations with
    combining dots (cited above only for completeness), then the
    situation becomes much simpler and more comprehensible:

    Simple case folding. [disallows string length change]

       A. { i, I }
       B. { dotless i }
       C. { dotted I }
       
    Full case folding. [allows string length change]

       A. { i, I }
       B. { dotless i }
       G. { dotted I } [represented in folded form as <i, dot above>]
       
    Turkic/Azeri case folding.

       H. { i, dotted I }
       I. { dotless i, I }

    Lowercasing:

       L. { i, I, dotted I } --> i
       B. { dotless i } --> dotless i
       
    Uppercasing:

       N. { i, I, dotless i } --> I
       C. { dotted I } --> dotted I

    Add in Turkic locale-specific special casing.

    Lowercasing:

       H. { i, dotted I } --> i
       I. { dotless i, I } --> dotless i
       
    Uppercasing:

       H. { i, dotted I } --> dotted I
       I. { dotless i, I } --> I
       
    That is *still* complicated enough. But you could at least copy that
    out, paste it on the wall, and expect an engineer to get it right
    in an implementation.
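
    For illustration, the Turkic tables above might be coded roughly as
    follows. This is a sketch under stated assumptions, not a complete
    locale implementation: it handles only the four standalone letters,
    ignores sequences with combining dots, and leaves every other
    character to Python's str.lower() and str.upper(). The names
    TR_LOWER, TR_UPPER, turkic_lower and turkic_upper are hypothetical.

```python
# Turkic-specific mappings, applied before the default ones.
TR_LOWER = {
    0x0049: 0x0131,  # I        -> dotless i
    0x0130: 0x0069,  # dotted I -> i
}
TR_UPPER = {
    0x0069: 0x0130,  # i         -> dotted I
    0x0131: 0x0049,  # dotless i -> I
}


def turkic_lower(s):
    """Lowercase with the Turkic i mappings, default mappings otherwise."""
    return s.translate(TR_LOWER).lower()


def turkic_upper(s):
    """Uppercase with the Turkic i mappings, default mappings otherwise."""
    return s.translate(TR_UPPER).upper()


print(turkic_lower("ISTANBUL"))  # begins with dotless i
print(turkic_upper("istanbul"))  # begins with dotted I
```

    Doing the translation first matters: once the default mapping has
    turned I into i, the information needed to pick dotless i is gone.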

    By the way, the UTC has been over this stuff so many times that the
    topic is by now one that elicits groans of "Not those damn Turkish
    i's again!" when brought up in the meetings. It is very unlikely that
    the current specification is going to be changed again in any
    way. Nothing anyone could do could improve the situation. All it
    would accomplish would be to destabilize any implementation that people
    already have of this stuff.

    Anyone who -- in Unicode data -- adds combining dots to i's deserves
    the trouble they will get into. And anyone who tries to represent
    dotted i's by putting combining dots on dotless i's also deserves
    the trouble they will get into. (The same will be true of j's, once
    the recently approved dotless j character is published.)

    Also, beware of two of the big warnings provided in the Unicode
    Standard and the Unicode Character Database about this stuff:

    I. No casing operations are reversible.

    II. Casing operations ... do not preserve normalization form.
        (This is true both of case mapping and of case folding.)
        
    And, as the Turkish i's illustrate, case mappings are not
    one-to-one in a functional sense. A lowercasing may conflate
    two distinct uppercase characters into a single lowercase,
    and an uppercasing may conflate two distinct lowercase
    characters into a single uppercase.
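
    A short sketch of the uppercase conflation and of a failed round
    trip, using Python's built-in str.upper() and str.lower(). One
    caveat: Python lowercases dotted I to <i, dot above> rather than to
    the plain i given in UnicodeData.txt, so only the uppercase
    conflation is shown here.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I
DOTTED_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Uppercasing conflates two distinct lowercase letters into one uppercase.
same_upper = "i".upper() == DOTLESS_I.upper()

# Round trips do not restore the original code points.
lower_round_trip = DOTLESS_I.upper().lower()  # dotless i -> I -> i
upper_round_trip = DOTTED_I.lower().upper()   # dotted I -> <i, dot> -> <I, dot>

print(same_upper)
print(lower_round_trip == DOTLESS_I)
print(upper_round_trip == DOTTED_I)
```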

    Ignore these facts at your peril and at the peril of the customers
    who depend on your implementations.

    --Ken


