Case mapping of dotless lowercase letters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 15 2003 - 10:15:19 EST

    I have a minor problem related to the case folding (for searches) of dotless
    lowercase letters, and I don't know why there's no case mapping defined for
    them, when performing full case folding (I have no problem for simple case
    mappings).

    We currently have these full mappings for uppercase letters:

    0049; C; 0069; # LATIN CAPITAL LETTER I
            -> LATIN SMALL LETTER I
    0049; T; 0131; # LATIN CAPITAL LETTER I
            -> LATIN SMALL LETTER DOTLESS I

    0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
            -> LATIN SMALL LETTER I, COMBINING DOT ABOVE
    0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
            -> LATIN SMALL LETTER I
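
    As a rough illustration of the entries listed above, the following Python
    sketch inspects the locale-neutral folding through the built-in
    str.casefold(), which follows the default case folding of the Unicode
    Standard and does not apply the Turkic "T" entries. The constant names and
    the "folded" dictionary are illustrative assumptions, not part of any
    specification.

```python
# Code points discussed in the message.
CAPITAL_I = "\u0049"            # LATIN CAPITAL LETTER I
CAPITAL_I_DOT = "\u0130"        # LATIN CAPITAL LETTER I WITH DOT ABOVE
SMALL_I = "\u0069"              # LATIN SMALL LETTER I
SMALL_DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I
COMBINING_DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE

# str.casefold() applies the default (C + F) folding; no Turkic tailoring.
folded = {
    "CAPITAL_I": CAPITAL_I.casefold(),
    "CAPITAL_I_DOT": CAPITAL_I_DOT.casefold(),
    "SMALL_DOTLESS_I": SMALL_DOTLESS_I.casefold(),
}

for name, value in folded.items():
    # Print each folded result as a list of code points.
    print(name, "->", " ".join(f"{ord(ch):04X}" for ch in value))
```

    The dotless i is the only one of the three that is left untouched by the
    default folding.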

    I would have expected to find these mappings:

    0131; F; 0069; # LATIN SMALL LETTER DOTLESS I
            -> LATIN SMALL LETTER I
    0131; T; 0131; # LATIN SMALL LETTER DOTLESS I
            -> LATIN SMALL LETTER DOTLESS I

    The rationale being that the locale-neutral mappings would not differentiate
    the "standard" small letter (soft-dotted) i, and the "Turkic" small letter
    dotless i, for the same reason that they do not differentiate their
    uppercase versions; and that the "Turkic" mappings should maintain this
    difference in both lowercase and uppercase pairs of letters.
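
    To make the intent of the expected entries concrete, here is a minimal
    Python sketch. The tables and the fold() function with its "turkic" and
    "proposed" flags are hypothetical names invented for this illustration;
    the "proposed" tables mirror the entries suggested in this message, not
    anything present in CaseFolding.txt. Characters outside the tables simply
    fall back to str.casefold().

```python
# Existing entries for the dotted/dotless i family (from the listing above).
DEFAULT_FOLD = {"\u0049": "\u0069", "\u0130": "\u0069\u0307"}
TURKIC_FOLD = {"\u0049": "\u0131", "\u0130": "\u0069"}

# Hypothetical entries proposed in this message (NOT in CaseFolding.txt).
PROPOSED_DEFAULT = {"\u0131": "\u0069"}
PROPOSED_TURKIC = {"\u0131": "\u0131"}


def fold(text: str, turkic: bool = False, proposed: bool = False) -> str:
    """Fold text with the selected tables; other characters use casefold()."""
    table = dict(DEFAULT_FOLD)
    if proposed:
        table.update(PROPOSED_DEFAULT)
    if turkic:
        # Turkic entries override the locale-neutral ones.
        table.update(TURKIC_FOLD)
        if proposed:
            table.update(PROPOSED_TURKIC)
    return "".join(table[ch] if ch in table else ch.casefold() for ch in text)


if __name__ == "__main__":
    for turkic in (False, True):
        for proposed in (False, True):
            results = [fold(ch, turkic, proposed) for ch in "Ii\u0131"]
            print(f"turkic={turkic} proposed={proposed}: {results}")
```

    With proposed=True, the locale-neutral folding treats I, i and the dotless
    i alike, while the Turkic folding still keeps i and the dotless i apart.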

    This is quite irritating, because original strings that are distinct under
    case folding no longer remain distinct under case folding if they are first
    converted to uppercase. Of course the mapping below would be a no-op:

    0131; T; 0131; # LATIN SMALL LETTER DOTLESS I
            -> LATIN SMALL LETTER DOTLESS I

    but it would be needed in Turkic languages to override the locale-neutral
    full case mapping:

    0131; F; 0069; # LATIN SMALL LETTER DOTLESS I
            -> LATIN SMALL LETTER I
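
    The loss of distinction after uppercasing can be observed with Python's
    built-in, locale-neutral str.upper() and str.casefold(). The two sample
    strings are arbitrary and were chosen only because they differ in the
    dotted versus dotless i.

```python
dotted = "kit"         # contains U+0069 LATIN SMALL LETTER I
dotless = "k\u0131t"   # contains U+0131 LATIN SMALL LETTER DOTLESS I

# Folding the original strings keeps them distinct.
fold_direct = (dotted.casefold(), dotless.casefold())

# Uppercasing first maps both letters to U+0049, so the folds collide.
fold_after_upper = (dotted.upper().casefold(), dotless.upper().casefold())

print("direct fold equal:     ", fold_direct[0] == fold_direct[1])
print("fold after upper equal:", fold_after_upper[0] == fold_after_upper[1])
```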

    In fact there are also occurrences where a small dotless i is used in
    non-Turkic languages, where both versions compare equal, notably when
    there is another diacritic above that soft-dotted letter.

    With a diacritic above, the letters should consistently compare equal with
    case folding in non-Turkic languages, but they should still compare equal
    in Turkic languages in that case (the mapping proposed above will not
    detect this, meaning that a lowercase letter i with a diacritic above
    should always be encoded with the standard (soft-dotted) i).
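
    As a small illustration of the diacritic case, the sketch below compares a
    dotted and a dotless i, each followed by U+0301 COMBINING ACUTE ACCENT,
    using the locale-neutral str.casefold() plus NFC normalization. The helper
    name caseless_key() is an assumption made for this example, and the acute
    accent is just one arbitrary diacritic above.

```python
import unicodedata

ACUTE = "\u0301"                 # COMBINING ACUTE ACCENT
dotted_acute = "i" + ACUTE       # soft-dotted i with a diacritic above
dotless_acute = "\u0131" + ACUTE # dotless i with the same diacritic


def caseless_key(text: str) -> str:
    """Default case folding followed by NFC, used as a comparison key."""
    return unicodedata.normalize("NFC", text.casefold())


# The two spellings stay distinct under the current folding...
same_direct = caseless_key(dotted_acute) == caseless_key(dotless_acute)

# ...but collapse once both have been uppercased to I + acute.
same_after_upper = (
    caseless_key(dotted_acute.upper()) == caseless_key(dotless_acute.upper())
)

print("equal when folded directly:", same_direct)
print("equal after uppercasing:   ", same_after_upper)
```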

    No such case folding issue occurs for the lowercase German sharp S
    (ess-tsett), which is correctly mapped with this full case folding:

    00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
            -> LATIN SMALL LETTER S, LATIN SMALL LETTER S

    (this is shown as a proof that case foldings can be defined for lowercase
    letters, and that conforming applications that use case folding must not
    rely on the "Ll" general category to see if case folding can be avoided, and
    not even on the absence of a simple lowercase mapping in the main
    UnicodeData.txt file).
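
    The point about the sharp S can be checked with Python's standard
    unicodedata module and string methods. The helper needs_folding() is a
    hypothetical name used only for this sketch; it tests the folding result
    directly rather than relying on the general category.

```python
import unicodedata

SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S

category = unicodedata.category(SHARP_S)  # general category of the letter
simple_lower = SHARP_S.lower()            # lowercasing leaves it unchanged
full_fold = SHARP_S.casefold()            # full case folding expands it


def needs_folding(ch: str) -> bool:
    """True if default case folding changes the character."""
    return ch.casefold() != ch


print("category:", category)
print("lower() unchanged:", simple_lower == SHARP_S)
print("casefold():", full_fold)
print("needs folding:", needs_folding(SHARP_S))
```

    A shortcut that skips folding for every "Ll" character would therefore
    leave the sharp S unfolded.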

    The same comment applies to the difference between the standard
    "soft-dotted" j and the new dotless j...


    This archive was generated by hypermail 2.1.5 : Mon Dec 15 2003 - 10:56:43 EST