RE: Case Mapping Definitions (was: Adding Lowercase Letters)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed May 09 2007 - 13:40:59 CDT

    Richard Wordingham wrote:
    > Philippe Verdy wrote on Tuesday, May 08, 2007 at 8:57 PM
    > > Are we guaranteed to have, with existing normative Unicode definitions
    > > and stability rules, for every string S in a locale L, the following
    > > equalities starting at some current or past version of the Unicode
    > > standard and in all future versions:
    > >
    > > toCaseFold(toLowerCase(S, L), L)
    > > = toCaseFold(toUpperCase(S, L), L)
    > > = toCaseFold(toTitleCase(S, L), L)
    > >
    > > Are there existing exceptions?
    >
    > Yes. U+0131 LATIN SMALL LETTER DOTLESS I lowercases and casefolds to
    > itself, but uppercases and titlecases to U+0049 LATIN CAPITAL LETTER I,
    > which then casefolds in the default casefolding to U+0069 LATIN SMALL
    > LETTER I.
    >
    > U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE misbehaves similarly (mutatis
    > mutandis) in the default simple mappings.

    Hmmm. Although I remembered the effect of lowercase and uppercase/titlecase
    mappings on these letters, I did not remember that this applied to the case
    folding mapping.

    This is really unfortunate, because case folding should exactly erase the
    effect of these differences, i.e.:
    - When locale L is Turkish or Azeri, all case mappings should preserve the
    difference between the dotted and dotless letters i.
    - When locale L is neutral or other than Turkish or Azeri, the case folding
    should map all four letters to the same letter, ignoring the soft dot.
    - Case folding does not have to produce lowercase or uppercase; it just has
    to be consistent and return one string from each equivalence class of
    strings that are mapped to it (i.e. each equivalence class should contain
    one member whose identity is not changed by the case folding).
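
The exception Richard describes can be reproduced with a minimal sketch in Python, assuming its built-in `str.lower`, `str.upper`, `str.title` and `str.casefold` methods, which apply the default (locale-independent) full mappings rather than any Turkish or Azeri tailoring. The helper names `case_folds` and `folds_agree` are hypothetical, chosen only for this illustration.

```python
def case_folds(s: str) -> dict:
    """Case-fold the lowercase, uppercase and titlecase forms of s.

    Uses Python's default, locale-independent mappings, so the locale
    parameter L from the equalities above is implicitly "neutral".
    """
    return {
        "lower": s.lower().casefold(),
        "upper": s.upper().casefold(),
        "title": s.title().casefold(),
    }


def folds_agree(s: str) -> bool:
    """True if all three folded forms of s are the same string."""
    return len(set(case_folds(s).values())) == 1


if __name__ == "__main__":
    for sample in ["i", "Stra\u00dfe", "\u0131"]:
        print(ascii(sample), folds_agree(sample), case_folds(sample))
```

For U+0131 the lowercase path stays at U+0131 while the uppercase and titlecase paths go through U+0049 and fold to U+0069, so the three results disagree. Because Python uses the full mappings, this sketch does not show the U+0130 problem, which Richard attributes to the default simple mappings.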

    Note: I am not speaking here about the case mappings of individual
    characters in the UCD, but about the general algorithm that works on any
    Unicode text, even if such text contains "defective" sequences: this is what
    nameprep for IDN needs to work on, because it handles strings (domain name
    labels), not just individual characters, and only at this level can the
    effect on canonically equivalent input strings be guaranteed, which is
    needed to make the nameprep process compliant with Unicode rules.
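
As an illustration of why the guarantee has to hold at the string level, here is a hedged sketch in Python using the standard `unicodedata.normalize` function. It follows the Unicode definition of canonical caseless matching, NFD(toCasefold(NFD(X))), and is not the nameprep profile itself (which has its own mapping tables and uses NFKC). The function name `canonical_casefold` is hypothetical.

```python
import unicodedata


def canonical_casefold(s: str) -> str:
    """Fold case so that canonically equivalent inputs give identical output.

    Sketch of NFD(toCasefold(NFD(s))); this is an assumption about what a
    string-level folding should look like, not the nameprep algorithm.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


if __name__ == "__main__":
    precomposed = "\u00c9"   # LATIN CAPITAL LETTER E WITH ACUTE
    decomposed = "E\u0301"   # E followed by COMBINING ACUTE ACCENT

    # Plain case folding keeps the two spellings distinct as code point sequences.
    print(precomposed.casefold() == decomposed.casefold())
    # Folding between normalization steps makes them compare equal.
    print(canonical_casefold(precomposed) == canonical_casefold(decomposed))
```

Folding character by character leaves the precomposed and decomposed spellings as different code point sequences, so a comparison that is supposed to respect canonical equivalence has to normalize around the folding step.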


