Re: Differences between UnicodeData.txt and SpecialCasing.txt Case Mappings

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 21 2006 - 07:26:38 CST


    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    > To keep things 'simple', a process performing a 'full default case mapping'
    > is not a Unicode-conformant process in the sense of TUS 4.0 Section 3.2 C9.
    > (I gave bad advice in the past because I thought it would be.) The default
    > case mappings work on strings of characters, not strings of characters
    > modulo canonical equivalence, and only work properly if the ypogegrammeni
    > containing character is not followed by a character of lesser non-zero
    > combining class. (Currently - Unicode 5.0 - all other combining characters
    > have lesser non-zero combining class, but this may well not be true in
    > Unicode 5.1.)
    >
    > Note that U+0131 LATIN SMALL LETTER DOTLESS I does not case-fold with
    > anything (except in the 'standard' Turkish customisation) - I'm told this
    > peculiar behaviour will be documented in TUS 5.0.

    If the conversion of a combining small iota subscript is to a non-combining capital iota, then this is not a simple case mapping because it breaks canonical equivalence, and this should not occur in a full default case mapping, even if this involves only a one-to-one mapping.
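
    The effect can be checked directly. The following is a minimal sketch, assuming Python 3 and its standard unicodedata module (neither is mentioned in this thread), with a combining acute accent picked arbitrarily as the extra combining character. It uppercases two canonically equivalent spellings of an alpha with iota subscript and compares the results; the helper names are only illustrative.

```python
import unicodedata

# Two canonically equivalent spellings of "alpha + iota subscript + acute":
precomposed = "\u1fb3\u0301"       # alpha with ypogegrammeni, combining acute
decomposed = "\u03b1\u0301\u0345"  # alpha, combining acute, combining ypogegrammeni

def nfd(s):
    """Canonical decomposition, used here to compare modulo canonical equivalence."""
    return unicodedata.normalize("NFD", s)

def same_after_upper(a, b):
    """True if uppercasing a and b gives canonically equivalent results."""
    return nfd(a.upper()) == nfd(b.upper())

print(nfd(precomposed) == nfd(decomposed))        # are the inputs equivalent?
print(same_after_upper(precomposed, decomposed))  # are the uppercased outputs?
```

    With the case mappings built into Python's str.upper(), the inputs compare as equivalent but the outputs do not: the accent ends up after the capital iota in one result and after the capital alpha in the other.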

    My opinion is that the combining small iota subscript should have NO default (one-to-one) mapping in the main UCD file. I'd like to have details about which *precomposed* Greek letters with an iota subscript have no corresponding case mapping to another *precomposed* Greek letter with an iota subscript.

    For example, as long as:

    * SMALL GREEK LETTER ALPHA WITH IOTA SUBSCRIPT, and
    * CAPITAL GREEK LETTER ALPHA WITH IOTA SUBSCRIPT

    are associated with corresponding case mappings, a simple case conversion (based on one-to-one mappings) will not break the canonical equivalence, as long as:

    ==> the COMBINING GREEK SMALL IOTA SUBSCRIPT has NO default case mapping.

    If you map this combining subscript to a non-combining capital iota, then the case mapping algorithm MUST also decompose the precomposed small or capital Greek letters with iota subscript, and the mapping algorithm must reorder the combining iota to the end of the combining sequence before applying its mapping to a capital iota. This means that a simple case mapping algorithm CANNOT apply only simple one-to-one rules without reordering.

    And this process will transform one combining sequence into two combining sequences and is not reversible (a non-combining capital iota will map only to a non-combining small iota, without reconverting it to a combining iota, as this would break the canonical equivalence). For this reason, to be a conforming process, the following eight sequences:

    * SMALL GREEK LETTER ALPHA WITH IOTA SUBSCRIPT, <optional other combining characters>
    * SMALL GREEK LETTER ALPHA, COMBINING GREEK IOTA SUBSCRIPT, <optional other combining characters>
    * SMALL GREEK LETTER ALPHA, <optional other combining characters>, SMALL GREEK LETTER IOTA
    * SMALL GREEK LETTER ALPHA, <optional other combining characters>, CAPITAL GREEK LETTER IOTA
    * CAPITAL GREEK LETTER ALPHA WITH IOTA SUBSCRIPT, <optional other combining characters>
    * CAPITAL GREEK LETTER ALPHA, COMBINING GREEK IOTA SUBSCRIPT, <optional other combining characters>
    * CAPITAL GREEK LETTER ALPHA, <optional other combining characters>, SMALL GREEK LETTER IOTA
    * CAPITAL GREEK LETTER ALPHA, <optional other combining characters>, CAPITAL GREEK LETTER IOTA

    will all have the SAME mappings to folding-case or lowercase or titlecase or uppercase:

    * SMALL GREEK LETTER ALPHA, <optional other combining characters>, SMALL GREEK LETTER IOTA (for the lowercase or case-folding mappings)
    * CAPITAL GREEK LETTER ALPHA, <optional other combining characters>, SMALL GREEK LETTER IOTA (for the titlecase mapping)
    * CAPITAL GREEK LETTER ALPHA, <optional other combining characters>, CAPITAL GREEK LETTER IOTA (for the uppercase mapping)

    With this last rule, the case mapping algorithms are Unicode conforming processes (that remove the distinction between subscript and normal iotas). As a semantic distinction is removed, it should only be used within processes that assume a compatibility equivalence.
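
    The rule above can be sketched as follows. This is only an illustration under stated assumptions: Python 3 with its standard unicodedata module, hypothetical helper names (expand_iota, to_lower, to_upper, to_title), a combining acute accent standing in for the optional other combining characters, and no combining character with a higher combining class than the iota subscript, so that NFD places the subscript last in its sequence.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI (iota subscript)
SMALL_IOTA = "\u03b9"     # GREEK SMALL LETTER IOTA

def expand_iota(s):
    # NFD splits precomposed letters and sorts combining marks by combining
    # class, so the subscript (class 240) follows every mark of lower class.
    # It is then swapped for a spacing small iota.
    return unicodedata.normalize("NFD", s).replace(YPOGEGRAMMENI, SMALL_IOTA)

def to_lower(s):
    return expand_iota(s).lower()

def to_upper(s):
    return expand_iota(s).upper()

def to_title(s):
    # Treats the whole string as one word: first character upper, rest lower.
    low = to_lower(s)
    return low[:1].upper() + low[1:]

ACUTE = "\u0301"  # stands in for <optional other combining characters>
eight = [
    "\u1fb3" + ACUTE, "\u03b1" + YPOGEGRAMMENI + ACUTE,
    "\u03b1" + ACUTE + "\u03b9", "\u03b1" + ACUTE + "\u0399",
    "\u1fbc" + ACUTE, "\u0391" + YPOGEGRAMMENI + ACUTE,
    "\u0391" + ACUTE + "\u03b9", "\u0391" + ACUTE + "\u0399",
]
for name, fn in (("lower", to_lower), ("title", to_title), ("upper", to_upper)):
    print(name, len({fn(s) for s in eight}))  # number of distinct results
```

    Each mapping should report a single distinct result for the eight inputs. The results are left in NFD; a caller expecting NFC would normalize them afterwards.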

    If the conforming process only assumes a canonical equivalence, iota subscripts should NEVER be transformed to capital iotas, and the simple mapping should keep the iota subscripts as subscripts even in the case of precomposed letters (however the base letter in the composition may of course be mapped to another case in a precomposition or in a decomposed combining sequence, depending on the expected NFC or NFD form of the result).
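
    A mapping of this second kind might look like the sketch below, again assuming Python 3 and its unicodedata module. The function name upper_keep_subscript and its form parameter are hypothetical. Note that str.upper() applies Python's full mapping to each decomposed character, which is used here in place of a strict one-to-one table; only the iota subscript is exempted.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI (iota subscript)

def upper_keep_subscript(s, form="NFC"):
    """Uppercase the base letters but leave every iota subscript as a subscript.

    The input is decomposed first, so precomposed and decomposed spellings
    are treated alike; the result is returned in the requested form.
    """
    decomposed = unicodedata.normalize("NFD", s)
    mapped = "".join(
        ch if ch == YPOGEGRAMMENI else ch.upper()  # subscript is never mapped
        for ch in decomposed
    )
    return unicodedata.normalize(form, mapped)

small = "\u1fb3"  # small alpha with iota subscript, precomposed
print(upper_keep_subscript(small) == "\u1fbc")              # precomposed capital
print(upper_keep_subscript(small, "NFD") == "\u0391\u0345")  # decomposed capital
```

    Because the subscript stays a combining character, canonically equivalent inputs give canonically equivalent outputs, whatever other combining characters are present.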

    Any other case mapping (anywhere else in Unicode codepoints/characters) that includes a transformation of a combining character into a non-combining one must then be studied with extreme care.

    The same care must be taken if a conforming process filters out any base character (such as controls or other ignorable characters, for example during collation, or in some Nameprep algorithm, for IDN or similar applications).

    Another example: a transformation process that removes all combining characters from a string and also applies a case mapping is very risky. The iota subscript MUST NOT be removed unless a base iota letter would be removed as well by this process (instead, the subscript may be converted to a non-combining letter as indicated above). Otherwise this process will break the canonical equivalence of the result, and so such a process will not be conforming to Unicode.
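
    As a rough sketch of such a process (assuming Python 3 and its unicodedata module; the name strip_marks_upper is hypothetical), the subscript is converted to a spacing iota before the other combining characters are dropped:

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI (iota subscript)
SMALL_IOTA = "\u03b9"     # GREEK SMALL LETTER IOTA

def strip_marks_upper(s):
    """Drop combining characters and uppercase, but keep iota subscripts as iotas."""
    out = []
    for ch in unicodedata.normalize("NFD", s):
        if ch == YPOGEGRAMMENI:
            out.append(SMALL_IOTA)  # converted, not removed
        elif unicodedata.combining(ch) == 0:
            out.append(ch)          # combining class 0: kept
        # any other character with a non-zero combining class is dropped
    return "".join(out).upper()

# A subscript spelling and a spacing-iota spelling end up identical:
print(strip_marks_upper("\u1fb3\u0301") == strip_marks_upper("\u03b1\u0301\u03b9"))
```

    The filter here is the canonical combining class, so marks whose class is zero are kept; a real process might filter on the general category instead.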


