Re: Differences between UnicodeData.txt and SpecialCasing.txt Case Mappings

From: Richard Wordingham
Date: Thu Oct 19 2006 - 17:59:43 CST


    Andrew Miller wrote on Thursday, October 19, 2006 11:44 PM

    > There appear to be a number of differences in the case mappings defined in
    > UnicodeData.txt and SpecialCasing.txt

    > Can I just ignore the UnicodeData.txt mappings for these characters, and
    > just use the ones defined in SpecialCasing.txt instead?

    Just using UnicodeData.txt gives one the 'simple default case mappings';
    overriding it with SpecialCasing.txt gives one the 'full default case
    mappings' (TUS 4.0 Section 3.13).
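
    As a rough sketch of that override, assuming only the unconditional
    entries of SpecialCasing.txt belong in the default mapping, the following
    builds both uppercase tables from a few lines in the format of the two
    data files. The names simple_upper_map, full_upper_map and upper_with are
    made up for this illustration, and the embedded data is a tiny excerpt, not
    the real files.

```python
# Tiny excerpts in the format of the two data files (illustration only).
UNICODE_DATA = """\
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
"""
SPECIAL_CASING = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
"""


def simple_upper_map(unicode_data):
    """Simple uppercase mappings: field 12 of each UnicodeData.txt line."""
    table = {}
    for line in unicode_data.splitlines():
        fields = line.split(";")
        if len(fields) >= 15 and fields[12]:
            table[chr(int(fields[0], 16))] = chr(int(fields[12], 16))
    return table


def full_upper_map(simple, special_casing):
    """Override the simple table with unconditional SpecialCasing.txt entries."""
    table = dict(simple)
    for line in special_casing.splitlines():
        data = line.split("#", 1)[0]           # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 4 or not fields[0]:
            continue
        conditions = fields[4] if len(fields) > 4 else ""
        if conditions:                          # language- or context-specific
            continue
        upper = "".join(chr(int(cp, 16)) for cp in fields[3].split())
        table[chr(int(fields[0], 16))] = upper
    return table


def upper_with(text, table):
    """Map each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


SIMPLE_UPPER = simple_upper_map(UNICODE_DATA)
FULL_UPPER = full_upper_map(SIMPLE_UPPER, SPECIAL_CASING)
```

    With these tables the simple mapping leaves U+00DF alone, the full mapping
    expands it to "SS", and the conditional Turkish entry for U+0069 is
    skipped.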

    To keep things 'simple', a process performing a 'full default case mapping'
    is not a Unicode-conformant process in the sense of TUS 4.0 Section 3.2 C9.
    (I gave bad advice in the past because I thought it would be.) The default
    case mappings work on strings of characters, not strings of characters
    modulo canonical equivalence, and only work properly if the character
    containing the ypogegrammeni is not followed by a character of lesser
    non-zero combining class. (Currently, in Unicode 5.0, all other combining
    characters have a lesser non-zero combining class, but this may well not be
    true in Unicode 5.1.)
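
    A small illustration of the problem, using Python's built-in str.upper()
    (which applies the full default mappings) and the unicodedata module. The
    choice of alpha plus U+0301 COMBINING ACUTE ACCENT (combining class 230,
    below the 240 of U+0345) is just one convenient example.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"   # COMBINING GREEK YPOGEGRAMMENI, combining class 240
ACUTE = "\u0301"           # COMBINING ACUTE ACCENT, combining class 230


def nfd(s):
    return unicodedata.normalize("NFD", s)


# Alpha + ypogegrammeni + acute, and its canonically reordered form
# (alpha + acute + ypogegrammeni). The two are canonically equivalent.
original = "\u03b1" + YPOGEGRAMMENI + ACUTE
reordered = nfd(original)

equivalent_before = nfd(original) == nfd(reordered)

# Uppercasing turns U+0345 into the starter U+0399, so the acute ends up on
# the iota in one string and on the alpha in the other.
equivalent_after = nfd(original.upper()) == nfd(reordered.upper())

print(equivalent_before, equivalent_after)
```

    The two inputs are canonically equivalent, but their uppercased forms are
    not, which is the sense in which the mapping is not defined modulo
    canonical equivalence.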

    Note that U+0131 LATIN SMALL LETTER DOTLESS I does not case-fold with
    anything (except in the 'standard' Turkish customisation) - I'm told this
    peculiar behaviour will be documented in TUS 5.0.
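
    The behaviour can be seen with Python's str.casefold(), which applies the
    default (untailored) full case folding. This is only a quick check, not a
    statement about any particular Unicode version's documentation.

```python
DOTLESS_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I

# Its uppercase is plain 'I' ...
uppercases_to_I = DOTLESS_I.upper() == "I"

# ... but under default case folding it matches neither 'I' nor 'i'.
folds_with_I = DOTLESS_I.casefold() == "I".casefold()
folds_to_itself = DOTLESS_I.casefold() == DOTLESS_I

print(uppercases_to_I, folds_with_I, folds_to_itself)
```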

    The Lithuanian and Turkish case mappings only work well for these languages
    and those like them - unusual accents on 'i's will cause total confusion.
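
    For illustration, a deliberately naive Turkish tailoring can be sketched
    by remapping the four 'i' characters before applying the default mapping.
    The helper names turkish_upper and turkish_lower are hypothetical, and the
    sketch ignores the context-sensitive rules in SpecialCasing.txt (for
    example 'I' followed by U+0307 COMBINING DOT ABOVE).

```python
# Remap the dotted/dotless pairs first, then fall back to the default mapping.
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})
_TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkish_upper(text):
    """Naive Turkish uppercasing: i -> U+0130, U+0131 -> I."""
    return text.translate(_TR_UPPER).upper()


def turkish_lower(text):
    """Naive Turkish lowercasing: I -> U+0131, U+0130 -> i."""
    return text.translate(_TR_LOWER).lower()


print(turkish_upper("diyarbak\u0131r"))
print(turkish_lower("D\u0130YARBAKIR"))
```

    Because it has no context rules, this sketch lowercases 'I' followed by
    U+0307 to a dotless i that still carries the dot above, which is the kind
    of confusion accented 'i's cause.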

