More Permanent Faults? - Unicode 5.0 Casefolding

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Thu Jun 08 2006 - 14:00:27 CDT

    There appear to be bugs in the definition of the case-folding function
    toCasefold() as currently defined by
    http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt and Section
    3.13 of TUS 4.1.0. (I am using the latter as I cannot find a more reliable
    draft of Section 3.13 in TUS 5.0.) This matters, for toCasefold() of NFKC
    strings valid in Unicode 5.0 is about to be frozen forever. Should these
    faults be made permanent?

    I have found two groups of NFKC grapheme clusters which fail to match their
    default uppercasings after conversion to NFD in one of the important
    'case-insensitive' matching methods. I haven't reported these problems
    formally yet - I'd like to see what other people think first. It's
    conceivable that I'm the only person bothered by them.

    *Problem 1*

    The first is: <U+0131 LATIN SMALL LETTER DOTLESS I>

    The problem with this only occurs when using the default mappings. A
    different can of worms opens up for Turkic locales - I don't know whether
    the behaviour is fully defined for Turkic locales. This grapheme cluster is
    in all four normalised forms. According to
    http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing-5.0.0d13.txt and
    http://www.unicode.org/Public/5.0.0/ucd/UnicodeData-5.0.0d11.txt , its
    uppercasing (in all locales) is

    <U+0049 LATIN CAPITAL LETTER I>

    which is in all four normal forms.

    To compare these strings for 'canonical caseless matches', one calculates
    NFD(toCasefold(NFD(X))) for each string X. By
    http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt , their
    default casefoldings, whether simple or full, are <U+0131> and <U+0069 LATIN
    SMALL LETTER I> respectively. These are not canonically equivalent. QED.
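
    The mismatch can be reproduced with a short Python sketch. It assumes
    that Python's built-in str.upper() and str.casefold() apply the default
    (non-Turkic) full mappings, and it runs against whatever Unicode version
    the interpreter ships rather than the 5.0 drafts. The helper name
    canonical_caseless_key is made up for illustration.

```python
import unicodedata

def canonical_caseless_key(s: str) -> str:
    """NFD(toCasefold(NFD(s))) using Python's default full case folding."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())

dotless_i = "\u0131"            # LATIN SMALL LETTER DOTLESS I
uppercased = dotless_i.upper()  # default uppercasing

key_lower = canonical_caseless_key(dotless_i)
key_upper = canonical_caseless_key(uppercased)

# Code points of the uppercasing and of the two match keys.
print(f"{ord(uppercased):04X} {ord(key_lower):04X} {ord(key_upper):04X}")
# True would mean the string matches its own uppercasing.
print(key_lower == key_upper)
```

    On the interpreters I am aware of, the keys are U+0131 and U+0069, so
    the comparison reports no match.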

    Incidentally, the definition of default casefolding contradicts the
    definition of casefolding given in TUS 4.1.0 Section 5.18.

    There are two alternative solutions:
    (a) Remove the upper- and title-casings for U+0131 from UnicodeData.txt and
    uncomment the Turkic data for U+0131 in SpecialCasing.txt, also making
    it apply to Azer(baijan)i.
    (b) Add two lines to CaseFolding.txt:

    0131; C; 0069; # LATIN SMALL LETTER DOTLESS I
    0131; T; 0131; # LATIN SMALL LETTER DOTLESS I
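
    The effect of solution (b) can be sketched in Python by layering
    override tables on top of str.casefold(). This assumes the intent is
    that U+0131 folds to U+0069 LATIN SMALL LETTER I by default and to
    itself in Turkic locales. The tables and the names fold and match_key
    are hypothetical. The Turkic table also carries the existing
    "0049; T; 0131" mapping, and leaves out U+0130.

```python
import unicodedata

# Hypothetical override tables modelling the two proposed lines.
DEFAULT_OVERRIDES = {"\u0131": "\u0069"}   # status C: dotless i folds to i
TURKIC_OVERRIDES = {
    "\u0131": "\u0131",   # status T: dotless i folds to itself
    "\u0049": "\u0131",   # existing T mapping: capital I folds to dotless i
}

def fold(s: str, turkic: bool = False) -> str:
    """Character-by-character case folding with the overrides applied first."""
    table = TURKIC_OVERRIDES if turkic else DEFAULT_OVERRIDES
    return "".join(table.get(ch, ch.casefold()) for ch in s)

def match_key(s: str, turkic: bool = False) -> str:
    """NFD(fold(NFD(s))), the canonical caseless match key."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", fold(nfd, turkic))

dotless_i = "\u0131"
capital_i = dotless_i.upper()   # default uppercasing

default_ok = match_key(dotless_i) == match_key(capital_i)
turkic_ok = match_key(dotless_i, turkic=True) == match_key(capital_i, turkic=True)
print(default_ok, turkic_ok)
```

    With these overrides the dotless i matches its uppercasing under both
    tables, which is all the sketch sets out to show.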

    *Problem 2*

    The second group is probably much less troublesome, but is quite awkward.
    There are two plausible NFC and NFKC sequences

    <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0306 COMBINING BREVE>

    <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0304 COMBINING
    MACRON>

    The former might occur if one were using a breve (or brachy) to explicitly
    mark the lack of stress in polytonic Modern Greek, for example in explaining
    the meter of poetry. The second might occur if one decided to redundantly
    mark Classical Greek vowel length - the macron is redundant, for the
    subscript iota implies that the vowel is long. I don't have any examples of
    these combinations. I will work with the latter.

    Converted to NFD, it yields

    <U+03B1 GREEK SMALL LETTER ALPHA, U+0304, U+0345 COMBINING GREEK
    YPOGEGRAMMENI>

    The default uppercasing (not uncontroversial, but that's a linguistic
    matter) is
    <U+0391 GREEK CAPITAL LETTER ALPHA, U+0304, U+0399 GREEK CAPITAL LETTER
    IOTA>,
    whose NFC and NFKC form is

    <U+1FB9 GREEK CAPITAL LETTER ALPHA WITH MACRON, U+0399 GREEK CAPITAL LETTER
    IOTA>

    Now, the case-insensitive match whose outcome is guaranteed to be stable
    under the case-folding stability policy
    (http://www.unicode.org/standard/stability_policy.html) is given by
    toCasefold() of NFKC strings.

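    toCasefold() of the starting point, <U+1FB3, U+0304>, is <U+03B1 GREEK
    SMALL LETTER ALPHA, U+03B9 GREEK SMALL LETTER IOTA, U+0304>, while
    toCasefold() of <U+1FB9, U+0399> is <U+1FB1 GREEK SMALL LETTER ALPHA WITH
    MACRON, U+03B9>. But the casefolded forms are not canonically equivalent!
    In NFD they are <U+03B1, U+03B9, U+0304> and <U+03B1, U+0304, U+03B9>, and
    U+03B9, having combining class 0, cannot be reordered past U+0304.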
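
    The same computation can be followed in Python, again assuming that
    str.upper() and str.casefold() apply the default full mappings. The
    starting sequence is folded exactly as written. The sketch does not
    renormalise it first, because a normalisation pass could recompose it
    differently. The helper names nfd and show are made up.

```python
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def show(s: str) -> str:
    """Render a string as space-separated code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

start = "\u1FB3\u0304"   # alpha with ypogegrammeni, then combining macron

# Default uppercasing of the NFD form, recomposed to NFC.
upper = unicodedata.normalize("NFC", nfd(start).upper())

folded_start = start.casefold()   # toCasefold() of the sequence as written
folded_upper = upper.casefold()   # toCasefold() of the uppercased form

print(show(upper))          # 1FB9 0399
print(show(folded_start))   # 03B1 03B9 0304
print(show(folded_upper))   # 1FB1 03B9

# Canonical equivalence is tested by comparing NFD forms.
equivalent = nfd(folded_start) == nfd(folded_upper)
print(equivalent)           # False
```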

    The problem here is that the definition of toCasefold() offers no hint that
    when U+0345 COMBINING GREEK YPOGEGRAMMENI, which may be hidden in a
    precomposed form, is detached as U+03B9, it should be moved to after any
    immediately following characters of non-zero combining class (and characters
    that decompose solely to such: U+0F73 and U+0F75). SpecialCasing.txt at
    least implies that this should be done when a U+0399 detaches itself, but
    I find it hard to read that as normative.
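
    One way such a rule might look is sketched below in Python. This is
    only an illustration of the reordering idea, not anything the standard
    defines: the function casefold_moving_iota is hypothetical, it relies
    on str.casefold() for the underlying mappings, and it ignores the
    special case of characters such as U+0F73 and U+0F75.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"
IOTA = "\u03B9"

def casefold_moving_iota(s: str) -> str:
    """Case-fold s, delaying each iota detached from an ypogegrammeni until
    after the run of non-zero combining class characters that follows it."""
    out = []
    pending = 0   # number of detached iotas waiting to be emitted
    for ch in s:
        if pending and unicodedata.combining(ch) == 0:
            out.append(IOTA * pending)   # the combining run has ended
            pending = 0
        folded = ch.casefold()
        has_ypogegrammeni = YPOGEGRAMMENI in unicodedata.normalize("NFD", ch)
        if has_ypogegrammeni and folded.endswith(IOTA):
            out.append(folded[:-1])      # hold the trailing iota back
            pending += 1
        else:
            out.append(folded)
    out.append(IOTA * pending)           # flush any iota left at the end
    return "".join(out)

start = "\u1FB3\u0304"
upper = "\u1FB9\u0399"

moved = casefold_moving_iota(start)
same = (unicodedata.normalize("NFD", moved)
        == unicodedata.normalize("NFD", casefold_moving_iota(upper)))
print(same)
```

    With the iota delayed in this way, the starting sequence and its
    uppercasing fold to canonically equivalent strings.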

    This type of problem does not occur with NF(K)D strings - the combining
    class of U+0345 forces it to the end of the cluster. It is for this reason
    that the formal definitions of canonical and compatibility caseless matches
    use NFD and NFKD respectively.
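
    For comparison, a sketch of the canonical caseless match applied to the
    same pair, assuming Python's str.casefold() as the folding function.
    The helper name canonical_caseless_key is made up.

```python
import unicodedata

def canonical_caseless_key(s: str) -> str:
    """NFD(toCasefold(NFD(s))) using Python's default full case folding."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())

start = "\u1FB3\u0304"   # alpha with ypogegrammeni, then combining macron
upper = "\u1FB9\u0399"   # capital alpha with macron, then capital iota

key_start = canonical_caseless_key(start)
key_upper = canonical_caseless_key(upper)

# NFD puts U+0345 after the macron before folding, so the iota it folds to
# already sits at the end of the cluster.
print(" ".join(f"{ord(c):04X}" for c in key_start))   # 03B1 0304 03B9
print(key_start == key_upper)                         # True
```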

    Richard.


