Re: More Permanent Faults? - Unicode 5.0 Casefolding

From: George W Gerrity (g.gerrity@gwg-associates.com.au)
Date: Mon Jun 12 2006 - 03:41:00 CDT

    On 2006-06-09, at 05:00, Richard Wordingham wrote:

    > There appear to be bugs in the definition of the case-folding
    > function toCasefold() as currently defined by
    > http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt
    > and Section 3.13 of TUS 4.1.0. (I am using the latter as I cannot
    > find a more reliable draft of Section 3.13 in TUS 5.0.) This
    > matters, for toCasefold() of NFKC strings valid in Unicode 5.0 is
    > about to be frozen forever. Should these faults be made permanent?
    >
    > I have found two groups of NFKC grapheme clusters which fail to
    > match their default uppercasings after conversion to NFD in one of
    > the important 'case-insensitive' matching methods. I haven't
    > reported these problems formally yet - I'd like to see what other
    > people think first. It's conceivable that I'm the only person
    > bothered by them.
    >
    > *Problem 1*
    >
    > The first is: <U+0131 LATIN SMALL LETTER DOTLESS I>
    >
    > The problem with this only occurs when using the default mappings.
    > A different can of worms opens up for Turkic locales - I don't know
    > whether the behaviour is fully defined for Turkic locales. This
    > grapheme cluster is in all four normalised forms. According to
    > http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing-5.0.0d13.txt
    > and http://www.unicode.org/Public/5.0.0/ucd/UnicodeData-5.0.0d11.txt ,
    > its uppercasing (in all locales) is
    >
    > <U+0049 LATIN CAPITAL LETTER I>
    >
    > which is in all four normal forms.
    >
    > To compare these strings for 'canonical caseless matches', one
    > calculates NFD(toCasefold(NFD(X))) for each string X. By
    > http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt ,
    > their default casefoldings, whether simple or full, are <U+0131>
    > and <U+0069 LATIN SMALL LETTER I>. These are not canonically
    > equivalent. QED.
    >
    > Incidentally, the definition of default casefolding contradicts the
    > definition of casefolding given in TUS 4.1.0 Section 5.18.
    >
    > There are two alternative solutions:
    > (a) Remove the upper- and title-casings for U+0131 from
    > UnicodeData.txt and uncomment the Turkic data for U+0131 in
    > SpecialCasing.txt, also making it apply to Azer(baijan)i.
    > (b) Add two lines to SpecialCasing.txt:
    >
    > 0131; C; 0069; # LATIN SMALL LETTER DOTLESS I
    > 0131; T; 0131; # LATIN SMALL LETTER DOTLESS I
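
The mismatch in Problem 1 can be reproduced with a short Python sketch. It assumes that Python's built-in str.upper() and str.casefold() are acceptable stand-ins for the default uppercasing and toCasefold(); they follow whatever Unicode version the interpreter ships with, not the 5.0 drafts cited above. The helper names canonical_caseless_key and fold_with_proposal_b are made up for illustration, and the latter only imitates the extra 'C' mapping from solution (b).

```python
import unicodedata


def canonical_caseless_key(s: str) -> str:
    """NFD(toCasefold(NFD(s))), with str.casefold() standing in for toCasefold()."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


def fold_with_proposal_b(s: str) -> str:
    """Hypothetical: same key, plus the mapping U+0131 -> U+0069 from solution (b)."""
    return canonical_caseless_key(s).replace("\u0131", "\u0069")


dotless_i = "\u0131"            # LATIN SMALL LETTER DOTLESS I
uppercased = dotless_i.upper()  # default (locale-independent) uppercasing

# Compare the two strings under the current folding and under the proposal.
default_match = canonical_caseless_key(dotless_i) == canonical_caseless_key(uppercased)
proposed_match = fold_with_proposal_b(dotless_i) == fold_with_proposal_b(uppercased)

print(f"uppercase of U+0131: U+{ord(uppercased):04X}")
print("default keys match:", default_match)
print("keys match with proposal (b):", proposed_match)
```

With the interpreter's built-in data, the first comparison should fail and the second should succeed, which is the behaviour solution (b) is after.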

    Is it a legitimate solution to create a new codepoint for CAPITAL
    DOTLESS I?

    > *Problem 2*
    >
    > The second group is probably much less troublesome, but is quite
    > awkward. There are two plausible NFC and NFKC sequences
    >
    > <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0306
    > COMBINING BREVE>
    >
    > <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0304
    > COMBINING MACRON>
    >
    > The former might occur if one were using a breve (or brachy) to
    > explicitly mark the lack of stress in polytonic Modern Greek, for
    > example in explaining the meter of poetry. The second might occur
    > if one decided to redundantly mark Classical Greek vowel length -
    > the macron is redundant, for the subscript iota implies that the
    > vowel is long. I don't have any examples of these combinations. I
    > will work with the latter.
    >
    > Converted to NFD, it yields
    >
    > <U+03B1 GREEK SMALL LETTER ALPHA, U+0304, U+0345 COMBINING GREEK
    > YPOGEGRAMMENI>
    >
    > The default uppercasing (not uncontroversial, but that's a
    > linguistic matter) is
    > <U+0391 GREEK CAPITAL LETTER ALPHA, U+0304, U+0399 GREEK CAPITAL
    > LETTER IOTA>, whose NFC and NFKC form is
    >
    > <U+1FB9 GREEK CAPITAL LETTER ALPHA WITH MACRON, U+0399 GREEK CAPITAL
    > LETTER IOTA>
    >
    > Now, the case-insensitive match whose outcome is guaranteed to be
    > stable under the case-folding stability policy
    > (http://www.unicode.org/standard/stability_policy.html) is given by
    > toCasefold() of NFKC strings.
    >
    > toCasefold() of the starting point, <U+1FB3, U+0304>, is <U+03B1
    > GREEK SMALL LETTER ALPHA, U+03B9 GREEK SMALL LETTER IOTA, U+0304>,
    > while toCasefold() of <U+1FB9, U+0399> is <U+1FB1 GREEK SMALL LETTER
    > ALPHA WITH MACRON, U+03B9>. But the casefolded forms are not
    > canonically equivalent!
    >
    > The problem here is that the definition of toCasefold() offers no
    > hint that when U+0345 COMBINING GREEK YPOGEGRAMMENI, which may be
    > hidden in a precomposed form, is detached as U+03B9, it should be
    > moved to after any immediately following characters of non-zero
    > combining class (and characters that decompose solely to such:
    > U+0F73 and U+0F75). SpecialCasing.txt has at least an implication
    > that such should be done when a U+0399 detaches itself, but I find
    > it hard to read it as normative.
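
The steps of Problem 2 can be traced in Python as a rough check. This is only a sketch: it assumes str.upper() and str.casefold() approximate the default uppercasing and toCasefold(), it uses the interpreter's bundled Unicode data and not the 5.0 drafts, and it takes the starting sequence <U+1FB3, U+0304> exactly as written above without checking which normalization form it is in. The helper code_points is a made-up name for display only.

```python
import unicodedata


def code_points(s: str) -> str:
    """Render a string as a list of U+XXXX code points (display helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


start = "\u1FB3\u0304"  # alpha with ypogegrammeni + combining macron

# Uppercase the NFD form, then bring the result to NFKC.
nfd = unicodedata.normalize("NFD", start)
uppercased = unicodedata.normalize("NFKC", nfd.upper())

# Fold both strings directly, without an initial NFD pass.
fold_start = start.casefold()
fold_upper = uppercased.casefold()

# Canonical equivalence is tested by comparing the NFD forms of the results.
equivalent = (unicodedata.normalize("NFD", fold_start)
              == unicodedata.normalize("NFD", fold_upper))

print("NFD of start:      ", code_points(nfd))
print("uppercased (NFKC): ", code_points(uppercased))
print("fold of start:     ", code_points(fold_start))
print("fold of uppercased:", code_points(fold_upper))
print("canonically equivalent:", equivalent)
```

In the folded starting string the iota lands before the macron, while in the folded uppercase string it follows the macron, so the two results differ even after NFD.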

    I read Church and NT Greek, but am no expert. However, it seems to me
    that the way to solve the problem is to create a new codepoint, GREEK
    CAPITAL LETTER ALPHA WITH YPOGEGRAMMENI, whose glyph is AI.
    Lowercasing it would yield alpha with ypogegrammeni.

    > This type of problem does not occur with NF(K)D strings - the
    > combining class of U+0345 forces it to the end of the cluster. It
    > is for this reason that the formal definitions of canonical and
    > compatibility caseless matches use NFD and NFKD respectively.
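
The point about NF(K)D strings can be checked the same way. The sketch below again assumes str.casefold() as a stand-in for toCasefold() and relies on the interpreter's own Unicode data; canonical_caseless_key is a hypothetical helper name for NFD(toCasefold(NFD(X))).

```python
import unicodedata


def canonical_caseless_key(s: str) -> str:
    """NFD(toCasefold(NFD(s))), with str.casefold() standing in for toCasefold()."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


start = "\u1FB3\u0304"       # alpha with ypogegrammeni + combining macron
uppercased = "\u1FB9\u0399"  # capital alpha with macron + capital iota

# Canonical combining classes: marks with a higher class sort later in NFD.
ccc_macron = unicodedata.combining("\u0304")
ccc_ypogegrammeni = unicodedata.combining("\u0345")

# NFD exposes U+0345 and orders it after the macron before folding happens.
nfd_start = unicodedata.normalize("NFD", start)
keys_match = canonical_caseless_key(start) == canonical_caseless_key(uppercased)

print("ccc(U+0304) =", ccc_macron, " ccc(U+0345) =", ccc_ypogegrammeni)
print("NFD of start:", " ".join(f"U+{ord(c):04X}" for c in nfd_start))
print("keys match when NFD is applied first:", keys_match)
```

Because U+0345 is already at the end of the cluster when it is folded to U+03B9, the two keys agree here, unlike the direct folding of the composed strings.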


