Re: More Permanent Faults? - Unicode 5.0 Casefolding

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Jun 09 2006 - 15:10:51 CDT


    Those are good questions, and the subjects are tricky. I found it hard to
    follow your text, so bear with me if I have misinterpreted it.

    1. The specification of the process by which the case folding mappings are
    composed has already been fixed in Unicode 5.0, to note that the dotless i
    is [and always has been] an exception. If someone wants a case folding that
    handles Turkic, they have to tailor the case folding mappings to handle the
    dotted and dotless i slightly differently. It would probably be a good idea
    to document this in the data file as well in the future.
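    To make the exception concrete, here is a minimal Python sketch. It assumes
    that Python's str.casefold() implements the default full case folding, and
    turkic_casefold() is a hypothetical helper (not part of any library) that
    applies the two 'T' mappings from CaseFolding.txt before the default folding.

```python
def turkic_casefold(s: str) -> str:
    """Hypothetical Turkic tailoring of case folding (illustrative only).

    Applies the two 'T' mappings listed in CaseFolding.txt,
    U+0049 -> U+0131 and U+0130 -> U+0069, then the default full folding.
    """
    tailored = s.replace("\u0049", "\u0131").replace("\u0130", "\u0069")
    return tailored.casefold()


dotless_i = "\u0131"

# Default mappings: dotless i uppercases to I, but the two fold differently.
default_match = dotless_i.casefold() == dotless_i.upper().casefold()

# With the tailoring, dotless i and its uppercase fold to the same string.
tailored_match = turkic_casefold(dotless_i) == turkic_casefold(dotless_i.upper())

print(default_match, tailored_match)
```

    Under the default mappings the comparison fails, which is the exception
    noted above; the tailored version is only suitable for Turkic text.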

    (The UTC has reviewed the situation with the Turkish i's more times than I
    care to remember. The unfortunate decision, some 75 years ago, to
    distinguish dotted and dotless i in the Turkish orthography causes no end of
    problems. I always have to look back at the table I made in
    http://www.macchiato.com/slides/GlobalizationGotchas.ppt (slide 31) every
    time I try to think about this topic!)

    2. I don't think you're interpreting the stability clause correctly. What it
    says is that if you have a string that is in NFKC form and contains only
    characters from Unicode version X, then its casefold will remain stable in
    versions after X.

    You say:
    > Now, toCasefold of the starting point, <U+1FB3, U+0304>, is <U+03B1 GREEK
    > SMALL LETTER ALPHA, U+03B9 GREEK SMALL LETTER IOTA, U+0304>, while
    > toCasefold of <1FB9, U+0399> is <U+1FB1 GREEK SMALL LETTER ALPHA WITH
    > MACRON, U+03B9>. But the casefolded forms are not canonically equivalent!

    > The problem here is that the definition of toCasefold() offers no hint that
    > when U+0345 COMBINING GREEK YPOGEGRAMMENI, which may be hidden in a

    But the sources you are starting with are not canonically equivalent:

    toNFD(U+1FB3 U+0304) = U+03B1 U+0304 U+0345
    toNFD(U+1FB9 U+0399) = U+0391 U+0304 U+0399

    and Section 3.13 explicitly says:

    As described earlier, normally caseless matching should also use
    normalization, which means using one of the following operations:
    • A string X is a canonical caseless match for a string Y if and only if:
      NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
    • A string X is a compatibility caseless match for a string Y if and only if:
      NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) =
      NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))
    The invocations of normalization before folding in the above definitions are
    to catch very infrequent edge cases. Normalization is not required before
    folding, except for the character U+0345 and any characters that have it as
    part of their decomposition, such as U+1FC3. In practice, optimized versions
    of implementations can catch these special cases and, thereby, avoid an
    extra normalization.

    Thus it is already noted that in certain (unusual) cases, normalization
    is required to preserve canonical equivalence.
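    The two definitions quoted from Section 3.13 can be sketched in Python as
    follows. This is an illustration under assumptions, not a reference
    implementation: it treats str.casefold() as toCasefold() with the default
    full mappings, and the function names are my own.

```python
import unicodedata


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def nfkd(s: str) -> str:
    return unicodedata.normalize("NFKD", s)


def canonical_caseless_match(x: str, y: str) -> bool:
    # NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())


def compatibility_caseless_match(x: str, y: str) -> bool:
    # NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) compared for X and Y
    def key(s: str) -> str:
        return nfkd(nfkd(nfd(s).casefold()).casefold())

    return key(x) == key(y)


# The two Greek strings from the quoted message.
x = "\u1FB3\u0304"  # alpha with ypogegrammeni, combining macron
y = "\u1FB9\u0399"  # capital alpha with macron, capital iota

# Folding without the initial NFD step: results are not canonically equivalent.
folded_without_nfd_match = nfd(x.casefold()) == nfd(y.casefold())

print(folded_without_nfd_match)
print(canonical_caseless_match(x, y))
print(compatibility_caseless_match(x, y))
```

    The initial NFD step is what moves U+0345 to the end of the cluster before
    it is folded to U+03B9, so the two strings match under both definitions
    even though folding them directly does not.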

    But you also note this at the end of your message, so I'm not sure what the
    issue is.

    If you could boil down the cases you think represent a problem to create a
    minimal test case, it might be easier to see what the issue is.
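
    As one possible shape for such a minimal test case, the Python sketch below
    dumps the code points at each stage of the canonical caseless match. It
    assumes str.casefold() stands in for toCasefold() with the default full
    mappings; trace() and cps() are hypothetical helper names.

```python
import unicodedata


def cps(s: str) -> str:
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


def trace(s: str) -> dict:
    """Return the code points of s at each stage of the matching pipeline."""
    decomposed = unicodedata.normalize("NFD", s)
    folded = decomposed.casefold()
    return {
        "input": cps(s),
        "toCasefold(input)": cps(s.casefold()),
        "NFD(input)": cps(decomposed),
        "toCasefold(NFD(input))": cps(folded),
        "NFD(toCasefold(NFD(input)))": cps(unicodedata.normalize("NFD", folded)),
    }


# The strings discussed in this thread: the Greek pair and the dotless i pair.
for sample in ("\u1FB3\u0304", "\u1FB9\u0399", "\u0131", "\u0049"):
    for stage, value in trace(sample).items():
        print(f"{stage:30} {value}")
    print()
```

    Comparing the last line of each dump shows directly whether two inputs are
    a canonical caseless match.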

    Mark

    On 6/8/06, Richard Wordingham <richard.wordingham@ntlworld.com> wrote:
    > There appear to be bugs in the definition of the case-folding function
    > toCasefold() as currently defined by
    > http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt and
    > Section 3.13 of TUS 4.1.0. (I am using the latter as I cannot find a more
    > reliable draft of Section 3.13 in TUS 5.0.) This matters, for toCasefold()
    > of NFKC strings valid in Unicode 5.0 is about to be frozen forever. Should
    > these faults be made permanent?
    >
    > I have found two groups of NFKC grapheme clusters which fail to match their
    > default uppercasings after conversion to NFD in one of the important
    > 'case-insensitive' matching methods. I haven't reported these problems
    > formally yet - I'd like to see what other people think first. It's
    > conceivable that I'm the only person bothered by them.
    >
    > *Problem 1*
    >
    > The first is: <U+0131 LATIN SMALL LETTER DOTLESS I>
    >
    > The problem with this only occurs when using the default mappings. A
    > different can of worms opens up for Turkic locales - I don't know whether
    > the behaviour is fully defined for Turkic locales. This grapheme cluster is
    > in all four normalised forms. According to
    > http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing-5.0.0d13.txt and
    > http://www.unicode.org/Public/5.0.0/ucd/UnicodeData-5.0.0d11.txt , its
    > uppercasing (in all locales) is
    >
    > <U+0049 LATIN CAPITAL LETTER I>
    >
    > which is in all four normal forms.
    >
    > To compare these strings for 'canonical caseless matches', one calculates
    > NFD(toCasefold(NFD())) of the strings. By
    > http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt , their
    > default casefoldings, whether simple or full, are <U+0131> and <U+0069 LATIN
    > SMALL LETTER I>. These are not canonically equivalent. QED.
    >
    > Incidentally, the definition of default casefolding contradicts the
    > definition of casefolding given in TUS 4.1.0 Section 5.18.
    >
    > There are two alternative solutions:
    > (a) Remove the upper- and title-casings for U+0131 from UnicodeData.txt and
    > uncomment the Turkic data for U+0131 in SpecialCasing.dat, also making
    > it apply to Azer(baijan)i.
    > (b) Add two lines to SpecialCasing.dat:
    >
    > 0131; C; 0061; # LATIN SMALL LETTER DOTLESS I
    > 0131; T; 0131; # LATIN SMALL LETTER DOTLESS I
    >
    > *Problem 2*
    >
    > The second group is probably much less troublesome, but is quite awkward.
    > There are two plausible NFC and NFKC sequences
    >
    > <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0306 COMBINING BREVE>
    >
    > <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0304 COMBINING
    > MACRON>
    >
    > The former might occur if one were using a breve (or brachy) to explicitly
    > mark the lack of stress in polytonic Modern Greek, for example in explaining
    > the meter of poetry. The second might occur if one decided to redundantly
    > mark Classical Greek vowel length - the macron is redundant, for the
    > subscript iota implies that the vowel is long. I don't have any examples of
    > these combinations. I will work with the latter.
    >
    > Converted to NFD, it yields
    >
    > <03B1 GREEK SMALL LETTER ALPHA, U+0304, 0345 COMBINING GREEK YPOGEGRAMMENI>
    >
    > The default uppercasing (not uncontroversial, but that's a linguistic
    > matter) is
    > <0391 GREEK CAPITAL LETTER ALPHA, U+0304, U+0399 GREEK CAPITAL LETTER IOTA>,
    > whose NFC and NFKC form is
    >
    > <1FB9 GREEK CAPITAL LETTER ALPHA WITH MACRON, U+0399 GREEK CAPITAL LETTER
    > IOTA>
    >
    > Now, the case-insensitive match whose outcome is guaranteed to be stable
    > under the case-folding stability policy
    > (http://www.unicode.org/standard/stability_policy.html) is given by
    > toCasefold() of NFKC strings.
    >
    > Now, toCasefold of the starting point, <U+1FB3, U+0304>, is <U+03B1 GREEK
    > SMALL LETTER ALPHA, U+03B9 GREEK SMALL LETTER IOTA, U+0304>, while
    > toCasefold of <1FB9, U+0399> is <U+1FB1 GREEK SMALL LETTER ALPHA WITH
    > MACRON, U+03B9>. But the casefolded forms are not canonically equivalent!
    >
    > The problem here is that the definition of toCasefold() offers no hint that
    > when U+0345 COMBINING GREEK YPOGEGRAMMENI, which may be hidden in a
    > precomposed form, is detached as U+03B9, it should be moved to after any
    > immediately following characters of non-zero combining class (and characters
    > that decompose solely to such - U+0F73 and U+0F75.) SpecialCasing.txt has
    > at least an implication that such should be done when a U+0399 detaches
    > itself, but I find it hard to read it as normative.
    >
    > This type of problem does not occur with NF(K)D strings - the combining
    > class of U+0345 forces it to the end of the cluster. It is for this reason
    > that the formal definitions of canonical and compatibility caseless matches
    > use NFD and NFKD respectively.
    >
    > Richard.
    >
    >
    >


