Re: More Permanent Faults? - Unicode 5.0 Casefolding

From: Mark Davis ([email protected])
Date: Fri Jun 09 2006 - 15:10:51 CDT

Next message: Richard Wordingham: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"

Previous message: Karl Pentzlin: "Re: Glyphs for German quotation marks"
In reply to: Richard Wordingham: "More Permanent Faults? - Unicode 5.0 Casefolding"
Next in thread: Richard Wordingham: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"
Reply: Richard Wordingham: "Re: More Permanent Faults? - Unicode 5.0 Casefolding"

Those are good questions, and the subjects are tricky. I found it hard to
follow your text, so bear with me if I have misinterpreted it.

1. The specification of the process by which the case folding mappings are
composed has already been fixed in Unicode 5.0, to note that the dotless i
is [and always has been] an exception. Anyone who wants a case folding that
handles Turkic has to tailor the case folding mappings to handle the Turkic
i's slightly differently. It would probably be a good idea to document this
in the data file as well in the future.
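
As a rough illustration of such a tailoring, here is a minimal Python sketch.
It assumes that Python's str.casefold() is an acceptable stand-in for the
default full case folding, and it applies the two Turkic ('T') mappings from
CaseFolding.txt first. The names TURKIC_FOLD and turkic_casefold are
hypothetical, chosen only for this example.

```python
# Hypothetical tailoring: apply the two 'T' (Turkic) mappings from
# CaseFolding.txt, then fall back to Python's default full case folding.
TURKIC_FOLD = str.maketrans({
    "\u0049": "\u0131",  # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER DOTLESS I
    "\u0130": "\u0069",  # LATIN CAPITAL LETTER I WITH DOT ABOVE -> LATIN SMALL LETTER I
})


def turkic_casefold(s: str) -> str:
    """Case-fold s with the Turkic mappings taking precedence."""
    return s.translate(TURKIC_FOLD).casefold()


dotless = "\u0131"
uppercased = dotless.upper()  # U+0049 under the default mappings

# Default folding: the dotless i and its uppercasing fold differently.
default_pair = (dotless.casefold(), uppercased.casefold())

# Tailored folding: both fold to the dotless i.
turkic_pair = (turkic_casefold(dotless), turkic_casefold(uppercased))
```

With the default folding the pair is (U+0131, U+0069), which is the exception
described above; with the tailored folding both members are U+0131.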

(The UTC has reviewed the situation with the Turkish i's more times than I
care to remember. The unfortunate decision, some 75 years ago, to
distinguish dotted and dotless i in the Turkish orthography causes no end of
problems. I always have to look back at the table I made in
http://www.macchiato.com/slides/GlobalizationGotchas.ppt (slide 31) every
time I try to think about this topic!)

2. I don't think you're interpreting the stability clause correctly. What it
says is that if you have a string that is in NFKC form and contains only
characters from Unicode version X, then its casefold will remain stable in
versions after X.

You say:
> Now, toCasefold of the starting point, <U+1FB3, U+0304>, is <U+03B1 GREEK
> SMALL LETTER ALPHA, U+03B9 GREEK SMALL LETTER IOTA, U+0304>, while
> toCasefold of <1FB9, U+0399> is <U+1FB1 GREEK SMALL LETTER ALPHA WITH
> MACRON, U+03B9>. But the casefolded forms are not canonically equivalent!

> The problem here is that the definition of toCasefold() offers no hint that
> when U+0345 COMBINING GREEK YPOGEGRAMMENI, which may be hidden in a

But the sources you are starting with are not canonically equivalent:

toNFD(U+1FB3 U+0304) = U+03B1 U+0304 U+0345
toNFD(U+1FB9 U+0399) = U+0391 U+0304 U+0399

and Section 3.13 does explicitly say:

As described earlier, normally caseless matching should also use
normalization, which means using one of the following operations:
• A string X is a canonical caseless match for a string Y if and only if:
  NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
• A string X is a compatibility caseless match for a string Y if and only if:
  NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) =
  NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))
The invocations of normalization before folding in the above definitions are
to catch very infrequent edge cases. Normalization is not required before
folding, except for the character U+0345 and any characters that have it as
part of their decomposition, such as U+1FC3. In practice, optimized versions
of implementations can catch these special cases and, thereby, avoid an extra
normalization.

Thus it is already noted that in certain (unusual) cases, normalization
is required to preserve canonical equivalence.
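
To make the two definitions from Section 3.13 concrete, here is a minimal
Python sketch (an illustration, not text from the standard). It assumes that
str.casefold() and unicodedata.normalize() are acceptable stand-ins for
toCasefold() and the normalization forms; the function names are
hypothetical. It is applied to the two strings discussed above.

```python
import unicodedata


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def nfkd(s: str) -> str:
    return unicodedata.normalize("NFKD", s)


def canonical_caseless_match(x: str, y: str) -> bool:
    """NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))"""
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())


def compatibility_caseless_match(x: str, y: str) -> bool:
    """NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) compared for X and Y."""
    def key(s: str) -> str:
        return nfkd(nfkd(nfd(s).casefold()).casefold())
    return key(x) == key(y)


x = "\u1FB3\u0304"  # alpha with ypogegrammeni, combining macron
y = "\u1FB9\u0399"  # capital alpha with macron, capital iota

# Folding without the initial NFD step leaves the iota ahead of the macron
# in one result and behind it in the other, so the comparison fails.
folded_without_nfd_first = nfd(x.casefold()) == nfd(y.casefold())

# With the initial NFD step, U+0345 is already last when it is folded.
canonical = canonical_caseless_match(x, y)
compatibility = compatibility_caseless_match(x, y)
```

Here folded_without_nfd_first is False, while canonical and compatibility are
both True: normalizing before folding is what makes the two strings match.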

But you also note this at the end of your message, so I'm not sure what the
issue is.

If you could boil down the cases you think represent a problem to create a
minimal test case, it might be easier to see what the issue is.

Mark

On 6/8/06, Richard Wordingham <[email protected]> wrote:
> There appear to be bugs in the definition of the case-folding function
> toCasefold() as currently defined by
> http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt and
> Section 3.13 of TUS 4.1.0. (I am using the latter as I cannot find a more
> reliable draft of Section 3.13 in TUS 5.0.) This matters, for toCasefold()
> of NFKC strings valid in Unicode 5.0 is about to be frozen forever. Should
> these faults be made permanent?
>
> I have found two groups of NFKC grapheme clusters which fail to match their
> default uppercasings after conversion to NFD in one of the important
> 'case-insensitive' matching methods. I haven't reported these problems
> formally yet - I'd like to see what other people think first. It's
> conceivable that I'm the only person bothered by them.
>
> *Problem 1*
>
> The first is: <U+0131 LATIN SMALL LETTER DOTLESS I>
>
> The problem with this only occurs when using the default mappings. A
> different can of worms opens up for Turkic locales - I don't know whether
> the behaviour is fully defined for Turkic locales. This grapheme cluster
> is in all four normalised forms. According to
> http://www.unicode.org/Public/5.0.0/ucd/SpecialCasing-5.0.0d13.txt and
> http://www.unicode.org/Public/5.0.0/ucd/UnicodeData-5.0.0d11.txt , its
> uppercasing (in all locales) is
>
> <U+0049 LATIN CAPITAL LETTER I>
>
> which is in all four normal forms.
>
> To compare these strings for 'canonical caseless matches', one calculates
> NFD(toCasefold(NFD())) of the strings. By
> http://www.unicode.org/Public/5.0.0/ucd/CaseFolding-5.0.0d13.txt , their
> default casefoldings, whether simple or full, are <U+0131> and <U+0069
> LATIN SMALL LETTER I>. These are not canonically equivalent. QED.
>
> Incidentally, the definition of default casefolding contradicts the
> definition of casefolding given in TUS 4.1.0 Section 5.18.
>
> There are two alternative solutions:
> (a) Remove the upper- and title-casings for U+0131 from UnicodeData.txt and
> uncomment the Turkic data for U+0131 in SpecialCasing.dat, also making
> it apply to Azer(baijan)i.
> (b) Add two lines to SpecialCasing.dat:
>
> 0131; C; 0069; # LATIN SMALL LETTER DOTLESS I
> 0131; T; 0131; # LATIN SMALL LETTER DOTLESS I
>
> *Problem 2*
>
> The second group is probably much less troublesome, but is quite awkward.
> There are two plausible NFC and NFKC sequences
>
> <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0306 COMBINING
> BREVE>
>
> <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0304 COMBINING
> MACRON>
>
> The former might occur if one were using a breve (or brachy) to explicitly
> mark the lack of stress in polytonic Modern Greek, for example in
> explaining the meter of poetry. The second might occur if one decided to
> redundantly mark Classical Greek vowel length - the macron is redundant,
> for the subscript iota implies that the vowel is long. I don't have any
> examples of these combinations. I will work with the latter.
>
> Converted to NFD, it yields
>
> <U+03B1 GREEK SMALL LETTER ALPHA, U+0304, U+0345 COMBINING GREEK
> YPOGEGRAMMENI>
>
> The default uppercasing (not uncontroversial, but that's a linguistic
> matter) is
> <U+0391 GREEK CAPITAL LETTER ALPHA, U+0304, U+0399 GREEK CAPITAL LETTER
> IOTA>, whose NFC and NFKC form is
>
> <U+1FB9 GREEK CAPITAL LETTER ALPHA WITH MACRON, U+0399 GREEK CAPITAL LETTER
> IOTA>
>
> Now, the case-insensitive match whose outcome is guaranteed to be stable
> under the case-folding stability policy
> (http://www.unicode.org/standard/stability_policy.html) is given by
> toCasefold() of NFKC strings.
>
> Now, toCasefold of the starting point, <U+1FB3, U+0304>, is <U+03B1 GREEK
> SMALL LETTER ALPHA, U+03B9 GREEK SMALL LETTER IOTA, U+0304>, while
> toCasefold of <1FB9, U+0399> is <U+1FB1 GREEK SMALL LETTER ALPHA WITH
> MACRON, U+03B9>. But the casefolded forms are not canonically equivalent!
>
> The problem here is that the definition of toCasefold() offers no hint that
> when U+0345 COMBINING GREEK YPOGEGRAMMENI, which may be hidden in a
> precomposed form, is detached as U+03B9, it should be moved to after any
> immediately following characters of non-zero combining class (and
> characters that decompose solely to such - U+0F73 and U+0F75.)
> SpecialCasing.txt has at least an implication that such should be done
> when a U+0399 detaches itself, but I find it hard to read it as normative.
>
> This type of problem does not occur with NF(K)D strings - the combining
> class of U+0345 forces it to the end of the cluster. It is for this reason
> that the formal definitions of canonical and compatibility caseless matches
> use NFD and NFKD respectively.
>
> Richard.


This archive was generated by hypermail 2.1.5 : Fri Jun 09 2006 - 15:19:56 CDT