Re: What are the issues in having U+FB06 fold to U+FB05?

From: Ken Whistler <kenw_at_sybase.com>
Date: Wed, 06 Jul 2011 15:23:41 -0700

On 7/6/2011 1:40 PM, Mark Davis ☕ wrote:
>
> The other two are special cases; they casefold together because of the
> way that the full case mapping is computed. Their equivalence is
> normally captured by a canonical-equivalent folding. Because the simple
> folding is only codepoint by codepoint, and only resulting in single
> code points, they can't be added.
>
> I didn't understand the sentence above. But would it be fair to
> say that a plausible case could be made for FB06 folding to FB05
> simply, but that there really shouldn't be a simple fold for the
> other two cases?
>
>
> Yes, that's what I mean. You can propose all three if you want, via
> the reporting form, but I think only #1 is a real possibility (IMO).

For those following along (or not), this has to do with entries in
CaseFolding.txt. The current relevant sections of CaseFolding.txt are:

FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST

0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1FD3; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA

03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1FE3; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
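
As a quick check of those F entries, here is a minimal Python sketch. It
assumes that Python's built-in str.casefold() performs full case folding
using the Unicode data bundled with the interpreter; the helper names
hex_seq and full_fold are made up for this illustration.

```python
# The three pairs of code points from the CaseFolding.txt excerpt.
PAIRS = [(0xFB05, 0xFB06), (0x0390, 0x1FD3), (0x03B0, 0x1FE3)]

def hex_seq(s):
    """Render a string as space-separated 4-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)

def full_fold(cp):
    """Full case folding of a single code point (hypothetical helper)."""
    return chr(cp).casefold()

for a, b in PAIRS:
    # Each pair should print the same folded sequence on both sides.
    print(f"{a:04X} -> {hex_seq(full_fold(a))} | {b:04X} -> {hex_seq(full_fold(b))}")
```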

What Karl is suggesting amounts to updating those entries to:

FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T
FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST

0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
1FD3; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA

03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
1FE3; S; 03B0; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
1FE3; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA

Note that I think the plausible simple folding for the first group is
FB05 *to* FB06, not vice versa.
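
To make the effect of such an S entry concrete, here is a rough Python
sketch of a reader for CaseFolding.txt-style lines. It is only an
illustration: parse and fold are hypothetical helpers, the table holds just
the three proposed FB05/FB06 lines, and it assumes the usual reading of the
status field (simple folding uses C and S entries, full folding uses C and
F entries).

```python
PROPOSED = """\
FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T
FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST
"""

def parse(text):
    """Return (simple, full) folding tables keyed by code point."""
    simple, full = {}, {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        target = "".join(chr(int(h, 16)) for h in mapping.split())
        if status in ("C", "S"):
            simple[int(code, 16)] = target
        if status in ("C", "F"):
            full[int(code, 16)] = target
    return simple, full

def fold(s, table):
    """Fold code point by code point; unmapped characters pass through."""
    return "".join(table.get(ord(ch), ch) for ch in s)

simple, full = parse(PROPOSED)
print(fold("\ufb05", simple) == fold("\ufb06", simple))  # simple fold now matches
print(fold("\ufb05", full) == fold("\ufb06", full))      # full fold already did
```

Without the S line, the simple table would be empty and FB05 and FB06 would
each fold to themselves.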

As for the other two, taking the 0390/1FD3 pair as the example
we would have, currently, for simple case folding:

simpleCaseFold(0390) = 0390
simpleCaseFold(1FD3) = 1FD3

simpleCaseFold(NFD(0390)) = 03B9 0308 0301
simpleCaseFold(NFD(1FD3)) = 03B9 0308 0301

and for full case folding:

CaseFold(0390) = 03B9 0308 0301
CaseFold(1FD3) = 03B9 0308 0301

CaseFold(NFD(0390)) = 03B9 0308 0301
CaseFold(NFD(1FD3)) = 03B9 0308 0301

In all of these instances, because 1FD3 is canonically equivalent to 0390,
the results of the folding are canonically equivalent. While there might
not be any actual prohibition against adding a simple case folding of 1FD3
to 0390 explicitly in CaseFolding.txt, I don't see that it buys anybody
anything. This is roughly the same problem as, for example:

simpleCaseFold(00E1) = 00E1
simpleCaseFold(0061 0301) = 0061 0301

simpleCaseFold(NFD(00E1)) = 0061 0301
simpleCaseFold(NFD(0061 0301)) = 0061 0301

and noting that the results of the simpleCaseFold of those two different
sources are canonically equivalent, even if you don't do the normalization
before the case folding. An application which is doing case folding, but
which isn't checking for canonical equivalence is kinda out to lunch,
anyway, as this example demonstrates.
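
A minimal Python sketch of that point, using the standard unicodedata
module. The function names are invented for this illustration, and the
second one follows the NFD / fold / NFD pattern under the assumption that
str.casefold() performs full case folding:

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize("NFD", s)

def canon_equal(a, b):
    """True if a and b are canonically equivalent (same NFD form)."""
    return nfd(a) == nfd(b)

def canonical_caseless_match(a, b):
    """Compare after normalizing, case folding, and normalizing again."""
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())

# A raw comparison of the folded strings misses the equivalence ...
print("\u00e1".casefold() == "a\u0301".casefold())
# ... while a comparison that checks canonical equivalence does not.
print(canon_equal("\u00e1".casefold(), "a\u0301".casefold()))
print(canonical_caseless_match("\u0390", "\u1fd3"))
```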

So while I don't quite understand Mark's claim that "they can't be added", I
would say that I agree at least that I don't see any point to adding them.

I'm not sure whether the FB05/FB06 instance is important enough to add
or not. Neither of those compatibility ligatures should ordinarily be used
in text, anyway, and it is hard to see that an algorithmic neatness argument
buys much here in the way of actual utility.

--Ken
Received on Wed Jul 06 2011 - 17:27:52 CDT
