# Re: Tentative Definition of Casefolding

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Jun 11 2006 - 06:26:06 CDT

SADAHIRO Tomoyuki wrote on Sunday, June 11, 2006 4:03 AM

>> C: Draft form:
>> If uc(X) and uc(Y) are canonically equivalent, lc(X) and lc(Y) are
>> canonically equivalent, and tc(X) and tc(Y) are canonically equivalent,
>> so are f(X) and f(Y).
>
>> D: Draft form:
>> If X is the concatenation of X1 and X2, lc(X) is the concatenation of
>> lc(X1) and lc(X2), uc(X) is the concatenation of uc(X1) and uc(X2), and
>> tc(X) is the concatenation of tc(X1) and lc(X2) [N.B. lc, not tc!] then
>> f(X) is the concatenation of f(X1) and f(X2).
>
> Interesting. Do these drafts imply that if at least one of uppercase,
> lowercase, and titlecase is decomposed, then *all* of the cases must
> be decomposed?

No. The prime examples are the default casefoldings, both simple and full.
They derive from modifications of the corresponding default lowercasing,
uppercasing and titlecasing. The modifications are to exclude the simple
lowercasing of U+0130 to U+0069 and both the simple and full uppercasing and
titlecasing of U+0131 to U+0049. So far as I am aware, these two
casefoldings satisfy all of the properties C, D, K and S. The issues arise
with customised casing operations. A specific Turkic casefolding has been
defined, but not yet a specific Lithuanian casefolding.
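
As a rough illustration of the two exclusions, Python's built-in str.casefold(), which is documented as following the default case folding of the Unicode Standard, can be compared with upper(). This is only a sketch; the variable names are mine, and the exact behaviour depends on the Unicode data bundled with the interpreter.

```python
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I

# Default uppercasing maps dotless i to plain I ...
upper_of_dotless = DOTLESS_SMALL_I.upper()

# ... but the default folding does not identify dotless i with I.
dotless_folds_like_I = DOTLESS_SMALL_I.casefold() == "I".casefold()

# Likewise U+0130 is not folded to a bare U+0069.
dotted_folds_like_i = DOTTED_CAPITAL_I.casefold() == "i".casefold()

print(upper_of_dotless == "I", dotless_folds_like_I, dotted_folds_like_i)
```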

However, the current (4.1) and draft (5.0) practice is for the *case-folded*
form to use the decomposed form rather than to force case-folding to perform
difficult compositions. For example, the following admittedly implausible
combinations all case-fold to 0061 0053 0053 0053 0061:

a) 0061 0053 0053 0053 0061
b) 0061 00DF 0053 0061
c) 0061 0053 00DF 0061

It is much simpler to fold them all to form (a), and so doing does not
breach property D. Folding to either of the other two forms would breach
property D.
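
The point can be checked with a small sketch. It assumes Python's str.lower, str.upper and str.title as stand-ins for lc, uc and tc (str.title is only an approximation of tc), and str.casefold as the decomposing folding; note that Python folds to lower case, so the common result is spelled with U+0073. The function composing_fold is a hypothetical folding, invented here, that recomposes "ss" into U+00DF.

```python
def satisfies_d(x1, x2, fold):
    """Check property D for one split X = X1 + X2."""
    x = x1 + x2
    premises = (
        x.lower() == x1.lower() + x2.lower()
        and x.upper() == x1.upper() + x2.upper()
        and x.title() == x1.title() + x2.lower()  # N.B. lc, not tc
    )
    # Property D only constrains the fold when all premises hold.
    return (not premises) or fold(x) == fold(x1) + fold(x2)


def composing_fold(s):
    # Hypothetical folding that recomposes "ss" into U+00DF.
    return s.casefold().replace("ss", "\u00df")


# Forms (a), (b) and (c) from the list above.
forms = ["a\u0053\u0053\u0053a", "a\u00df\u0053a", "a\u0053\u00dfa"]
folded = {s.casefold() for s in forms}

# The decomposing fold is checked over every split of every form.
decomposed_ok = all(
    satisfies_d(s[:i], s[i:], str.casefold)
    for s in forms
    for i in range(len(s) + 1)
)

# One split is enough to show the composing fold breaching property D.
composed_ok = satisfies_d("aS", "SSa", composing_fold)

print(folded, decomposed_ok, composed_ok)
```

With the composing fold, the split "aS" + "SSa" places the sharp s differently in the whole and in the concatenated parts, which is the kind of breach meant above.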

> [CURRENT] code; lower; title; upper;
> 00DF; 00DF; 0053 0073; 0053 0053; # latin small sharp S
...

> [according to the PROPOSAL] code; lower; title; upper;
> 00DF; 0073 0073; 0053 0073; 0053 0053; # latin small sharp S
...

Not proposed at all!

>> K: Draft Form:
>> If uc(X) and uc(Y) are compatibility equivalent, lc(X) and lc(Y) are
>> compatibility equivalent, and tc(X) and tc(Y) are compatibility
>> equivalent, so are f(X) and f(Y).
>
> Do you mean "If X and Y are compatibility equivalent, then uc(X) and
> uc(Y) are compatibility equivalent, lc(X) and lc(Y) are compatibility
> equivalent, and tc(X) and tc(Y) are compatibility equivalent,
> so are f(X) and f(Y)." ?

No. All three conditions have to be satisfied.

> According to the draft form K (original), the case mappings of
> SQUARE MV MEGA (U+33B9) will be same as those of the sequence of
> Latin <M, V>, but the case mappings of SQUARE MV (U+33B7) are
> different.
> According to the draft form K implying "If X and Y are compatibility
> equivalent", the case mappings of both SQUARE MV MEGA and SQUARE MV
> will be same as those of the sequence of Latin <M, V>.
> (Note: we have square mV and MV but we don't have square mv and Mv.)

No. Casing operations do not modify any of the 'SQUARE' characters. Thus
lc(U+33B9) is not compatibility equivalent to lc(<M, V>), so their foldings
need not be compatibility equivalent. Similarly uc(U+33B7) is not
compatibility equivalent to lc(<M,V>), so their foldings need not be
compatibility equivalent.
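
A minimal sketch of that check, assuming Python's unicodedata module and treating two strings as compatibility equivalent when their NFKD forms are equal; the helper name compat_equivalent is mine:

```python
import unicodedata


def compat_equivalent(a, b):
    # Two strings are compatibility equivalent when their NFKD forms match.
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


SQUARE_MV_MEGA = "\u33b9"  # compatibility decomposition <M, V>
SQUARE_MV = "\u33b7"       # compatibility decomposition <m, V>
LATIN_MV = "MV"

# Casing leaves both SQUARE characters untouched.
unchanged = all(
    c.lower() == c and c.upper() == c for c in (SQUARE_MV_MEGA, SQUARE_MV)
)

# U+33B9 and <M, V> are equivalent before casing ...
mega_equiv_plain = compat_equivalent(SQUARE_MV_MEGA, LATIN_MV)
# ... but their lowercasings are not, so property K places no demand on them.
mega_equiv_lower = compat_equivalent(SQUARE_MV_MEGA.lower(), LATIN_MV.lower())
# Likewise for uc(U+33B7) against lc(<M, V>).
milli_equiv = compat_equivalent(SQUARE_MV.upper(), LATIN_MV.lower())

print(unchanged, mega_equiv_plain, mega_equiv_lower, milli_equiv)
```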

> But the case folding among SQUARE MV (millivolt), SQUARE MV MEGA
> (megavolt) and sequence <m, v> in the bicameral script has usefulness?
> In my opinion, case mappings respecting and/or preserving compatible
> equivalence are not good idea. Some compatibility decomposable characters
> will lost their meanings significantly through such a case folding.

Fortunately, the meanings of these characters are preserved, unlike the
ASCII '5mV' and '5MV'. Because I work in a Fortran environment, I
frequently encounter time allegedly in units of 'MS', and I don't think that
is intended to mean megasiemens!

One might want case-folding to preserve the compatibility equivalence of
U+0133 LATIN SMALL LIGATURE IJ and U+01C9 LATIN SMALL LETTER LJ. These
characters do undergo casing, and therefore I believe that if U+0049 LATIN
CAPITAL LETTER I were folded to U+0049, one should also fold U+0133 to
U+0132 LATIN CAPITAL LIGATURE IJ and U+01C9 to U+01C7 LATIN CAPITAL LETTER
LJ. It's beginning to look as though Lithuanian should be case-folded to
upper case!
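
To make the argument concrete, here is a sketch under stated assumptions: compatibility equivalence is tested by comparing NFKD forms, and keep_capital_i is a hypothetical folding, invented for illustration only, that behaves like the default folding except that i and I both end up as U+0049. It is not a proposed Lithuanian folding.

```python
import unicodedata


def nfkd(s):
    return unicodedata.normalize("NFKD", s)


SMALL_IJ = "\u0133"  # LATIN SMALL LIGATURE IJ, decomposes to <i, j>
SMALL_LJ = "\u01c9"  # LATIN SMALL LETTER LJ, decomposes to <l, j>


def preserves_equivalence(fold, ch):
    # Folding the character and folding its decomposition should
    # give compatibility-equivalent results.
    return nfkd(fold(ch)) == nfkd(fold(nfkd(ch)))


def keep_capital_i(s):
    # Hypothetical folding: default folding, except that i/I fold to capital I.
    return s.casefold().replace("i", "I")


# Folding everything to upper case keeps both equivalences.
upper_ok = all(preserves_equivalence(str.upper, c) for c in (SMALL_IJ, SMALL_LJ))

# The mixed folding turns <i, j> into <I, j>, which no folding of U+0133 matches.
mixed_ok = preserves_equivalence(keep_capital_i, SMALL_IJ)

print(upper_ok, mixed_ok)
```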

Richard.

This archive was generated by hypermail 2.1.5 : Sun Jun 11 2006 - 06:30:56 CDT