Tentative Definition of Casefolding

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sat Jun 10 2006 - 18:09:15 CDT

  • Next message: SADAHIRO Tomoyuki: "Re: Tentative Definition of Casefolding"

    Michael Everson wrote on Saturday, June 10, 2006 at 12:46 AM
    Subject: Re: More Permanent Faults? - Unicode 5.0 Casefolding

    > At 00:10 +0100 2006-06-10, Richard Wordingham wrote:

    >>As uppercasing, titlecasing and lowercasing may be regarded as
    >>relationships on strings, I came to the conclusion that a casefolding was
    >>an idempotent function on strings that generated the equivalence class
    >>that is the equivalence class generated by the three casing functions.

    > Quite honestly, I think you ought to re-write this sentence in English
    > which is comprehensible.

    Here goes.

    I came to the conclusion that a casefolding could be defined as follows.
    Firstly, partition the set of all strings as finely as possible into
    non-overlapping sets such that the uppercasing, lowercasing and titlecasing
    of a string are all in the same set as the original string. Then a
    casefolding is a function that maps all the members of each set to a member
    of that set.

    An equivalent, alternative formulation is:

    Given an uppercasing function uc, a lowercasing function lc, and a
    titlecasing function tc on strings, then a casefolding function is a
    function f for which:

    A: f(X) = f(uc(X)) = f(lc(X)) = f(tc(X)) = f(f(X)) for all strings X, and
    B: f(X) = f(Y) only if they are equal for all functions satisfying condition
    A.

    Further Comments:

    As far as I am aware, repeating the sequence uppercase, lowercase until the
    string changes no more will provide a casefolding. I haven't investigated
    pathological systems where this process does not converge. The Lithuanian
    locale comes closest, with U+029D LATIN SMALL LETTER J WITH CROSSED-TAIL
    followed by ten dots above taking 10 cycles to converge to plain U+029D.

    There are other useful properties that a casefolding should have, but I'm
    not sure whether I have quite captured them or indeed whether there will
    always be such a casefolding. I am assuming that the purpose of casefolding
    is simply to strip out casing information, making as little other change as
    possible.

    C: Unachievable Target:
    If X and Y are canonically equivalent, so are f(X) and f(Y). This can fail
    because one of the casing operations does not preserve canonical
    equivalence.

    C: Draft form:
    If uc(X) and uc(Y) are canonically equivalent, lc(X) and lc(Y) are
    canonically equivalent, and tc(X) and tc(Y) are canonically equivalent, so
    are f(X) and f(Y).

    D: Unachievable Target:
    If X is the concatenation of X1 and X2, then f(X) is the concatenation of
    f(X1) and f(X2). This fails with context sensitive casing. I need
    something like it to generally stop 'dodo' casefolding to 'doDO'.

    D: Draft form:
    If X is the concatenation of X1 and X2, lc(X) is the concatenation of lc(X1)
    and lc(X2), uc(X) is the concatenation of uc(X1) and uc(X2), and tc(X) is
    the concatenation of tc(X1) and lc(X2) [N.B. lc, not tc!] then f(X) is the
    concatenation of f(X1) and f(X2).

    K: Unachievable Target:
    If X and Y are compatibility equivalent, so are f(X) and f(Y). This is
    unachievable for all the reasons applicable to property (C).

    K: Draft Form:
    If uc(X) and uc(Y) are compatibility equivalent, lc(X) and lc(Y) are
    compatibility equivalent, and tc(X) and tc(Y) are compatibility equivalent,
    so are f(X) and f(Y).

    S: f(X) should not be more than the three times the size of X, whether
    measured in codepoints, UTF-8 code units or UTF-16 code units.

    Condition S would have some implications for Lithuanian casing **modified**
    to make it preserve canonical equivalence. As U+1E2E LATIN CAPITAL LETTER I
    WITH DIAERESIS AND ACUTE would then lowercase to <U+0069, U+0307, U+0308,
    U+0301>, it would have to fold to <U+0049 LATIN CAPITAL LETTER I, U+0308,
    U+0301>, with many ramifications. (This was the issue that got me
    interested in case-folding.) However, does anyone here know how Lithuanians
    actually write a diaeresis on the letter 'i'?

    Richard.



    This archive was generated by hypermail 2.1.5 : Sat Jun 10 2006 - 18:17:02 CDT