Re: Tentative Definition of Casefolding

From: Richard Wordingham (
Date: Sun Jun 11 2006 - 10:59:27 CDT


    Philippe Verdy wrote on Sunday, June 11, 2006 at 12:34 PM

    > From: "Richard Wordingham"

    >> C: Unachievable Target:
    >> If X and Y are canonically equivalent, so are f(X) and f(Y). This can
    >> fail because one of the casing operations does not preserve canonical
    >> equivalence.

    > Then this looks like a bug in this particular casing operation. Any casing
    > function should preserve canonical equivalences.

    That's what I thought. While I'm still awaiting confirmation, it appears not
    to be the case.
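
    A small check of this, as a sketch only: it assumes Python's built-in
    str.upper() as one implementation of the default full uppercase mapping,
    and the helper name canonically_equivalent is made up for illustration.
    The two inputs differ only in the order of the ypogegrammeni (U+0345) and
    the acute accent (U+0301), so they are canonically equivalent.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# alpha + ypogegrammeni + acute, and alpha + acute + ypogegrammeni
x = "\u03b1\u0345\u0301"
y = "\u03b1\u0301\u0345"

# str.upper() maps U+0345 (combining class 240) to U+0399 (combining class 0),
# so the acute ends up on the iota in one result and on the alpha in the other.
upper_x = x.upper()
upper_y = y.upper()

print(canonically_equivalent(x, y))              # True
print(canonically_equivalent(upper_x, upper_y))  # False
```

    Other casing implementations may behave differently; this only shows that
    a straightforward per-character mapping is enough to lose the equivalence.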

    > If not, it's because it's incorrectly implemented.

    Do you mean 'defined'?

    > For example a casing operation on YPOGEGRAMENI that transforms it from a
    > combining character into a non-combining iota letter in some casing but
    > keeps it in another casing is unsafe or incorrectly implemented or
    > insufficiently specified (that's where I would suggest updating the rule
    > to require that the YPOGEGRAMENI (if encoded separately) should behave as
    > if it was combined with the previous character, meaning that it may need
    > to be reordered before some other diacritics, prior to applying the case
    > mapping).

    The whole point is that the ypogegrammeni needs to be detached and shunted
    to *after* everything that does not have a character of combining class zero
    in its full canonical decomposition. Converting to NFD does this reliably.
    Full uppercasing and casefolding would then look something like this:

    R1. Zero the subscript iota count, and empty the output string.
    R2. For each character in sequence, perform steps R3 to R5.
    R3. If the character is of combining class zero and is not U+0F73 or U+0F75,
    append as many capital iotas (for uppercasing) or small iotas (for case
    folding) to the output string as the subscript iota count indicates, then
    zero the subscript iota count.
    R4. If the current character is U+0345, increment the subscript iota count
    and process the next character.
    R5. Look up the current character's casing, which may depend on its context
    within the input string. If the casing translation consists of more than
    one character and the last character is a plain iota (U+0399, U+03B9 or
    U+1FBE), strip it off the translation and increment the subscript iota
    count. Append the possibly modified translation to the output string.
    Process the next character.
    R6. After the last character, append as many capital iotas (for uppercasing)
    or small iotas (for case folding) to the output string as the subscript
    iota count indicates.
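
    Steps R1 to R6 can be sketched in Python roughly as follows. This is an
    illustration under assumptions, not a reference implementation: the
    function name full_case and its fold parameter are invented here, the
    per-character lookup in R5 uses str.upper() and str.casefold() and so
    ignores any context-dependent or locale-specific casing, and the input is
    processed as given rather than being converted to NFD first.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"
PLAIN_IOTAS = {"\u0399", "\u03b9", "\u1fbe"}
# Combining class zero, but excluded by step R3.
CCC_ZERO_EXCEPTIONS = {"\u0f73", "\u0f75"}


def full_case(text: str, fold: bool = False) -> str:
    """Uppercase (or case-fold, if fold=True) text, deferring subscript iotas."""
    iota = "\u03b9" if fold else "\u0399"
    pending = 0  # R1: subscript iota count
    out = []     # R1: output string, built as a list of pieces
    for ch in text:  # R2
        # R3: flush the deferred iotas before the next qualifying class-zero character.
        if unicodedata.combining(ch) == 0 and ch not in CCC_ZERO_EXCEPTIONS:
            out.append(iota * pending)
            pending = 0
        # R4: a ypogegrammeni is only counted, not emitted yet.
        if ch == YPOGEGRAMMENI:
            pending += 1
            continue
        # R5: context-free lookup; strip a trailing plain iota and count it instead.
        mapped = ch.casefold() if fold else ch.upper()
        if len(mapped) > 1 and mapped[-1] in PLAIN_IOTAS:
            mapped = mapped[:-1]
            pending += 1
        out.append(mapped)
    out.append(iota * pending)  # R6: flush whatever is still pending
    return "".join(out)


# Three canonically equivalent spellings of alpha + ypogegrammeni + acute.
for s in ("\u03b1\u0345\u0301", "\u03b1\u0301\u0345", "\u1fb3\u0301"):
    print([f"U+{ord(c):04X}" for c in full_case(s)])
```

    With this sketch all three spellings uppercase to the same sequence,
    U+0391 U+0301 U+0399, because the iota is held back until the accent has
    been written out.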

    I hope we don't find any more non-zero combining class characters
    case-converting to combining class zero characters.

    > As a consequence, if Y1=uc(X1) and Y2=uc(X2) then it is not guaranteed
    > that Y1+Y2=uc(X1)+uc(X2) will be canonically equivalent to uc(X1+X2). But
    > the concatenation operation is already known for not preserving the
    > canonical equivalences, notably when it is used on defective operands.

    I'm not sure what you are saying here. If A1 and B1 are canonically
    equivalent and A2 and B2 are canonically equivalent, then A1+A2 and B1+B2
    are canonically equivalent. Are you claiming there is a counter-example?
    Concatenation does not preserve normalisation forms.
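
    To make the distinction concrete, here is a small Python sketch using the
    standard unicodedata module. The sample strings are my own choice, and one
    passing example is of course an illustration rather than a proof of the
    general claim about canonical equivalence.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


# A1 ~ B1 (precomposed vs decomposed a-ring), A2 ~ B2 (two marks in either order).
a1, b1 = "\u00e5", "a\u030a"
a2, b2 = "\u0301\u0323", "\u0323\u0301"
assert nfd(a1) == nfd(b1) and nfd(a2) == nfd(b2)

# Canonical equivalence survives concatenation, even with defective operands.
concat_equivalent = nfd(a1 + a2) == nfd(b1 + b2)

# Normalisation form does not: each piece is in NFC, the concatenation is not.
left, right = "a", "\u0301"
both_nfc = nfc(left) == left and nfc(right) == right
concat_is_nfc = nfc(left + right) == left + right

print(concat_equivalent, both_nfc, concat_is_nfc)  # True True False
```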

    > Applying a case mapping on an isolated YPOGEGRAMENI looks like a
    > pathological (unsafe) case, because it's a defective combining sequence.

    Yes, but consider titlecasing 'ffrench' in the appropriate English locale!
    In some traditions it titlecases to 'ffrench', not 'Ffrench'. Casing
    contexts are not entirely restricted to *default* grapheme clusters, just
    the ones mentioned in TUS.

