Re: Tentative Definition of Casefolding

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Jun 11 2006 - 10:59:27 CDT

Next message: Russell Shaw: "PDFs of Unicode Standard Annex"

Previous message: Shariqul Islam Azad - Omi: "RE: Yahoo groups support for Unicode"
In reply to: Philippe Verdy: "Re: Tentative Definition of Casefolding"
Next in thread: Philippe Verdy: "Re: Tentative Definition of Casefolding"
Reply: Philippe Verdy: "Re: Tentative Definition of Casefolding"

Philippe Verdy wrote on Sunday, June 11, 2006 at 12:34 PM

> From: "Richard Wordingham" richard.wordingham@ntlworld.com

>> C: Unachievable Target:
>> If X and Y are canonically equivalent, so are f(X) and f(Y). This can
>> fail because one of the casing operations does not preserve canonical
>> equivalence.

> Then this looks like a bug in this particular casing operation. Any casing
> function should preserve canonical equivalences.

That's what I thought. While I'm still awaiting confirmation, it appears not
to be the case.
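
A minimal Python sketch of the kind of failure in question, assuming Python's
built-in str.upper() as the casing operation f (it applies the Unicode case
mappings character by character, with no reordering). The helper name
canonically_equivalent is made up for this illustration; it compares NFD
forms.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Treat two strings as canonically equivalent iff their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Small alpha with ypogegrammeni (ccc 240) and acute (ccc 230), in both orders.
x = "\u03B1\u0345\u0301"
y = "\u03B1\u0301\u0345"

# The inputs differ only in the order of two marks with different combining
# classes, so canonical reordering makes them equal.
print(canonically_equivalent(x, y))

# Uppercasing turns U+0345 into U+0399, a combining class zero letter, so the
# acute now sits on a different side of the iota in each result.
print(canonically_equivalent(x.upper(), y.upper()))
print([f"U+{ord(c):04X}" for c in x.upper()])
print([f"U+{ord(c):04X}" for c in y.upper()])
```

Once the ypogegrammeni has become a capital iota, canonical reordering can no
longer move the acute across it, which is why the two results drift apart.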

> If not, it's because it's incorrectly implemented.

Do you mean 'defined'?

> For example a casing operation on YPOGEGRAMMENI that transforms it from a
> combining character into a non-combining iota letter in some casing but
> keeps it in another casing is unsafe or incorrectly implemented or
> insufficiently specified (that's where I would suggest updating the rule
> to require that the YPOGEGRAMMENI (if encoded separately) should behave as
> if it was combined with the previous character, meaning that it may need to
> be reordered before some other diacritics, prior to applying the case
> mapping).

The whole point is that the ypogegrammeni needs to be detached and shunted
to *after* everything that does not have a character of combining class zero
in its full canonical decomposition. Converting to NFD does this reliably.
Full uppercasing and casefolding would then look something like this:

R1. Zero the subscript iota count, and empty the output string.
R2. For each character in sequence, perform steps R3 to R5.
R3. If the character is of combining class zero and is not U+0F73 or U+0F75,
add subscript iota count capital iotas (for uppercasing) or subscript iota
count small iotas (for case folding) to the output string and zero the
subscript iota count.
R4. If the current character is U+0345, increment the subscript iota count
and process the next character.
R5. Look up the current character's casing, which may depend on its context
within the input string. If the casing translation consists of more than
one character and the last character is a plain iota (U+0399, U+03B9 or
U+1FBE), strip it off the translation and increment the subscript iota
count. Append the possibly modified translation to the output string.
Process the next character.
R6. Add subscript iota count capital iotas (for uppercasing) or subscript
iota count small iotas (for case folding) to the output string.
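
Steps R1 to R6 can be sketched in Python roughly as follows. This is only an
illustration under stated assumptions: the function name full_case and its
fold parameter are invented here, the per-character lookup uses Python's
str.upper() and str.casefold(), and the context-dependent (locale-specific)
casing allowed for in R5 is not handled.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"
PLAIN_IOTAS = {"\u0399", "\u03B9", "\u1FBE"}
# Combining class zero characters that R3 excludes by name.
R3_EXCEPTIONS = {"\u0F73", "\u0F75"}


def full_case(text: str, fold: bool = False) -> str:
    """Uppercase (or case fold, if fold=True) following steps R1 to R6."""
    iota = "\u03B9" if fold else "\u0399"
    convert = str.casefold if fold else str.upper

    pending = 0  # R1: subscript iota count
    out = []     # R1: output string, built as a list of pieces

    for ch in text:  # R2
        # R3: a combining class zero character flushes the pending iotas.
        if unicodedata.combining(ch) == 0 and ch not in R3_EXCEPTIONS:
            out.append(iota * pending)
            pending = 0

        # R4: hold back a ypogegrammeni instead of emitting it.
        if ch == YPOGEGRAMMENI:
            pending += 1
            continue

        # R5: context-free lookup (an assumption of this sketch); a trailing
        # plain iota in a multi-character mapping is held back as well.
        mapped = convert(ch)
        if len(mapped) > 1 and mapped[-1] in PLAIN_IOTAS:
            mapped = mapped[:-1]
            pending += 1
        out.append(mapped)

    out.append(iota * pending)  # R6
    return "".join(out)


# Three canonically equivalent spellings of alpha + acute + iota subscript.
for s in ("\u03B1\u0345\u0301", "\u03B1\u0301\u0345", "\u1FB3\u0301"):
    print([f"U+{ord(c):04X}" for c in full_case(s)])
```

In this sketch the iota is always emitted after the other marks that follow
the base letter, so the three spellings above come out as the same string.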

I hope we don't find any more non-zero combining class characters
case-converting to combining class zero characters.

> As a consequence, if Y1=uc(X1) and Y2=uc(X2) then it is not guaranteed
> that Y1+Y2=uc(X1)+uc(X2) will be canonically equivalent to uc(X1+X2). But
> the concatenation operation is already known for not preserving the
> canonical equivalences, notably when it is used on defective operands.

I'm not sure what you are saying here. If A1 and B1 are canonically
equivalent and A2 and B2 are canonically equivalent, then A1+A2 and B1+B2
are canonically equivalent. Are you claiming there is a counter-example?
Concatenation does not preserve normalisation forms.
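
A small Python sketch of the distinction, using only the standard
unicodedata module. The helper names is_nfc and canonically_equivalent are
made up for this illustration, and the strings are just one convenient
example.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """True if s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s


def canonically_equivalent(a: str, b: str) -> bool:
    """Treat two strings as canonically equivalent iff their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Normalisation form is not preserved: both operands are in NFC, but the
# concatenation is not, because "a" + U+0301 composes to U+00E1.
left, right = "a", "\u0301"
print(is_nfc(left), is_nfc(right), is_nfc(left + right))

# Canonical equivalence is preserved: A1 ~ B1 and A2 ~ B2, and the
# concatenations A1+A2 and B1+B2 are still canonically equivalent.
a1, b1 = "a\u0301", "\u00E1"  # decomposed vs precomposed a-acute
a2, b2 = "\u0323", "\u0323"   # a defective operand: lone combining dot below
print(canonically_equivalent(a1, b1), canonically_equivalent(a2, b2))
print(canonically_equivalent(a1 + a2, b1 + b2))
```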

> Applying a case mapping on an isolated YPOGEGRAMMENI looks like a
> pathological (unsafe) case, because it's a defective combining sequence.

Yes, but consider titlecasing 'ffrench' in the appropriate English locale!
In some traditions it titlecases to 'ffrench', not 'Ffrench'. Casing
contexts are not entirely restricted to *default* grapheme clusters, just
the ones mentioned in TUS.

Richard.


This archive was generated by hypermail 2.1.5 : Sun Jun 11 2006 - 11:18:42 CDT