From: Richard Wordingham <richard.wordingham_at_ntlworld.com>

Date: Fri, 18 May 2012 17:11:46 +0100

On Thu, 17 May 2012 21:32:19 -0700

Markus Scherer <markus.icu_at_gmail.com> wrote:

> Ok, but assuming we didn't add 0FB2+0F71, why can't we add the

> contraction 0FB2+0F81 and have the 0334 and any other non-starter be

> handled via discontiguous matching?

Time for me to make a pronouncement on collation in FCD from my ivory

tower. First, I need some notation. For a string S, uni(S) is the

single character canonically equivalent to it. If there are multiple

such characters, uni(S) is selected arbitrarily but deterministically,

e.g. the first such character in code point order. If there is no such

character, the notation uni(S) is invalid. I am also assuming that the

set of contractions for use with normalisation is automatically

subjected to canonical closure.
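As a rough illustration of the notation, uni(S) could be computed by brute force along the following lines. This is only a sketch: the function name `uni` mirrors the notation above, the choice of "first in code point order" follows the tie-break suggested above, and the result depends on the Unicode version bundled with Python's `unicodedata` module rather than on Unicode 6.1.0.

```python
import unicodedata

def uni(s):
    """Return the first single character (in code point order) that is
    canonically equivalent to s, or None if no such character exists.

    Brute-force sketch: scans every code point, so each call is slow.
    """
    target = unicodedata.normalize("NFD", s)
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        # Two strings are canonically equivalent iff their NFD forms match.
        if unicodedata.normalize("NFD", ch) == target:
            return ch
    return None  # the notation uni(S) is invalid for this S

print(uni("\u0f71\u0f80") == "\u0f81")  # uni(<0F71,0F80>) is U+0F81
print(uni("\u0fb2\u0f71"))              # no single character: None
```

Returning None for the invalid case is a choice made for this sketch; raising an exception would serve equally well.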

Up to UCA 6.1.0 (UTS#10 Version 24), there are two modes of

contraction identification - contiguous and discontiguous. When

working with FCD strings rather than NFD strings (i.e. with

normalisation switched off), there are therefore various types of

contractions. 2-element contractions in FCD can be split into

contiguous and discontiguous contractions.

Given an NFD contraction A+B+C, uni(<A,B>)+C is a discontiguous FCD

contraction. If there is also an NFD contraction A+B, then

A+uni(<B,C>) is also a *discontiguous* FCD contraction. However, if

there is no NFD contraction A+B, then A+uni(<B,C>) is a *contiguous*

FCD contraction. It can only be applied to a subsequence <A,

uni(<B,C>)>, never to a subsequence <A, X, uni(<B,C>)>.
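The rule above for a three-element NFD contraction can be sketched in code. Everything here is illustrative: `fcd_pairs` and `build_uni_table` are hypothetical helper names, the contraction set is passed in as a plain Python set of strings, and `uni` is restricted to strings that are not already a single undecomposable character.

```python
import unicodedata

def build_uni_table():
    """Map each NFD decomposition to the first character that has it."""
    table = {}
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        nfd = unicodedata.normalize("NFD", ch)
        if nfd != ch:
            table.setdefault(nfd, ch)  # keep the lowest code point
    return table

UNI = build_uni_table()

def uni(s):
    """uni(S) for decomposable targets only; None if invalid."""
    return UNI.get(unicodedata.normalize("NFD", s))

def fcd_pairs(a, b, c, nfd_contractions):
    """Classify the 2-element FCD contractions derived from the NFD
    contraction a+b+c, following the rule stated in the text."""
    out = []
    ab = uni(a + b)
    if ab is not None:
        out.append((ab + c, "discontiguous"))
    bc = uni(b + c)
    if bc is not None:
        # Discontiguous only if a+b is itself an NFD contraction.
        kind = "discontiguous" if (a + b) in nfd_contractions else "contiguous"
        out.append((a + bc, kind))
    return out

contractions = {"\u0fb2\u0f71\u0f80"}
print(fcd_pairs("\u0fb2", "\u0f71", "\u0f80", contractions))
```

With only 0FB2+0F71+0F80 in the set, the sketch yields 0FB2+0F81 as a contiguous contraction; adding 0FB2+0F71 to the set turns it into a discontiguous one.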

For example, in DUCET 6.1.0 (and earlier), there is an NFD contraction

0FB2+0F71+0F80, but no contraction 0FB2+0F71. Consequently,

0FB2+uni(<0F71,0F80>), i.e. 0FB2+0F81, although listed in the DUCET 6.1.0

file allkeys.txt, is only a *contiguous* FCD contraction. Therefore it

has no effect on the collation of <0FB2, 0334, 0F81>.
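To see why this string matters, it may help to check that <0FB2, 0334, 0F81> really is FCD while not being NFD. The following is a minimal sketch of an FCD test, assuming the usual definition (the trailing combining class of each character's decomposition must not exceed the leading combining class of the next character's decomposition, unless the latter is 0). The names `is_fcd` and `lead_trail_ccc` are my own, and the data come from Python's bundled Unicode version.

```python
import unicodedata

def lead_trail_ccc(ch):
    """ccc of the first and last characters of ch's canonical decomposition."""
    d = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(d[0]), unicodedata.combining(d[-1])

def is_fcd(s):
    """Return True if s is in FCD form (sketch, not an optimised check)."""
    prev_trail = 0
    for ch in s:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

s = "\u0fb2\u0334\u0f81"
print(is_fcd(s))                      # True: ccc sequence 0, 1, then 129..130
print(is_fcd("\u0fb2\u0f81\u0334"))   # False: trailing 130 precedes ccc 1
nfd = unicodedata.normalize("NFD", s)
print([f"{ord(c):04X}" for c in nfd]) # ['0FB2', '0334', '0F71', '0F80']
```

So the FCD string keeps 0F81 intact with 0334 interposed, which is exactly the case a contiguous-only contraction 0FB2+0F81 cannot reach.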

Blocking also changes subtly when one proceeds from NFD to FCD. In

NFD, B blocks C if and only if ccc(B) = ccc(C), ccc(B) = 0, or

ccc(C) = 0. Equivalently, B does *not* block C if and only if B and C are

distinct and <B,C> and <C,B> are canonically equivalent. For FCD, we

must use the latter definition. Additionally, the concept is only

defined if <B,C> is FCD. Determining this in the general case is not

quick. However, for Unicode 6.1.0 this can be greatly simplified by

replacing the ccc look-up function by eccc where:

eccc(uni(<0F71,x>)) = ccc(x)

eccc(x) = ccc(x) otherwise

and then using the first form of the NFD definition of blocking. This

simplification could be defeated by the addition of new non-singleton

decompositions for characters with non-zero ccc, which Unicode is about

to promise not to add, or the addition of new characters with the same

canonical combining class as U+0F71.
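The two forms of the blocking test can be compared directly. The sketch below assumes Python's `unicodedata` tables (not necessarily Unicode 6.1.0); `eccc`, `blocks_eccc` and `blocks_by_equivalence` are names invented here, and `eccc` recognises uni(<0F71,x>) by looking for a two-element canonical decomposition that starts with 0F71.

```python
import unicodedata

def eccc(ch):
    """Effective ccc: ccc(x) for uni(<0F71,x>), plain ccc otherwise."""
    parts = unicodedata.decomposition(ch).split()
    # Canonical decompositions carry no "<tag>", so parts[0] is a hex code.
    if len(parts) == 2 and parts[0] == "0F71":
        return unicodedata.combining(chr(int(parts[1], 16)))
    return unicodedata.combining(ch)

def blocks_eccc(b, c):
    """First form of the NFD definition, with eccc in place of ccc."""
    return eccc(b) == eccc(c) or eccc(b) == 0 or eccc(c) == 0

def blocks_by_equivalence(b, c):
    """Second form: B does not block C iff B != C and <B,C> ~ <C,B>."""
    nfd = lambda s: unicodedata.normalize("NFD", s)
    return not (b != c and nfd(b + c) == nfd(c + b))

# Pairs <B,C> chosen so that each is FCD.
for b, c in [("\u0334", "\u0f81"), ("\u0f81", "\u0f72"),
             ("\u0f81", "\u0f74"), ("a", "\u0301")]:
    print(f"{ord(b):04X} {ord(c):04X}",
          blocks_eccc(b, c), blocks_by_equivalence(b, c))
```

Note that plain ccc(0F81) is 0, so the unmodified first form would wrongly report that 0334 blocks 0F81; with eccc(0F81) = 130 the two forms agree on these pairs.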

The nasty complication is determining what the contractions for FCD

processing are from the contractions for NFD processing. Sometimes a

finite set for NFD processing expands to an infinite set for FCD

processing.

Richard.

Received on Fri May 18 2012 - 11:16:57 CDT
