Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Fri, 18 May 2012 17:11:46 +0100

On Thu, 17 May 2012 21:32:19 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> Ok, but assuming we didn't add 0FB2+0F71, why can't we add the
> contraction 0FB2+0F81 and have the 0334 and any other non-starter be
> handled via discontiguous matching?

Time for me to make a pronouncement on collation in FCD from my ivory
tower. First, I need some notation. For a string S, uni(S) is the
single character canonically equivalent to it. If there are multiple
such characters, uni(S) is selected arbitrarily but deterministically,
e.g. the first such character in code point order. If there is no such
character, the notation uni(S) is invalid. I am also assuming that the
set of contractions for use with normalisation is automatically
subjected to canonical closure.
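
As a rough illustration of the uni(S) notation, here is a minimal Python
sketch. It assumes the Unicode data bundled with Python's standard
unicodedata module (whose version depends on the Python release, not
necessarily 6.1.0), and it returns None where the notation would be
invalid. The function name uni is taken from the notation above; the
brute-force scan is just one convenient way to realise "first such
character in code point order".

```python
import unicodedata

def uni(s):
    """Return the first single character (in code point order) that is
    canonically equivalent to s, or None if no such character exists."""
    target = unicodedata.normalize("NFD", s)
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not characters
        ch = chr(cp)
        # Two strings are canonically equivalent iff their NFD forms match.
        if unicodedata.normalize("NFD", ch) == target:
            return ch
    return None

print(hex(ord(uni("\u0f71\u0f80"))))  # 0F71+0F80 composes to 0F81
print(hex(ord(uni("A\u030a"))))       # both 00C5 and 212B qualify; 00C5 comes first
print(uni("qx"))                      # no single-character equivalent
```

The full scan is slow but keeps the sketch short; a real implementation
would presumably build a reverse decomposition table once.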

Up to UCA 6.1.0 (UTS#10 Version 24), there are two modes of
contraction identification - contiguous and discontiguous. When
working with FCD strings rather than NFD strings (i.e. with
normalisation switched off), there are therefore various types of
contractions. 2-element contractions in FCD can be split into
contiguous and discontiguous contractions.

Given an NFD contraction A+B+C, uni(<A,B>)+C is a discontiguous FCD
contraction. If there is also an NFD contraction A+B, then
A+uni(<B,C>) is also a *discontiguous* FCD contraction. However, if
there is no NFD contraction A+B, then A+uni(<B,C>) is a *contiguous*
FCD contraction. It can only be applied to a subsequence <A,
uni(<B,C>)>, never to a subsequence <A, X, uni(<B,C>)>.

For example, in DUCET 6.1.0 (and earlier), there is an NFD contraction
0FB2+0F71+0F80, but no contraction 0FB2+0F71. Consequently,
0FB2+uni(<0F71,0F80>), i.e. 0FB2+0F81, although listed in the DUCET 6.1.0
file allkeys.txt, is only a *contiguous* FCD contraction. Therefore it
has no effect on the collation of <0FB2, 0334, 0F81>.
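
The classification rule for 3-element contractions can be sketched in
Python as follows. This is only an illustration under stated
assumptions: the helper names build_uni_table and fcd_variants are
hypothetical, contractions are modelled as tuples of single characters,
the input set stands in for the NFD contractions of a collation table,
and the Unicode data is whatever Python's unicodedata module ships.

```python
import unicodedata

def build_uni_table():
    """Map each NFD string to the first character (in code point order)
    whose canonical decomposition is that string."""
    table = {}
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates
        ch = chr(cp)
        nfd = unicodedata.normalize("NFD", ch)
        if nfd != ch:
            table.setdefault(nfd, ch)  # keep the lowest code point
    return table

UNI = build_uni_table()

def uni(s):
    """uni(S) for a multi-character S, or None if it does not exist."""
    return UNI.get(unicodedata.normalize("NFD", s))

def fcd_variants(nfd_contractions):
    """For each NFD contraction A+B+C, derive the 2-element FCD
    contractions and label each as contiguous or discontiguous."""
    result = {}
    for seq in nfd_contractions:
        if len(seq) != 3:
            continue
        a, b, c = seq
        ab = uni(a + b)
        if ab is not None:
            result[(ab, c)] = "discontiguous"
        bc = uni(b + c)
        if bc is not None:
            # Discontiguous only if A+B is itself an NFD contraction.
            has_ab = (a, b) in nfd_contractions
            result[(a, bc)] = "discontiguous" if has_ab else "contiguous"
    return result

# The DUCET 6.1.0 situation described above: 0FB2+0F71+0F80 without 0FB2+0F71.
ducet_like = {("\u0fb2", "\u0f71", "\u0f80")}
print(fcd_variants(ducet_like))

# Hypothetical tailoring that also adds 0FB2+0F71.
tailored = ducet_like | {("\u0fb2", "\u0f71")}
print(fcd_variants(tailored))
```

With only the 3-element contraction present, 0FB2+0F81 comes out as
contiguous; adding 0FB2+0F71 to the input turns it into a discontiguous
contraction.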

Blocking also changes subtly when one proceeds from NFD to FCD. In
NFD, B blocks C if and only if ccc(B) = ccc(C), ccc(B) = 0 or ccc(C) =
0. Equivalently, B does *not* block C if and only if B and C are
distinct and <B,C> and <C,B> are canonically equivalent. For FCD, we
must use the latter definition. Additionally, the concept is only
defined if <B,C> is FCD. Determining this in the general case is not
quick. However, for Unicode 6.1.0 this can be greatly simplified by
replacing the ccc look-up function by eccc where:

eccc(uni(<0F71,x>)) = ccc(x)
eccc(x) = ccc(x) otherwise

and then using the first form of the NFD definition of blocking. This
simplification could be defeated by the addition of new non-singleton
decompositions for characters with non-zero ccc, which Unicode is about
to promise not to add, or the addition of new characters with the same
canonical combining class as U+0F71.
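
The two blocking tests can be compared with a short Python sketch. The
function names eccc, blocks and blocks_by_equivalence are illustrative
only, and the code relies on the Unicode data in Python's unicodedata
module rather than on 6.1.0 specifically. It assumes, as described
above, that the only characters needing special treatment are those
whose canonical decomposition is <0F71, x>.

```python
import unicodedata

def eccc(ch):
    """Effective ccc: for uni(<0F71,x>) use ccc(x), otherwise ccc(ch)."""
    decomp = unicodedata.decomposition(ch).split()
    # A canonical decomposition has no "<tag>" prefix, so a two-field
    # value starting with 0F71 is exactly the <0F71, x> case.
    if len(decomp) == 2 and decomp[0] == "0F71":
        return unicodedata.combining(chr(int(decomp[1], 16)))
    return unicodedata.combining(ch)

def blocks(b, c):
    """First (NFD-style) definition, with ccc replaced by eccc."""
    eb, ec = eccc(b), eccc(c)
    return eb == ec or eb == 0 or ec == 0

def blocks_by_equivalence(b, c):
    """Second definition: B does not block C iff B and C are distinct
    and <B,C> is canonically equivalent to <C,B>."""
    nfd = lambda s: unicodedata.normalize("NFD", s)
    return not (b != c and nfd(b + c) == nfd(c + b))

print(eccc("\u0f81"))                             # 130, i.e. ccc(0F80)
print(blocks("\u0334", "\u0f81"))                 # False: 0334 does not block 0F81
print(blocks_by_equivalence("\u0334", "\u0f81"))  # False: the definitions agree
print(blocks("a", "\u0301"))                      # True: a starter always blocks
```

Note that 0334 not blocking 0F81 does not rescue the <0FB2, 0334, 0F81>
example, because the contraction 0FB2+0F81 is contiguous and blocking
is never consulted for it.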

The nasty complication is determining what the contractions for FCD
processing are from the contractions for NFD processing. Sometimes a
finite set for NFD processing expands to an infinite set for FCD
processing.

Richard.
Received on Fri May 18 2012 - 11:16:57 CDT
