FCD and Collation

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Mon, 11 Feb 2013 23:47:07 +0000

Back in the topic 'Text in composed normalized form is king, right?
Does anyone generate text in decomposed normalized form?', I mentioned
that normalisation can be necessary even to collate FCD text
correctly, and gave two examples:

Danish (still at CLDR Version 22.1) <U+0061 LATIN SMALL LETTER A,
U+00E5 LATIN SMALL LETTER A WITH RING ABOVE>, for which there is an ICU
bug report http://bugs.icu-project.org/trac/ticket/9319

Default collation <U+0F71 TIBETAN VOWEL SIGN AA, U+0F73 TIBETAN VOWEL
SIGN II>

I remarked that the UCA (Technical Report 10) and LDML
(Technical Report 35) specifications, taken together, make sense only if
there is no such problem.

Before raising a specific Unicode bug, I think it would be worth
exploring the options. The concept of FCD is defined in Unicode
Technical Note #5 (UTN#5) Canonical Equivalence in Applications
http://www.unicode.org/notes/tn5/ , along with the concept of
canonical closure. Unicode has not admitted to endorsing it, so I
suspect that I can't raise a bug report against it!

The process of Unicode collation, in its full form, proceeds through
at least the following steps:

1) Normalise the text string to NFD.

2) Split a fully decomposed canonically equivalent string (a
rearrangement of the normalised string) into sequences of 'collating
elements'*. As the non-zero values of the canonical combining classes
are to a degree arbitrary, this rearrangement attempts to undo the
artificiality of the canonical order so as to better accord with the
language of the text. Compared to the original normalised string, these
collating elements may interleave.

3) Look up sequences of ordered n-tuples of numbers, known as
'collation elements'*, for each collating element.

4) Adjust the sequences of n-tuples to reduce the untoward effects of
symbols, spaces and punctuation.

5) Convert the n-tuples to a simple sequence of numbers - the 'sort
key' - that may be used for comparison.

*The term 'collating element' is taken from ISO 14651, 'collation
element' from the UCA. I am not at all sure how one distinguishes them
in French! The ordering is largely defined by the mapping from
collating elements to collation elements.
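Steps 3 to 5 can be illustrated with a toy three-level weight table.
The weights, the CE tuples and the level-separator scheme below are
invented for illustration; real tables come from the DUCET plus
tailorings:

```python
# Hypothetical three-level weights (primary, secondary, tertiary) for a
# toy alphabet; not real DUCET data.
WEIGHTS = {
    'a': (1, 1, 1),
    'A': (1, 1, 2),   # case difference: tertiary weight only
    'b': (2, 1, 1),
}

def sort_key(s):
    """Step 5 in miniature: concatenate the weights level by level, with
    a 0 separator between levels, so that any primary difference outranks
    all secondary differences, and so on down the levels."""
    ces = [WEIGHTS[c] for c in s]
    key = []
    for level in range(3):
        key.extend(ce[level] for ce in ces)
        key.append(0)
    return tuple(key)
```

With this table, sort_key('ab') < sort_key('b') by primary weight,
while 'ab' and 'Ab' are separated only at the tertiary level.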

Step 1 is a complete waste of time for most text in many languages, and
therefore there is great interest in omitting it. Step 2 is easy to
get wrong, especially if the text has not in fact been normalised.

The primary problem with UTN#5 is that it fails to address the issue of
decomposing the normalised string into collating elements, which is how
the two examples above fail. Markus Scherer has identified the problem
as being that in some collations, characters need to be split between
collating elements.

There are several tweaks and options that could be chosen.

The FCD check uses, for each character x, the canonical combining class
(ccc) of the leading element in its NFD decomposition, lcc(x), and the
canonical combining class of the trailing element, tcc(x). The FCD
check verifies, for each pair of adjacent characters x and y, that one
of the following three conditions holds:

(1) tcc(x) = 0; or
(2) lcc(y) = 0; or
(3) tcc(x) <= lcc(y)
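A minimal version of this check can be sketched with Python's standard
unicodedata module (the names lcc/tcc follow the text above; this is a
sketch of the definition, not ICU's implementation):

```python
import unicodedata

def lcc(ch):
    """Canonical combining class of the first character of ch's NFD
    decomposition."""
    return unicodedata.combining(unicodedata.normalize('NFD', ch)[0])

def tcc(ch):
    """Canonical combining class of the last character of ch's NFD
    decomposition."""
    return unicodedata.combining(unicodedata.normalize('NFD', ch)[-1])

def is_fcd(s):
    """True if every adjacent pair (x, y) satisfies one of the three
    conditions (1)-(3)."""
    return all(
        tcc(x) == 0 or lcc(y) == 0 or tcc(x) <= lcc(y)
        for x, y in zip(s, s[1:])
    )
```

Both problem strings above pass this check: the Danish <a, å> because
lcc(å) = 0, and the Tibetan <U+0F71, U+0F73> because
tcc(U+0F71) = lcc(U+0F73) = 129.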

The Tibetan example fails because the components of <U+0F73 TIBETAN
VOWEL SIGN II> have different canonical combining classes. <U+0F71,
U+0F73> decomposes to <U+0F71, U+0F71, U+0F72>, which is then split
into collating elements <U+0F71, U+0F72> and <U+0F71>. To stop this
case being FCD, we could replace condition (3) by

(3') tcc(x) <= lcc(y) and lcc(y) = tcc(y).
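Using the same lcc/tcc definitions as before, the tightened check might
look like this (a sketch only; whether (3') is the right repair is
exactly the question under discussion):

```python
import unicodedata

def lcc(ch):
    """ccc of the first character of ch's NFD decomposition."""
    return unicodedata.combining(unicodedata.normalize('NFD', ch)[0])

def tcc(ch):
    """ccc of the last character of ch's NFD decomposition."""
    return unicodedata.combining(unicodedata.normalize('NFD', ch)[-1])

def is_fcd_strict(s):
    """FCD check with condition (3) replaced by condition (3'):
    tcc(x) <= lcc(y) and lcc(y) == tcc(y)."""
    return all(
        tcc(x) == 0 or lcc(y) == 0
        or (tcc(x) <= lcc(y) and lcc(y) == tcc(y))
        for x, y in zip(s, s[1:])
    )
```

U+0F73 has lcc = 129 but tcc = 130, so <U+0F71, U+0F73> now fails the
check and would be routed through normalisation.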

Markus Scherer has suggested simply prohibiting composed characters
whose complete decomposition lacks characters with ccc = 0. I think the
difference amounts to one character, U+0344 COMBINING GREEK DIALYTIKA
TONOS, which decomposes to two characters with the same canonical
combining class.
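The characters affected by that prohibition can be enumerated from the
character database. Assuming the rule means 'composed characters whose
full canonical decomposition contains no character with ccc = 0', a
sketch of the enumeration:

```python
import sys
import unicodedata

def prohibited_under_ccc_rule():
    """Characters with a multi-character canonical decomposition in
    which every character has a non-zero canonical combining class."""
    result = []
    for cp in range(sys.maxunicode + 1):
        nfd = unicodedata.normalize('NFD', chr(cp))
        if len(nfd) > 1 and all(unicodedata.combining(c) for c in nfd):
            result.append(cp)
    return result
```

On the Unicode data shipped with current Python this yields (at least)
U+0344 plus the Tibetan vowels U+0F73, U+0F75 and U+0F81; only U+0344,
whose two components share ccc = 230, is allowed by condition (3') but
prohibited by the ccc rule.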

Note that the 'Tibetan example' comes from the default collation; the
most relevant language appears to be Sanskrit!

The next tweak would be to canonical closure. I should first comment
on a little-known potential issue with the generated collation element
table. Consider a hypothetical language whose collation differed from
the default by having a 'LETTER Y WITH DOUBLE ACUTE' as a letter of
the alphabet. There is no composed character for this. Suppose further
that it used a dot below as an ordinary accent. Now, when collating
<U+1EF5 LATIN SMALL LETTER Y WITH DOT BELOW, U+030B COMBINING DOUBLE
ACUTE ACCENT>, the NFD form would be <U+0079 LATIN SMALL LETTER Y,
U+0323 COMBINING DOT BELOW, U+030B>, and the relevant possible
collating elements would be:

<U+0079>
<U+0079, U+030B> ! a letter in this language
<U+030B>
<U+0323>

The correct decomposition into collating elements would then be:
<U+0079, U+030B>, <U+0323>
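A toy matcher for this hypothetical tailoring, which normalises first
and then allows the discontiguous contraction match that UCA step S2.1
describes, might look like this (the contraction set is invented, and
the blocking logic is a simplified sketch):

```python
import unicodedata

# Hypothetical tailoring: <y, double acute> is a letter of the language.
CONTRACTIONS = {('y', '\u030b')}

def collating_elements(s):
    """Split a string into collating elements, letting a starter pick up
    a later, non-blocked combining mark to complete a contraction
    (simplified from UCA S2.1)."""
    chars = list(unicodedata.normalize('NFD', s))
    out, consumed = [], set()
    for i, c in enumerate(chars):
        if i in consumed:
            continue
        elem = [c]
        if unicodedata.combining(c) == 0:
            prev_ccc = 0
            for j in range(i + 1, len(chars)):
                ccc = unicodedata.combining(chars[j])
                if ccc == 0 or ccc == prev_ccc:
                    break  # a new starter, or a blocked mark, ends the search
                if tuple(elem + [chars[j]]) in CONTRACTIONS:
                    elem.append(chars[j])
                    consumed.add(j)
                    break
                prev_ccc = ccc
        out.append(tuple(elem))
    return out
```

Both <U+1EF5, U+030B> and its NFD form <U+0079, U+0323, U+030B> come
out as <U+0079, U+030B>, <U+0323>, because the function normalises
first; the trouble starts precisely when that normalisation is skipped.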

When we form the 'canonical closure' as described in UTN#5, we also
generate an entry in the canonically closed collation table for
U+1EF5. When we try to use the FCD trick for the collation of
<U+1EF5, U+030B>, we are liable to decompose it into collating elements

<U+1EF5>, <U+030B>

which would have the same collation elements as

<U+0079>, <U+0323>, <U+030B>

instead of using the collating elements for <U+0079, U+030B>.
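This failure can be reproduced with a toy canonically closed table and
a contiguous longest-match lookup that skips normalisation. All table
entries and CE names below are invented for illustration:

```python
# Toy canonically closed table: closure has given U+1EF5 its own entry
# expanding to the collation elements of <y, dot below>.
CLOSED = {
    ('y',):          ['CE(y)'],
    ('y', '\u030b'): ['CE(y-with-double-acute)'],  # the tailored letter
    ('\u030b',):     ['CE(double-acute)'],
    ('\u0323',):     ['CE(dot-below)'],
    ('\u1ef5',):     ['CE(y)', 'CE(dot-below)'],   # added by closure
}

def fcd_mode_lookup(s):
    """Greedy longest contiguous match with no normalisation -- roughly
    what an FCD-mode collator does on text that passes the check."""
    out, i = [], 0
    while i < len(s):
        for j in (2, 1):
            key = tuple(s[i:i + j])
            if len(key) == j and key in CLOSED:
                out.extend(CLOSED[key])
                i += j
                break
        else:
            raise KeyError(s[i])
    return out
```

For <U+1EF5, U+030B> this produces the collation elements of <y>,
<U+0323>, <U+030B>; the contraction CE for <y, double acute> is never
found, although the table does find it when the contraction is
contiguous, as in <y, U+030B>.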

The tweak is for canonical closure to generate a collating
element for <U+1EF5, U+030B> (and similarly for all the other
precomposed characters containing 'y'). More subtly, the creation of a
collating element for <U+1EF5> should *not* lead to the creation of a
collating element for <U+0079, U+0323>, for that would lead to errors
in the detection of interleaving collating elements!
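A sketch of that tweak: for each precomposed character whose
decomposition begins with the contraction's base letter, generate a
closed-table key that re-attaches the contraction's mark. The key and
value formats here are invented for illustration:

```python
import sys
import unicodedata

BASE, MARK = 'y', '\u030b'   # the hypothetical contraction <y, double acute>

def tweaked_closure_keys():
    """For every precomposed character whose NFD form is the base letter
    followed only by combining marks, emit a key pairing it with the
    contraction's mark; the value records the contraction followed by
    the leftover marks from the precomposed character's decomposition."""
    keys = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        nfd = unicodedata.normalize('NFD', ch)
        if (len(nfd) > 1 and nfd[0] == BASE
                and all(unicodedata.combining(c) for c in nfd[1:])):
            # <ch, MARK> should collate as <BASE, MARK> plus the leftovers.
            keys[(ch, MARK)] = ((BASE, MARK),) + tuple((c,) for c in nfd[1:])
    return keys
```

Among others this generates the entry for <U+1EF5, U+030B>, mapping it
to the collating elements <y, U+030B>, <U+0323>.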

The handling of the Danish case would also be included in the
adjustment of canonical closure.

Does anyone feel up to rigorously justifying revisions to the concepts
and algorithms of FCD and canonical closure? Occasionally one will
encounter cases where the canonical closure is infinite - in these
cases, normalisation will be necessary regardless of the outcome of the
FCD check.

Perhaps one could merely revise the definition of FCD, and devise a test
for the adequacy of the current canonical closure. If the collation
fails this adequacy test, then again disabling normalisation should be
prohibited. (I would suggest that in these cases the normalisation
setting should be overridden with only the gentlest of chidings.)

A lazy option would be to wait (how long?) and then remove the option of no
normalisation on the grounds that sufficient computing power is
available.

Thoughts, anyone?

Richard.
Received on Mon Feb 11 2013 - 17:53:04 CST