Re: FCD and Collation

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 12 Feb 2013 20:19:19 +0000

On Mon, 11 Feb 2013 17:13:58 -0800
Markus Scherer <markus.icu_at_gmail.com> wrote:

> I would not revise FCD itself. For a number of processes, it is
> sufficient as is. For collation it's not.
>
> About the Tibetan precomposed vowels:
>
> For the LDML spec, I submitted a CLDR ticket this morning:
> http://unicode.org/cldr/trac/ticket/5667

If we want to proceed along the current lines, then all we need is
'CFCD' (Collation FCD), which differs from FCD by excluding characters
that decompose to two or more characters of which none have canonical
combining class zero. The motivation for the sterner exclusion is
provided by adding the following contrived collating elements to the
a default collation:

<U+03B1 GREEK SMALL LETTER ALPHA, U+0308 COMBINING DIAERESIS>
<U+0301 COMBINING ACUTE ACCENT, U+0345 COMBINING GREEK YPOGEGRAMMENI>

Proper canonical closure then requires contractions for:
a) <U+03B1, U+0344 COMBINING GREEK DIALYTIKA TONOS> - this sequence is
canonically equivalent to <U+03B1, U+0308, U+0301>,
b) <U+03B1, U+0344, U+0345>, and
c) <U+0344, U+0345>

Now consider the sequence <U+03B1, U+0359 COMBINING ASTERISK BELOW,
U+0344, U+0345>. Using the extended set of contractions, this
splits into the discontiguous collating elements <U+03B1, U+0344,
U+0345> and <U+0359>.

However, using the original contractions along with normalisation, we
obtain the collating elements <U+03B1, U+0308>, <U+0359>, <U+0301,
U+0345>, which in general will sort differently.

At first sight, the solution would be to prohibit <U+03B1, U+0344>
and <U+03B1, U+0344, U+0345> from being discontiguous collating
elements. But then the sequence will split into the collating
elements <U+03B1>, <U+0359>, <U+0344, U+0345> - yet a third set of
collating elements!

> For UTS #10 section 6.5, I just now submitted an error report on
> unicode.org .

> For ICU, we have a statement in the User Guide that overlaps between
> contractions and decomposition mappings are not supported, and we
> have a ticket for trying to fix this by building more data.

I believe the algorithm is less difficult than I thought. Having
reduced the set of characters in CFCD strings, the algorithm
simplifies. I use the following notation:

Let F be the set of all CFCD strings.
Let E(s) be the set of CFCD strings canonically equivalent to s.
Let U be the set of strings of length one.

Let T be a set of NFD collating elements. Then the canonical closure S
of T is the least set such that:

1) E(T) ⊂ S
2) If xu ∈ S, vy ∈ T, u and v are characters, and vy is the last
collation element in xuvy, then x(E(uv) ∩ U ∩ F)E(y) ⊂ S.

It may be argued that the concept of CFCD strings is too strict:

(i) The total exclusion of characters decomposing to two characters with
non-zero canonical combining class is too strict - I can't see any
problem in their following characters with trailing canonical class of
zero. This would cover most of the cases where they are likely to be
encountered.

(ii) It does not seem onerous for a collating application to decompose
the non-CFCD FCD characters on the fly.

(iii) Having a string in FCD is not very far from having it in NFD.
An iterator or stream delivering FCD can very simply be converted to
one delivering NFD - all that is required is a mapping from characters
to their NFD decompositions.

Richard.
Received on Tue Feb 12 2013 - 14:21:38 CST

This archive was generated by hypermail 2.2.0 : Tue Feb 12 2013 - 14:21:38 CST