Re: FCD and Collation

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Tue, 2 Apr 2013 14:32:43 -0700

Hi Richard,

I was looking again at your example where U+0344 causes bad results in
collation of FCD strings. See inline below.

On Tue, Feb 12, 2013 at 12:19 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> On Mon, 11 Feb 2013 17:13:58 -0800
> Markus Scherer <markus.icu_at_gmail.com> wrote:
>
> > I would not revise FCD itself. For a number of processes, it is
> > sufficient as is. For collation it's not.
> >
> > About the Tibetan precomposed vowels:
> >
> > For the LDML spec, I submitted a CLDR ticket this morning:
> > http://unicode.org/cldr/trac/ticket/5667
>
> If we want to proceed along the current lines, then all we need is
> 'CFCD' (Collation FCD), which differs from FCD by excluding characters
> that decompose to two or more characters of which none have canonical
> combining class zero. The motivation for the sterner exclusion is
> provided by adding the following contrived collating elements to a
> default collation:
>
> <U+03B1 GREEK SMALL LETTER ALPHA, U+0308 COMBINING DIAERESIS>
> <U+0301 COMBINING ACUTE ACCENT, U+0345 COMBINING GREEK YPOGEGRAMMENI>
>
> Proper canonical closure then requires contractions for:
> a) <U+03B1, U+0344 COMBINING GREEK DIALYTIKA TONOS> - this sequence is
> canonically equivalent to <U+03B1, U+0308, U+0301>,
> b) <U+03B1, U+0344, U+0345>, and
> c) <U+0344, U+0345>
>
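
The exclusion criterion quoted above ("decompose to two or more characters
of which none have canonical combining class zero") can be checked
mechanically. The following is only a sketch: the function name
cfcd_exclusions is made up for illustration, and the result depends on the
Unicode version shipped with the Python interpreter.

```python
import sys
import unicodedata

def cfcd_exclusions():
    """Hypothetical helper: code points whose full canonical decomposition
    has two or more characters, none of which has ccc == 0."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        # Skip characters with no mapping or with a compatibility mapping.
        if not mapping or mapping.startswith("<"):
            continue
        nfd = unicodedata.normalize("NFD", ch)
        if len(nfd) >= 2 and all(unicodedata.combining(c) != 0 for c in nfd):
            found.append(cp)
    return found

if __name__ == "__main__":
    for cp in cfcd_exclusions():
        print(f"U+{cp:04X} {unicodedata.name(chr(cp), '?')}")
```

The list should include U+0344 together with the Tibetan precomposed
vowels U+0F73, U+0F75 and U+0F81.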

This "proper canonical closure" assumes adding contractions for overlaps
between existing contractions and decomposition mappings.

Canonical closure will then also add the decompositions of b) and c):
d) <03B1, 0308, 0301, 0345>
e) <0308, 0301, 0345>
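
As a quick sanity check, the decompositions d) and e) can be reproduced
with Python's unicodedata module. The helper name hexseq is only for
illustration.

```python
import unicodedata

def hexseq(text):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(c):04X}" for c in text)

b = "\u03B1\u0344\u0345"   # contraction b)
c = "\u0344\u0345"         # contraction c)

d = unicodedata.normalize("NFD", b)
e = unicodedata.normalize("NFD", c)

print(hexseq(d))  # 03B1 0308 0301 0345
print(hexseq(e))  # 0308 0301 0345
```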

> Now consider the sequence <U+03B1, U+0359 COMBINING ASTERISK BELOW,
> U+0344, U+0345>. Using the extended set of contractions, this
> splits into the discontiguous collating elements <U+03B1, U+0344,
> U+0345> and <U+0359>.
>
> However, using the original contractions along with normalisation, we
> obtain the collating elements <U+03B1, U+0308>, <U+0359>, <U+0301,
> U+0345>, which in general will sort differently.
>
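
The example sequence passes the FCD test even though it is not in NFD,
which is why the two processing paths can diverge. Below is a minimal
sketch of an FCD check, assuming the usual definition (each character's
leading combining class must be zero or at least the trailing combining
class of the character before it). The names lead_trail_ccc and is_fcd are
made up here and are not ICU API.

```python
import unicodedata

def lead_trail_ccc(ch):
    """Combining classes of the first and last character of ch's NFD form."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])

def is_fcd(text):
    """True if text passes the FCD test described in the lead-in."""
    prev_trail = 0
    for ch in text:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

example = "\u03B1\u0359\u0344\u0345"
nfd_example = unicodedata.normalize("NFD", example)

print(is_fcd(example))          # True
print(example == nfd_example)   # False
```

Here U+0359 has combining class 220 and U+0344 decomposes to two
characters of class 230, so no reordering is needed and the FCD test
passes, yet NFD still rewrites U+0344 as <0308, 0301>.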

This is true when "using the original contractions", but I would argue that
the goal of canonical closure is that *with the canonically-closed
mappings* we get the same result for FCD input text (minus the Tibetan
composite vowels) as for NFD input text. The price is that such an
implementation will get different results for NFD input than an
implementation without overlap closure.

In your example: With the canonical closure adding contraction d) we obtain
the collating elements <03B1, 0308, 0301, 0345>, <0359> which will collate
the same as the FCD version.
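
The three outcomes discussed here can be reproduced with a small sketch of
discontiguous contraction matching. This is a simplified reading of the UCA
matching steps, not ICU's implementation: it only splits text into matched
strings and assigns no weights. The names ORIGINAL, CLOSED and
split_elements are invented for illustration, and CLOSED additionally
contains <03B1, 0308, 0301> (the decomposition of a), which step-by-step
matching needs as an intermediate prefix.

```python
import unicodedata

ALPHA = "\u03B1"

# The two contrived contractions from the example.
ORIGINAL = {ALPHA + "\u0308", "\u0301\u0345"}

# Assumed closure: a) to e) plus the decomposition of a).
CLOSED = ORIGINAL | {
    ALPHA + "\u0344",              # a)
    ALPHA + "\u0344\u0345",        # b)
    "\u0344\u0345",                # c)
    ALPHA + "\u0308\u0301\u0345",  # d)
    "\u0308\u0301\u0345",          # e)
    ALPHA + "\u0308\u0301",        # decomposition of a)
}

def split_elements(text, contractions):
    """Split text into matched strings; unlisted characters stand alone."""
    chars = list(text)
    out = []
    while chars:
        # Longest contiguous prefix that is a known contraction.
        n = 1
        for k in range(len(chars), 1, -1):
            if "".join(chars[:k]) in contractions:
                n = k
                break
        s = "".join(chars[:n])
        rest = chars[n:]
        # Try to extend the match with following unblocked non-starters.
        i = 0
        max_skipped = 0  # highest combining class among skipped characters
        while i < len(rest):
            ccc = unicodedata.combining(rest[i])
            if ccc == 0:
                break  # a starter blocks everything after it
            blocked = max_skipped >= ccc
            if not blocked and s + rest[i] in contractions:
                s += rest[i]
                del rest[i]
            else:
                max_skipped = max(max_skipped, ccc)
                i += 1
        out.append(s)
        chars = rest
    return out

fcd_text = ALPHA + "\u0359\u0344\u0345"
nfd_text = unicodedata.normalize("NFD", fcd_text)

print(split_elements(nfd_text, ORIGINAL))  # three elements
print(split_elements(nfd_text, CLOSED))    # two elements
print(split_elements(fcd_text, CLOSED))    # two elements
```

With CLOSED, the NFD and FCD inputs both split into one element for the
alpha with its three marks plus <0359>, while ORIGINAL yields <03B1, 0308>,
<0359>, <0301, 0345> for the NFD input.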

I think we should remove U+0344 from the FCD exclusions
<http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-collation.html#Collation_Settings>,
where I added it a few weeks ago. Instead, we should document two things:
an implementation (like ICU currently) which does not add the overlap
contractions will get different results for some FCD input than for the
equivalent NFD input, and an implementation that does add the overlaps will
get different results for some NFD input than an implementation that
doesn't add them.

markus
Received on Tue Apr 02 2013 - 16:37:44 CDT
