From: Markus Scherer <markus.icu_at_gmail.com>

Date: Tue, 2 Apr 2013 14:32:43 -0700

Hi Richard,

I was looking again at your example where U+0344 causes bad results in

collation of FCD strings. See inline below.

On Tue, Feb 12, 2013 at 12:19 PM, Richard Wordingham
<richard.wordingham_at_ntlworld.com> wrote:

> On Mon, 11 Feb 2013 17:13:58 -0800
> Markus Scherer <markus.icu_at_gmail.com> wrote:
>
> > I would not revise FCD itself. For a number of processes, it is
> > sufficient as is. For collation it's not.
> >
> > About the Tibetan precomposed vowels:
> >
> > For the LDML spec, I submitted a CLDR ticket this morning:
> > http://unicode.org/cldr/trac/ticket/5667

> If we want to proceed along the current lines, then all we need is
> 'CFCD' (Collation FCD), which differs from FCD by excluding characters
> that decompose to two or more characters of which none have canonical
> combining class zero. The motivation for the sterner exclusion is
> provided by adding the following contrived collating elements to the
> default collation:

> <U+03B1 GREEK SMALL LETTER ALPHA, U+0308 COMBINING DIAERESIS>
> <U+0301 COMBINING ACUTE ACCENT, U+0345 COMBINING GREEK YPOGEGRAMMENI>
>
> Proper canonical closure then requires contractions for:
> a) <U+03B1, U+0344 COMBINING GREEK DIALYTIKA TONOS> - this sequence is
> canonically equivalent to <U+03B1, U+0308, U+0301>,
> b) <U+03B1, U+0344, U+0345>, and
> c) <U+0344, U+0345>

This "proper canonical closure" assumes adding contractions for overlaps

between existing contractions and decomposition mappings.

Canonical closure will then also add the decompositions of b) and c):

d) <03B1, 0308, 0301, 0345>

e) <0308, 0301, 0345>
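
As a quick sanity check on d) and e), the decompositions can be reproduced with Python's standard unicodedata module. This is only an illustrative sketch: the helper name code_points is made up for the example, and the result depends on the Unicode data bundled with the Python build.

```python
import unicodedata

def code_points(s):
    """Format a string as space-separated 4-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)

# Contractions b) and c) from the quoted message.
b = "\u03B1\u0344\u0345"
c = "\u0344\u0345"

# NFD replaces U+0344 with its canonical decomposition <U+0308, U+0301>.
d = unicodedata.normalize("NFD", b)
e = unicodedata.normalize("NFD", c)

print(code_points(d))  # expected to match d)
print(code_points(e))  # expected to match e)
```

The two printed lines should be the sequences listed as d) and e) above.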

> Now consider the sequence <U+03B1, U+0359 COMBINING ASTERISK BELOW,
> U+0344, U+0345>. Using the extended set of contractions, this
> splits into the discontiguous collating elements <U+03B1, U+0344,
> U+0345> and <U+0359>.
>
> However, using the original contractions along with normalisation, we
> obtain the collating elements <U+03B1, U+0308>, <U+0359>, <U+0301,
> U+0345>, which in general will sort differently.

This is true when "using the original contractions", but I would argue that

the goal of canonical closure is that *with the canonically-closed

mappings* we get the same result for FCD input text (minus the Tibetan

composite vowels) as for NFD input text -- although such an implementation

will get different results for NFD input than an implementation without

overlap closure.
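
To make "FCD input text" concrete: the example sequence <03B1, 0359, 0344, 0345> passes the FCD check even though it is not in NFD. The following is a minimal sketch of that check, assuming the usual definition (each character's leading combining class, taken from its NFD form, must be zero or not less than the previous character's trailing combining class). The function name is_fcd is made up here and is not an ICU API.

```python
import unicodedata

def is_fcd(text):
    """Return True if text satisfies the FCD condition (sketch, not ICU's code)."""
    prev_trail = 0
    for ch in text:
        nfd = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(nfd[0])    # ccc of first decomposed char
        trail = unicodedata.combining(nfd[-1])  # ccc of last decomposed char
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = trail
    return True

fcd_text = "\u03B1\u0359\u0344\u0345"
nfd_text = unicodedata.normalize("NFD", fcd_text)

print(is_fcd(fcd_text), fcd_text == nfd_text)
print(is_fcd("\u03B1\u0344\u0359"))  # U+0359 (ccc 220) after U+0344 (ccc 230)
```

The first line should report that the sequence is FCD but differs from its NFD form; the second shows a reordered variant that fails the check.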

In your example: With the canonical closure adding contraction d) we obtain

the collating elements <03B1, 0308, 0301, 0345>, <0359> which will collate

the same as the FCD version.
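
The three splits discussed above can be reproduced with a small sketch of contraction matching. It follows my simplified reading of the UCA rule for discontiguous contractions (take the longest contiguous match, then extend it with unblocked non-starters); it is not ICU's implementation. The names seq, split_elements, show, ORIGINAL and CLOSED are made up for the example. One assumption goes beyond the list above: the closed table also contains <03B1, 0308, 0301>, the decomposition of a), because the step-by-step extension needs it as an intermediate prefix.

```python
import unicodedata

def seq(hex_points):
    """Turn '03B1 0308' into a tuple of characters."""
    return tuple(chr(int(h, 16)) for h in hex_points.split())

def split_elements(text, contractions):
    """Split text into collating elements (simplified UCA-style matching)."""
    chars = list(text)
    elements = []
    while chars:
        # Longest contiguous match; a single character always matches itself.
        n = 1
        for k in range(len(chars), 1, -1):
            if tuple(chars[:k]) in contractions:
                n = k
                break
        s, rest = chars[:n], chars[n:]
        # Try to extend the match with following unblocked non-starters.
        i, max_skipped = 0, 0
        while i < len(rest) and unicodedata.combining(rest[i]) != 0:
            c = rest[i]
            unblocked = max_skipped < unicodedata.combining(c)
            if unblocked and tuple(s + [c]) in contractions:
                s.append(c)
                del rest[i]
            else:
                max_skipped = max(max_skipped, unicodedata.combining(c))
                i += 1
        elements.append(tuple(s))
        chars = rest
    return elements

def show(elements):
    return [" ".join(f"{ord(ch):04X}" for ch in el) for el in elements]

ORIGINAL = {seq("03B1 0308"), seq("0301 0345")}
CLOSED = ORIGINAL | {
    seq("03B1 0344"), seq("03B1 0344 0345"), seq("0344 0345"),  # a), b), c)
    seq("03B1 0308 0301"),                                       # NFD of a) (assumed)
    seq("03B1 0308 0301 0345"), seq("0308 0301 0345"),           # d), e)
}

fcd_text = "".join(seq("03B1 0359 0344 0345"))
nfd_text = unicodedata.normalize("NFD", fcd_text)

print(show(split_elements(nfd_text, ORIGINAL)))  # NFD input, original table
print(show(split_elements(nfd_text, CLOSED)))    # NFD input, closed table
print(show(split_elements(fcd_text, CLOSED)))    # FCD input, closed table
```

With the original table the NFD input splits into three elements. With the closed table both the NFD and the FCD input split into one long contraction plus <0359>. Whether those two contractions carry identical collation elements is up to the closure itself and is not modelled here.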

I think we should remove U+0344 from the FCD exclusions

<http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-collation.html#Collation_Settings>

where I added it a few weeks ago. Instead, we should document that an

implementation (like ICU currently) which does not add the overlap

contractions will get some different FCD/NFD results, and an implementation

that does add the overlaps will get some different results for NFD than an

implementation that doesn't add the overlaps.

markus

Received on Tue Apr 02 2013 - 16:37:44 CDT
