From: Richard Wordingham <richard.wordingham_at_ntlworld.com>

Date: Tue, 12 Feb 2013 20:19:19 +0000

On Mon, 11 Feb 2013 17:13:58 -0800

Markus Scherer <markus.icu_at_gmail.com> wrote:

> I would not revise FCD itself. For a number of processes, it is

> sufficient as is. For collation it's not.

>

> About the Tibetan precomposed vowels:

>

> For the LDML spec, I submitted a CLDR ticket this morning:

> http://unicode.org/cldr/trac/ticket/5667

If we want to proceed along the current lines, then all we need is

'CFCD' (Collation FCD), which differs from FCD by excluding characters

that decompose to two or more characters of which none have canonical

combining class zero. The motivation for the sterner exclusion is

provided by adding the following contrived collating elements to the

default collation:

<U+03B1 GREEK SMALL LETTER ALPHA, U+0308 COMBINING DIAERESIS>

<U+0301 COMBINING ACUTE ACCENT, U+0345 COMBINING GREEK YPOGEGRAMMENI>

Proper canonical closure then requires contractions for:

a) <U+03B1, U+0344 COMBINING GREEK DIALYTIKA TONOS> - this sequence is

canonically equivalent to <U+03B1, U+0308, U+0301>,

b) <U+03B1, U+0344, U+0345>, and

c) <U+0344, U+0345>
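
The equivalence in (a), and the exclusion that defines CFCD, can be checked with a minimal Python sketch using the standard unicodedata module. The helper name excluded_from_cfcd is hypothetical and introduced only for illustration. It tests just the exclusion rule stated above (a full canonical decomposition of two or more characters, none of which has canonical combining class zero), not whether a whole string is FCD.

```python
import unicodedata

def excluded_from_cfcd(ch: str) -> bool:
    """True if ch decomposes canonically to 2+ characters, none with ccc 0.

    Hypothetical helper: it encodes only the extra exclusion that
    distinguishes CFCD from FCD, as described in the text.
    """
    nfd = unicodedata.normalize("NFD", ch)
    return len(nfd) >= 2 and all(unicodedata.combining(c) != 0 for c in nfd)

# U+0344 decomposes to <U+0308, U+0301>, so sequence (a) is canonically
# equivalent to <U+03B1, U+0308, U+0301>.
seq_a = "\u03b1\u0344"
print(unicodedata.normalize("NFD", seq_a) == "\u03b1\u0308\u0301")

print(excluded_from_cfcd("\u0344"))  # both parts have a non-zero ccc
print(excluded_from_cfcd("\u03ac"))  # decomposition starts with U+03B1, ccc 0
```

Under this reading, U+0344 is excluded from CFCD strings, while an ordinary precomposed letter such as U+03AC is not.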

Now consider the sequence <U+03B1, U+0359 COMBINING ASTERISK BELOW,

U+0344, U+0345>. Using the extended set of contractions, this

splits into the discontiguous collating elements <U+03B1, U+0344,

U+0345> and <U+0359>.

However, using the original contractions along with normalisation, we

obtain the collating elements <U+03B1, U+0308>, <U+0359>, <U+0301,

U+0345>, which in general will sort differently.
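
To make the normalised form of this sequence concrete, the following Python sketch prints its NFD code points together with their canonical combining classes. It uses only the standard unicodedata module; the variable names are arbitrary.

```python
import unicodedata

# <U+03B1, U+0359, U+0344, U+0345> from the example above.
s = "\u03b1\u0359\u0344\u0345"
nfd = unicodedata.normalize("NFD", s)

# Pair each code point of the NFD form with its canonical combining class.
classes = [(f"U+{ord(c):04X}", unicodedata.combining(c)) for c in nfd]
print(classes)
# [('U+03B1', 0), ('U+0359', 220), ('U+0308', 230), ('U+0301', 230), ('U+0345', 240)]
```

U+0359 has a lower combining class than U+0308, so, as far as I can tell, it does not block the discontiguous match of <U+03B1, U+0308>, which leaves <U+0359> and <U+0301, U+0345> as the remaining collating elements.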

At first sight, the solution would be to prohibit <U+03B1, U+0344>

and <U+03B1, U+0344, U+0345> from being discontiguous collating

elements. But then the sequence will split into the collating

elements <U+03B1>, <U+0359>, <U+0344, U+0345> - yet a third set of

collating elements!

> For UTS #10 section 6.5, I just now submitted an error report on

> unicode.org.

> For ICU, we have a statement in the User Guide that overlaps between

> contractions and decomposition mappings are not supported, and we

> have a ticket for trying to fix this by building more data.

I believe the algorithm is less difficult than I thought. Having

reduced the set of characters in CFCD strings, the algorithm

simplifies. I use the following notation:

Let F be the set of all CFCD strings.

Let E(s) be the set of CFCD strings canonically equivalent to s.

Let U be the set of strings of length one.

Let T be a set of NFD collating elements. Then the canonical closure S

of T is the least set such that:

1) E(T) ⊂ S

2) If xu ∈ S, vy ∈ T, u and v are characters, and vy is the last

collating element in xuvy, then x(E(uv) ∩ U ∩ F)E(y) ⊂ S.
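
The term E(uv) ∩ U ∩ F in rule 2 can be explored with a brute-force Python sketch. Both function names are hypothetical, and the sketch assumes that a single character is in F exactly when it is not one of the characters excluded from CFCD (a lone character is trivially FCD). It scans every code point, so it is meant for experimentation rather than production use, and its results depend on the Unicode version of the Python build.

```python
import unicodedata

def excluded_from_cfcd(ch: str) -> bool:
    """True if ch decomposes canonically to 2+ characters, none with ccc 0."""
    nfd = unicodedata.normalize("NFD", ch)
    return len(nfd) >= 2 and all(unicodedata.combining(c) != 0 for c in nfd)

def single_char_equivalents(u: str, v: str) -> list[str]:
    """Approximate E(uv) ∩ U ∩ F: single CFCD characters equivalent to uv."""
    target = unicodedata.normalize("NFD", u + v)
    found = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters
        ch = chr(cp)
        # Canonically equivalent to uv, and not excluded from CFCD.
        if unicodedata.normalize("NFD", ch) == target and not excluded_from_cfcd(ch):
            found.append(ch)
    return found

# Precomposed forms of alpha + acute are candidates for the closure.
print([f"U+{ord(c):04X}" for c in single_char_equivalents("\u03b1", "\u0301")])

# U+0344 is equivalent to <U+0308, U+0301> but is not a CFCD character.
print([f"U+{ord(c):04X}" for c in single_char_equivalents("\u0308", "\u0301")])
```

The second call illustrates why restricting to CFCD simplifies matters: a character such as U+0344 never contributes a single-character alternative.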

It may be argued that the concept of CFCD strings is too strict:

(i) The total exclusion of characters decomposing to two characters with

non-zero canonical combining class is too strict - I can't see any

problem when they follow a character whose trailing canonical combining

class is zero. This would cover most of the cases where they are likely

to be encountered.

(ii) It does not seem onerous for a collating application to decompose

the non-CFCD FCD characters on the fly.

(iii) Having a string in FCD is not very far from having it in NFD.

An iterator or stream delivering FCD can very simply be converted to

one delivering NFD - all that is required is a mapping from characters

to their NFD decompositions.
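
Point (iii) can be sketched in Python as a generator that decomposes each character independently and never reorders. The name fcd_to_nfd is hypothetical, and the sketch assumes its input really is FCD; it uses unicodedata.normalize on single characters as the mapping from characters to their NFD decompositions.

```python
import unicodedata
from typing import Iterable, Iterator

def fcd_to_nfd(chars: Iterable[str]) -> Iterator[str]:
    """Yield NFD characters from a character stream assumed to be FCD."""
    for ch in chars:
        # Per-character decomposition only. No reordering is attempted,
        # because FCD input guarantees that the concatenated
        # decompositions are already in canonical order.
        yield from unicodedata.normalize("NFD", ch)

# <U+03AC, U+0345> is FCD but not NFD; the simple mapping suffices.
s = "\u03ac\u0345"
print("".join(fcd_to_nfd(s)) == unicodedata.normalize("NFD", s))

# <U+00E9, U+0323> is not FCD, so the same mapping does not yield NFD.
t = "\u00e9\u0323"
print("".join(fcd_to_nfd(t)) == unicodedata.normalize("NFD", t))
```

The second case shows that the shortcut depends on the FCD guarantee: without it, a reordering step would be needed.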

Richard.

Received on Tue Feb 12 2013 - 14:21:38 CST
