Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 17 May 2012 23:00:51 +0100

On Thu, 17 May 2012 13:39:08 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Thu, May 17, 2012 at 1:02 PM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:
>
> > As x = 0F71, we also need the
> > contractions of x+0F73 (or x+0F71+0F72) with 0F72, 0F74 and 0F80 to
> > give the pair of long vowels. We don't need to worry about
> > <x+0F73,0F73> because that is not FCD.
 
> I am not following.
>
> Given contractions
>
> 0F71+0F71 (needed as a prefix of the next one)
>
> 0F71+0F73
>
> what other contractions do we need to add to avoid which problem?

If using DUCET, the collation elements for 0F71+0F71+0F72 are those for
<0F73, 0F71>, namely (at 6.1.0):

[.2572.0020.0002.0F73][.2570.0020.0002.0F71].

The correct collation elements for FCD sequence 0F71,0F71,0F72,0F72
are:

[.2572.0020.0002.0F73][.2572.0020.0002.0F73]

However, if we don't have a contraction (within the
strictly non-normalising tailoring) for 0F71+0F71+0F72+0F72, we will
incorrectly derive the collation element sequence

[.2572.0020.0002.0F73][.2570.0020.0002.0F71][.2571.0020.0002.0F72]
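
The derivation above can be reproduced with a small sketch. It assumes a
matcher that does only contiguous longest-match lookup, with no
normalisation and no skipping of combining marks. The table contents and
the function name are hypothetical; only the three collation elements are
taken from the text above.

```python
# Collation elements as quoted above (DUCET 6.1.0).
CE_0F71 = "[.2570.0020.0002.0F71]"
CE_0F72 = "[.2571.0020.0002.0F72]"
CE_0F73 = "[.2572.0020.0002.0F73]"

# Hypothetical table for a strictly non-normalising tailoring.
BASE_TABLE = {
    "\u0F71": [CE_0F71],
    "\u0F72": [CE_0F72],
    "\u0F73": [CE_0F73],
    "\u0F71\u0F72": [CE_0F73],
    # 0F71+0F71+0F72 maps to the elements for <0F73, 0F71>.
    "\u0F71\u0F71\u0F72": [CE_0F73, CE_0F71],
}


def collation_elements(text, table):
    """Greedy contiguous longest-match lookup (assumed matcher)."""
    longest = max(len(key) for key in table)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in table:
                result.extend(table[chunk])
                i += length
                break
        else:
            raise KeyError(f"no mapping for U+{ord(text[i]):04X}")
    return result


FCD_INPUT = "\u0F71\u0F71\u0F72\u0F72"

# Without the four-character contraction the trailing 0F72 is stranded.
without_fix = collation_elements(FCD_INPUT, BASE_TABLE)

# Adding the contraction for 0F71+0F71+0F72+0F72 gives two long vowels.
FIXED_TABLE = dict(BASE_TABLE)
FIXED_TABLE[FCD_INPUT] = [CE_0F73, CE_0F73]
with_fix = collation_elements(FCD_INPUT, FIXED_TABLE)

print("".join(without_fix))
print("".join(with_fix))
```

The first line printed is the incorrect three-element sequence, the
second the correct pair of long vowels.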

If we also have the contraction for 0F71+0F73+0F72 (implied by
0F71+0F71+0F72+0F72 through canonical closure, I believe), then from this
point on we are safe: every length mark (U+0F71) has its short
vowel, and appending a further U+0F71 would result in a non-FCD
sequence.

HOWEVER, you must *not* have the added contraction for 0F71+0F71.
Within the tailoring, 0F73 must have ccc 130. (This works because
only 0F71 has ccc 129.) Then, given an FCD sequence 0F71, 0F71, 0F73,
0F72, 0F72, this will contract to 0F71+0F73+0F72, 0F71, 0F72 and then to
0F71+0F73+0F72, 0F71+0F72, and we get the right collation elements.

> In principle, you are right. However, such a contraction is such a
> weird case that I think we could just forbid it. That is, forbid a
> set of contractions that would cause us to add infinite overlap
> contractions.

Do bear in mind that DUCET 6.1.0 requires an infinite set of
contractions if you are to collate FCD strings without doing some
normalisation, such as splitting the Tibetan long vowels.
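
One way to see where the infinite set comes from, as a sketch only: for
every n, the string of n length marks followed by n short vowels is
already in NFD (and therefore FCD) and is canonically equivalent to n
copies of U+0F73, so a matcher that does no normalisation needs a
separate entry for each n. The check below relies on the Unicode data
shipped with Python's unicodedata module; the helper name is mine.

```python
import unicodedata

LENGTH_MARK = "\u0F71"  # ccc 129
SHORT_I = "\u0F72"      # ccc 130
LONG_II = "\u0F73"      # canonically decomposes to 0F71 + 0F72


def stacked_run(n):
    """n length marks followed by n short vowels."""
    return LENGTH_MARK * n + SHORT_I * n


for n in range(1, 6):
    run = stacked_run(n)
    # The run is unchanged by NFD, so it is also an FCD string.
    already_nfd = unicodedata.normalize("NFD", run) == run
    # n long vowels decompose and reorder to exactly this run.
    same_as_long = unicodedata.normalize("NFD", LONG_II * n) == run
    print(n, already_nfd, same_as_long)
```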

Richard.
Received on Thu May 17 2012 - 17:03:28 CDT
