Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Richard Wordingham
Date: Fri, 18 May 2012 00:29:36 +0100

On Thu, 17 May 2012 15:42:37 -0700
Markus Scherer wrote:

> On Thu, May 17, 2012 at 3:00 PM, Richard Wordingham wrote:

>> HOWEVER, you must *not* have the added contraction for 0F71+0F71.

> If we don't have this prefix contraction, then we will miss a
> discontiguous-contraction match on <0F71, 0334, 0F71, 0F72>.

(a) <0F71, 0334, 0F71, 0F72> is not FCD.

(b) CE(<0F71, 0334, 0F71, 0F72>) = CE(0F71+0F72).CE(0334).CE(0F71).

(c) Are you thinking of <0FB2, 0334, 0F71, 0F80>, with *REVERSED* I?
As I've already said, DUCET 6.1.0 omits a contraction for 0FB2+0F71, and
so CE(<0FB2, 0334, 0F71, 0F80>) = CE(0FB2+0F80).CE(0334).CE(0F71), and a
strictly non-normalising tailoring therefore needs a contraction
for 0FB2+0334+0F71+0F80 = 0FB2+0334+0F81 to (i) strip the 0F80 from 0F81
and (ii) prevent the contraction 0FB2+0F81. Similarly, we need
contractions for 0FB2+0334+0334+0F81, 0FB2+05B1+0F81, 0FB2+0FB1+0334+0F81,
0FB2+05B2+0F81 and so on ad infinitum.

>> Within the tailoring, 0F73 must have ccc 130.

> No, it has ccc=0. I believe that an FCD-accepting implementation
> should work with the "leading ccc" and "trailing ccc" values rather
> than ccc itself.

lccc(0F73) = ccc(0F71) = 129
rccc(0F73) = ccc(0F72) = 130
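
These two values can be checked with a minimal Python sketch. It assumes
the Unicode data bundled with the Python interpreter, and the helper name
lead_trail_ccc is only illustrative. It takes the leading and trailing ccc
(called lccc and rccc above) to be the ccc of the first and last characters
of the canonical decomposition:

```python
import unicodedata


def lead_trail_ccc(cp: int) -> tuple[int, int]:
    """Return (leading ccc, trailing ccc) for a code point.

    Assumption for this sketch: these are the ccc values of the first
    and last characters of the code point's full canonical (NFD)
    decomposition.
    """
    nfd = unicodedata.normalize("NFD", chr(cp))
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])


# Print the plain ccc next to the leading/trailing pair for the
# Tibetan vowel signs discussed in this thread.
for cp in (0x0F71, 0x0F72, 0x0F73, 0x0F80, 0x0F81):
    ccc = unicodedata.combining(chr(cp))
    print(f"{cp:04X}: ccc={ccc} leading/trailing={lead_trail_ccc(cp)}")
```

For 0F73 the plain ccc is 0, while the leading/trailing pair is (129, 130),
which is the distinction being argued over above.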

However, if we do not allow 0F71,0F71,0F71,0F73 to contract as
0F71+0F73,0F71,0F71, we need infinitely many contractions to handle
pure (albeit highly dubious) Tibetan. We have to treat 0F73 as not
being blocked by 0F71.
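
To make the family of strings concrete, here is a rough Python sketch of an
FCD test. It is an illustration under stated assumptions, not any
implementation's actual code; the names is_fcd, lead_trail_ccc and cps are
made up for this example. It shows that <0F71, 0334, 0F71, 0F72> is not
FCD, whereas 0F71 repeated any number of times followed by 0F73 is FCD, and
prints the NFD form of each such string:

```python
import unicodedata


def lead_trail_ccc(ch: str) -> tuple[int, int]:
    """ccc of the first and last characters of ch's NFD decomposition."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])


def is_fcd(s: str) -> bool:
    """Sketch of the FCD condition: a character with non-zero leading
    ccc must not follow a character whose trailing ccc is greater."""
    prev_trail = 0
    for ch in s:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


def cps(s: str) -> str:
    """Format a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


# The string from point (a): 0334 (ccc 1) follows 0F71 (ccc 129).
quoted = "\u0F71\u0334\u0F71\u0F72"
print(cps(quoted), "FCD:", is_fcd(quoted))

# One string per n, each FCD, each containing the precomposed 0F73.
for n in range(1, 4):
    s = "\u0F71" * n + "\u0F73"
    nfd = unicodedata.normalize("NFD", s)
    print(cps(s), "FCD:", is_fcd(s), "NFD:", cps(nfd))
```

Each value of n gives a distinct FCD string, so a tailoring that refuses to
split 0F73 would need a separate contraction for every one of them.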

> > Do bear in mind that DUCET 6.1.0 requires an infinite set of
> > contractions if you are to collate FCD strings without doing some
> > normalisation, such as splitting the Tibetan long vowels.

> I will re-read your earlier emails to see if this is really the case.

See point (c) above.

Received on Thu May 17 2012 - 18:33:08 CDT
