Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Thu, 17 May 2012 15:42:37 -0700

On Thu, May 17, 2012 at 3:00 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> If using DUCET, the collation elements for 0F71+0F71+0F72 are those for
> <0F73, 0F71>, namely (at 6.1.0):
>
> [.2572.0020.0002.0F73][.2570.0020.0002.0F71].
>
> The correct collation elements for FCD sequence 0F71,0F71,0F72,0F72
> are:
>
> [.2572.0020.0002.0F73][.2572.0020.0002.0F73]
>
> However, if we don't have a contraction (within the
> strictly non-normalising tailoring) for 0F71+0F71+0F72+0F72, we will
> incorrectly derive the collation element sequence
>
> [.2572.0020.0002.0F73][.2570.0020.0002.0F71][.2571.0020.0002.0F72]
>

I see. It's because 0F71+0F73==0F73+0F71 (for canonical closure we should
have both versions) and the latter overlaps with decomposition mappings.
Sigh...
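
The equivalence can be checked quickly. The following is only a minimal sketch using Python's standard unicodedata module; the helper name nfd_hex is made up for illustration and is not taken from any collation implementation:

```python
import unicodedata

def nfd_hex(s):
    """Return the NFD form of s as a list of 4-digit hex code points."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", s)]

a = "\u0F71\u0F73"  # <0F71, 0F73>
b = "\u0F73\u0F71"  # <0F73, 0F71>

# 0F73 decomposes to <0F71, 0F72>; canonical reordering then sorts
# 0F71 (ccc=129) ahead of 0F72 (ccc=130) in both strings.
print(nfd_hex(a))                # ['0F71', '0F71', '0F72']
print(nfd_hex(b))                # ['0F71', '0F71', '0F72']
print(nfd_hex(a) == nfd_hex(b))  # True
```

Both strings normalise to <0F71, 0F71, 0F72>, which is why a contraction on one spelling needs a counterpart for the other.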

> HOWEVER, you must *not* have the added contraction for 0F71+0F71.
>

If we don't have this prefix contraction, then we will miss a
discontiguous-contraction match on <0F71, 0334, 0F71, 0F72>.

> Within the tailoring, 0F73 must have ccc 130.

No, it has ccc=0. I believe that an FCD-accepting implementation should
work with the "leading ccc" and "trailing ccc" values rather than ccc
itself.
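
To illustrate the distinction, here is a rough sketch in Python (standard unicodedata module only). It assumes that "leading ccc" and "trailing ccc" mean the ccc of the first and last code point of a character's NFD form; the function names lead_trail_ccc and is_fcd are hypothetical and do not come from ICU:

```python
import unicodedata

def lead_trail_ccc(ch):
    """Return (lccc, tccc): ccc of the first and last code point of ch's NFD form."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])

def is_fcd(s):
    """Simplified FCD test: fail when a character's non-zero lccc is
    lower than the tccc of the character before it."""
    prev_tccc = 0
    for ch in s:
        lccc, tccc = lead_trail_ccc(ch)
        if lccc != 0 and prev_tccc > lccc:
            return False
        prev_tccc = tccc
    return True

print(unicodedata.combining("\u0F73"))   # 0: the ccc of 0F73 itself
print(lead_trail_ccc("\u0F73"))          # (129, 130): from <0F71, 0F72>
print(is_fcd("\u0F71\u0F71\u0F72\u0F72"))  # True
print(is_fcd("\u0F73\u0F71"))            # False: tccc 130 > lccc 129
```

Under these assumptions 0F73 keeps ccc=0 but behaves like a 129..130 span when it sits next to other marks, so <0F73, 0F71> is not FCD while <0F71, 0F71, 0F72, 0F72> is.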

> Do bear in mind that DUCET 6.1.0 requires an infinite set of
> contractions if you are to collate FCD strings without doing some
> normalisation, such as splitting the Tibetan long vowels.
>

I will re-read your earlier emails to see if this is really the case. And
in Q3 I will try to write code to find the necessary overlaps between
contractions and decomposition mappings.

Thanks,
markus
Received on Thu May 17 2012 - 17:46:24 CDT
