Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 16 May 2012 00:42:46 +0100

I am puzzled as to how an implementation can compliantly implement the
tailoring of normalisation in the UCA.

Can an implementation be said to compliantly implement the tailoring of
normalisation if nominally turning it off actually has no effect? If
it can, my puzzlement goes away.

Simply removing the normalisation step from the UCA leaves an
implementation that does not correctly sort all FCD strings if there
is no other normalisation, so the recommendation that implementations
correctly sort FCD strings is disconcerting. It also makes me fear
that describing a compliant tailoring in terms of DUCET version
and 'defined' tailorings does not define the collation!

An example, which probably makes no more linguistic sense than many of
the cases in the tests provided for the UCA, is given by the FCD string
<U+0F71 TIBETAN VOWEL SIGN AA, U+0F73 TIBETAN VOWEL SIGN II,
U+0F80 TIBETAN VOWEL SIGN REVERSED I>. The UCD and UCA give this the
collation elements (CEs) of U+0F73 followed by the CEs of U+0F81
TIBETAN VOWEL SIGN REVERSED II, but
the DUCET weight table allkeys.txt gives no contraction for 0F71 0F73
0F80. Is a compliant implementation of the tailoring of normalisation
expected to detect the need for this contraction?
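
For anyone who wants to check the example, here is a minimal sketch
using only Python's standard unicodedata module. It tests the FCD
condition pairwise on canonical combining classes and prints the NFD
form of the string. The helper names (lead_trail_ccc, is_fcd,
code_points) are my own illustrations, not part of any collation
library, and the sketch does not compute collation elements at all.

```python
import unicodedata

def lead_trail_ccc(ch):
    """Combining classes of the first and last code points of ch's NFD form."""
    decomposed = unicodedata.normalize("NFD", ch)
    return (unicodedata.combining(decomposed[0]),
            unicodedata.combining(decomposed[-1]))

def is_fcd(s):
    """Pairwise FCD test: each character's leading ccc must be zero or
    not less than the trailing ccc of the character before it."""
    prev_trail = 0
    for ch in s:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

def code_points(s):
    """Render a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

# The string from the message: AA, II, REVERSED I.
example = "\u0F71\u0F73\u0F80"
example_nfd = unicodedata.normalize("NFD", example)

print("FCD:", is_fcd(example))
print("raw:", code_points(example))
print("NFD:", code_points(example_nfd))
```

The string passes the FCD test yet differs from its NFD form, in which
U+0F73 is split into <U+0F71, U+0F72>. That is the situation in which
skipping the normalisation step can change which contractions are seen.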

There is currently an issue that a family of FCD sequences such as
<U+0FB2 U+0334 U+0F81> might actually need an infinite number of
contractions (just keep repeating U+0334!), but that may yet be
dismissed as arising from some omissions in DUCET.
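
The family above can be generated mechanically. The sketch below
(again Python's unicodedata only, with family_member and ccc_list as
hypothetical helper names of mine) prints the canonical combining
classes of each member before and after NFD. My reading, offered as an
assumption rather than a statement of what the UCA text requires, is
that the trouble comes from U+0F81 being a starter in the
unnormalised string while its NFD pieces are non-starters.

```python
import unicodedata

def ccc_list(s):
    """Canonical combining class of every code point in s."""
    return [unicodedata.combining(c) for c in s]

def family_member(n):
    """U+0FB2, then n copies of U+0334, then U+0F81."""
    return "\u0FB2" + "\u0334" * n + "\u0F81"

for n in range(1, 4):
    raw = family_member(n)
    nfd = unicodedata.normalize("NFD", raw)
    # Raw form ends in a starter; NFD form ends in two non-starters.
    print(n, ccc_list(raw), ccc_list(nfd))
```

Each extra U+0334 yields a distinct unnormalised string, which is why
a table of contractions over unnormalised text would have no obvious
place to stop.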

Does anyone believe they have a compliant normalisation tailoring of
DUCET? Does it work for FCD strings? Unless I'm very much mistaken,
ICU doesn't (http://bugs.icu-project.org/trac/ticket/9323).

Richard.
Received on Tue May 15 2012 - 18:46:40 CDT