Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Richard Wordingham <>
Date: Wed, 16 May 2012 09:24:19 +0100

On Tue, 15 May 2012 21:33:03 -0700
Markus Scherer <> wrote:

> On Tue, May 15, 2012 at 4:42 PM, Richard Wordingham <> wrote:
> > I am puzzled as to how an implementation can compliantly implement
> > the tailoring of normalisation in the UCA.

> I think you mean something like "implement tailorings where
> contractions overlap with decomposition mappings" rather than
> tailoring of normalization.

No. Brute force use of NFD solves most problems.

> > Can an implementation be said to compliantly implement the tailoring
> > of normalisation if nominally turning it off actually has no effect?
> > If it can, my puzzlement goes away.

> The definition of a tailoring is not the problem. It is supposed to
> work in the expected way with a compliant implementation, regardless
> of how the implementation achieves that.

Section 5.1 of the UCA says that one may have a parametric
normalisation tailoring. Unfortunately, it is not clear to me how one
demonstrates that a normalisation tailoring of 'off' has or has not
been implemented correctly. Possibly it is any (necessarily
non-'Unicode compliant') collation that correctly sorts NFD (or is
it FCD?) strings but fails for some other strings. In which case, is it
necessary for it to fail for at least some strings? Obviously there
are inequivalent collations that achieve these effects. (For example,
my blocking test assumes that the string it is working on is in NFD. If
it did not, then the results given an arbitrary string would be

Now, the concept of a parametric normalisation tailoring could
be a confusion with the concept of having a function interface that
requires that input strings (as strings of codepoints, rather than as
text) be in a suitable format.
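
To make the distinction concrete, here is a minimal sketch of a key function with a normalisation switch. It is not the UCA: toy_sort_key is a hypothetical stand-in that uses raw code points as the "key", and only the on/off behaviour is meant to be illustrative. With the switch off, NFD input is handled exactly as before, but canonically equivalent non-NFD input is not.

```python
import unicodedata

def toy_sort_key(text, normalization=True):
    """Toy stand-in for a collation key: the tuple of code points.

    Assumption: this is NOT the UCA. It only mimics a normalisation
    switch in front of some key-building step.
    """
    if normalization:
        text = unicodedata.normalize("NFD", text)
    return tuple(ord(ch) for ch in text)

composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE (not NFD)
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT (already NFD)

# Normalisation on: canonically equivalent strings get the same key.
same_when_on = toy_sort_key(composed) == toy_sort_key(decomposed)

# Normalisation off: the non-NFD string gets a different key.
same_when_off = (toy_sort_key(composed, False)
                 == toy_sort_key(decomposed, False))

# NFD input is unaffected by the switch.
nfd_unaffected = (toy_sort_key(decomposed, False)
                  == toy_sort_key(decomposed, True))

print(same_when_on, same_when_off, nfd_unaffected)
```

Under this reading, 'off' amounts to a precondition on the caller (the input must already be NFD) rather than a different collation.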

> > Does anyone believe they have a compliant normalisation tailoring of
> > DUCET? Does it work for FCD strings? Unless I'm very much
> > mistaken, ICU doesn't
> > (
> I think this might be a duplicate of

Not quite. I believe pure Tibetan script can be sorted out by adding a
finite number of 'contractions' (I am not sure whether they are valid
for discontiguous contraction). No. 8052 needs a contraction <U+0FB2
U+0334 U+0F81> and for each substring that behaves like U+0334,
therefore an infinite set, therefore needing an algorithmic solution
rather than just a bigger table. (I don't dispute that solving 8052 is
likely to solve 9323.) However, it would surprise me if the collation
behaviour of <U+0FB2 U+0334 U+0F81> were changed. In so far as it is
linguistically meaningful, it is an error in DUCET that it doesn't
sort the same as <U+0FB2 U+0F81 U+0334>. (Of course, Tibetan collation
in DUCET is already very wrong for Tibetan script languages.)
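
As a rough illustration of why <U+0FB2 U+0334 U+0F81> is awkward, the sketch below checks whether a string is FCD and whether it is NFD, using only Python's unicodedata module. It assumes the usual definition of FCD: for each adjacent pair of characters, the decomposition of the second must not start with a non-zero combining class lower than the class on which the decomposition of the first ended. The helper names (lead_trail_ccc, is_fcd, is_nfd) are hypothetical, and nothing here models what ICU actually does.

```python
import unicodedata

def lead_trail_ccc(ch):
    """Combining classes of the first and last code points of NFD(ch)."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])

def is_fcd(text):
    """True if no character's decomposition starts with a non-zero
    combining class lower than the previous character's trailing class."""
    prev_trail = 0
    for ch in text:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

def is_nfd(text):
    """True if the string is unchanged by NFD normalisation."""
    return unicodedata.normalize("NFD", text) == text

first = "\u0FB2\u0334\u0F81"    # <U+0FB2 U+0334 U+0F81>
second = "\u0FB2\u0F81\u0334"   # <U+0FB2 U+0F81 U+0334>

for label, s in (("first", first), ("second", second)):
    print(label, "FCD:", is_fcd(s), "NFD:", is_nfd(s))
```

On this definition the first sequence passes the FCD check without being NFD, so an implementation that skips decomposition for FCD input sees U+0F81 undecomposed, with U+0334 sitting between it and U+0FB2.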

> Maybe I should even modify the ICU normalization FCD code (outside
> collation) to always decompose the Tibetan composite vowels.

Certainly the safest method! As the precomposed vowels are deprecated,
it even has merit independent of getting collation to work. Another
alternative would be to process part of a composite character when
forming the collation, but that gets messy.
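
A minimal sketch of the "always decompose" approach, done as a preprocessing step in Python rather than inside ICU. The assumption here is that "composite vowels" means U+0F73, U+0F75 and U+0F81, the Tibetan vowel signs that have canonical decompositions; the function name decompose_tibetan_composites is hypothetical. The sketch only replaces those code points and does not reorder the surrounding combining marks.

```python
import unicodedata

# Assumption: the composite vowels in question are U+0F73, U+0F75, U+0F81.
COMPOSITE_VOWELS = "\u0F73\u0F75\u0F81"

# Map each composite vowel to its canonical decomposition.
DECOMPOSE = {ord(ch): unicodedata.normalize("NFD", ch)
             for ch in COMPOSITE_VOWELS}

def decompose_tibetan_composites(text):
    """Replace the composite vowels with their decompositions.

    No canonical reordering is done here; a full NFD pass would still
    be needed if later marks have lower combining classes.
    """
    return text.translate(DECOMPOSE)

result = decompose_tibetan_composites("\u0FB2\u0334\u0F81")
print(" ".join(f"U+{ord(c):04X}" for c in result))
```

After this step the contraction table only ever sees the decomposed vowel signs, at the cost of touching text that a pure FCD check would have passed through unchanged.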

Received on Wed May 16 2012 - 03:28:31 CDT
