Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Wed, 16 May 2012 09:17:51 -0700

On Wed, May 16, 2012 at 1:24 AM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> Section 5.1 of the UCA says that one may have a parametric
> normalisation tailoring.

Aha :-)
When you write "normalisation tailoring" it sounds like you are tailoring
the normalization algorithm or properties (UAX #15).

Section 5.1 is about runtime parameters/attributes applied orthogonally to
the specification of a Collation Element Table.

There is not as much to it as you seem to think. Setting normalization=off
turns off the first step of the UCA, namely NFD normalization. You should
only do so if you know or assume that your text is already normalized, so
that step 1 would be a no-op. Section 5.1 then recommends that an
implementation offering this option still get correct results for any text
that passes the FCD test.
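
For readers who want to see what the FCD test amounts to, here is a minimal
sketch in Python. It assumes the usual definition (each character's leading
ccc is either 0 or not smaller than the trailing ccc of the character before
it) and uses only the standard unicodedata module. The helper names
lead_trail_ccc and is_fcd are made up for illustration; this is not ICU code.

```python
import unicodedata

def lead_trail_ccc(ch):
    """Return (leading ccc, trailing ccc) of ch, read off its NFD form."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])

def is_fcd(text):
    """Illustrative FCD test: no character may start with a non-zero ccc
    that is lower than the trailing ccc of the preceding character."""
    prev_trail = 0
    for ch in text:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

print(is_fcd("\u0FB2\u0334\u0F81"))  # composite vowel after the overlay
print(is_fcd("\u0FB2\u0F81\u0334"))  # overlay after the composite vowel
```

Text that passes this test may still contain precomposed characters such as
U+0F81, which is why an implementation running with normalization=off has to
look inside them.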

The UCA conformance statement does not explicitly cover behavior under
these parameters, I believe, but if an implementation gets bad results for
input for which it purports to get good ones, then that's a bug.

At least, if the bugs are limited to the three Tibetan composite vowels,
then most users and most text are not affected.

> > > Does anyone believe they have a compliant normalisation tailoring of
> > > DUCET? Does it work for FCD strings? Unless I'm very much
> > > mistaken, ICU doesn't
> > > (http://bugs.icu-project.org/trac/ticket/9323).
>
> > I think this might be a duplicate of
> > http://bugs.icu-project.org/trac/ticket/8052
>
> Not quite. I believe pure Tibetan script can be sorted out by adding a
> finite number of 'contractions' (I am not sure whether they are valid
> for discontiguous contraction). No. 8052 needs a contraction <U+0FB2
> U+0334 U+0F81> and for each substring that behaves like U+0334,
> therefore an infinite set, therefore needing an algorithmic solution
> rather than just a bigger table.

I don't think we need more contractions here. The problem with ticket #8052
is that the code misses the opportunity for discontiguous-contraction
matching because it sees ccc(0F73)=0. In my prototype code I use the
"leading ccc" which is 129, and 0F73 does get skipped when appropriate,
hence the test case with 0F73 that you had wondered about.

Note to Åke: I believe it is insufficient to look only at the single value
lccc=129 for these characters. In my prototype code, after skipping one of
them, I continue with the "trailing ccc" value of 130 or 132 to check
whether the next combining mark is blocked. I believe you need both the
leading and trailing values for correct FCD discontiguous contractions.
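
To make the leading/trailing distinction concrete, the following Python
sketch prints ccc, lccc and tccc for the three composite vowels and applies
a simplified "is the candidate blocked?" check. The function can_skip_to is
a hypothetical helper written for this illustration, one reading of the
approach described above and not the prototype code itself. It assumes that
a candidate is usable only if its leading ccc is non-zero and every skipped
character is a combining mark whose trailing ccc is strictly lower.

```python
import unicodedata

def lccc(ch):
    """Leading ccc: ccc of the first character of ch's NFD form."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])

def tccc(ch):
    """Trailing ccc: ccc of the last character of ch's NFD form."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[-1])

def can_skip_to(skipped, candidate):
    """Hypothetical check: may `candidate` extend a discontiguous match
    after the characters in `skipped` were passed over?"""
    lead = lccc(candidate)
    if lead == 0:
        return False  # behaves as a starter, never matched discontiguously
    # Each skipped character must be a combining mark (lccc != 0) whose
    # trailing ccc is below the candidate's leading ccc.
    return all(lccc(b) != 0 and tccc(b) < lead for b in skipped)

for cp in (0x0F73, 0x0F75, 0x0F81):
    ch = chr(cp)
    print(f"U+{cp:04X} ccc={unicodedata.combining(ch)} "
          f"lccc={lccc(ch)} tccc={tccc(ch)}")

print(can_skip_to("\u0334", "\u0F81"))  # overlay skipped, then 0F81
print(can_skip_to("\u0F73", "\u0F72"))  # 0F73 skipped, then 0F72
```

In the last call, comparing only lccc(0F73)=129 against ccc(0F72)=130 would
treat 0F72 as unblocked, whereas tccc(0F73)=130 shows that it is blocked.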

> (I don't dispute that solving 8052 is
> likely to solve 9323.) However, it would surprise me if the collation
> behaviour of <U+0FB2 U+0334 U+0F81> were changed. In so far as it is
> linguistically meaningful, it is an error in DUCET that it doesn't
> sort the same as <U+0FB2 U+0F81 U+0334>. (Of course, Tibetan collation
> in DUCET is already very wrong for Tibetan script languages.)
>

It's not "an error in DUCET" because UCA step 1 is to apply NFD in which
case these will become the same string. And <U+0FB2 U+0F81 U+0334> does not
pass the FCD test, so it will get at least reordered (and maybe
decomposed). The only problem is if we compare <U+0FB2 U+0334 U+0F81>
with <U+0FB2 U+0334 U+0F71 U+0F80> where an FCD-based implementation should
find the contraction in the first string (if it checks for "leading ccc"
not ccc) but not in the second. The addition of the two missing prefix
contractions requested in L2/12-131R will fix that.
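
The equivalence claimed here is easy to check with Python's standard
unicodedata module. The sketch below only exercises normalization and the
ccc values; it says nothing about DUCET weights or about how any collation
implementation behaves. The variable and helper names are illustrative.

```python
import unicodedata

def code_points(s):
    """Format a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

fcd_form = "\u0FB2\u0334\u0F81"        # passes the FCD test
non_fcd = "\u0FB2\u0F81\u0334"         # overlay follows the composite vowel
nfd_form = "\u0FB2\u0334\u0F71\u0F80"  # fully decomposed and ordered

for s in (fcd_form, non_fcd, nfd_form):
    print(code_points(s), "->", code_points(unicodedata.normalize("NFD", s)))

# Why the second string fails FCD: the trailing ccc of 0F81 is higher
# than the ccc of the overlay 0334 that follows it.
trail_0f81 = unicodedata.combining(unicodedata.normalize("NFD", "\u0F81")[-1])
ccc_0334 = unicodedata.combining("\u0334")
print(trail_0f81, ccc_0334)
```

All three inputs normalize to the same NFD string, so a UCA implementation
that performs step 1 cannot distinguish them.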

> > Maybe I should even modify the ICU normalization FCD code (outside
> > collation) to always decompose the Tibetan composite vowels.
>
> Certainly the safest method! As the precomposed vowels are deprecated,
> it even has merit independent of getting collation to work. Another
> alternative would be to process part of a composite character when
> forming the collation, but that gets messy.
>

I took another look at allkeys.txt. As far as I can tell, the problematic
characters (trailing parts 0F72, 0F74, 0F80 of the one-higher composite
vowels) occur only in contractions that correspond to the composites
themselves and in contractions like 0FB2 0F80 which are ok: If we get input
like 0FB2 0F81 we need not match the second part of 0F81 because 0FB2 0F81
itself (and 0FB2 0F71 0F80) is also a DUCET contraction.

So probably the simplest way to deal with contractions that contain 0F72,
0F74, 0F80 is to either forbid them in tailorings or to require that there
also be contractions that instead contain 0F73, 0F75, 0F81 respectively.

markus

-- 
Google Internationalization Engineering
Received on Wed May 16 2012 - 11:21:39 CDT
