Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Wed, 16 May 2012 16:03:08 -0700

On Wed, May 16, 2012 at 2:54 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> The tailoring 'locale' is not orthogonal.
>

Well, right, that one selects the Collation Element Table :-)

> The tailoring 'caseFirst' rather reshuffles the tertiary weights. I am
> not entirely convinced it is orthogonal

I suppose you could build a separate table depending on case options, but
it's not necessary. Instead, we modify the collation elements coming out of
UCA Step 2.
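
As a rough illustration of modifying collation elements after Step 2 instead of building a second table, here is a minimal Python sketch. The (primary, secondary, tertiary) tuples, the two tertiary values, and the names upper_first and sort_key are all assumptions made for the example; this is not how ICU stores or rewrites case information.

```python
# Hypothetical tertiary weights; real tables use a richer set of values.
LOWER, UPPER = 0x02, 0x08


def upper_first(ces):
    """Swap the lowercase and uppercase tertiary weights in (p, s, t) elements."""
    swap = {LOWER: UPPER, UPPER: LOWER}
    return [(p, s, swap.get(t, t)) for (p, s, t) in ces]


def sort_key(ces):
    """Level-by-level key: all primaries, then secondaries, then tertiaries."""
    return tuple(tuple(ce[level] for ce in ces) for level in range(3))


# Made-up collation elements for "a" and "A": they differ only at the tertiary level.
a = [(0x1000, 0x20, LOWER)]
A = [(0x1000, 0x20, UPPER)]

print(sort_key(a) < sort_key(A))                            # default order
print(sort_key(upper_first(A)) < sort_key(upper_first(a)))  # after the swap
```

The table lookup is untouched; only the elements it returns are rewritten before the sort key is built.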

> Similar remarks apply to 'reorder'. What if I move 'Q' and 'q' into
> the Cyrillic sequence?

Same here. We use a permutation on collation elements coming out of Step 2.
If a character has a primary weight in the Cyrillic primary-weights range,
then it gets reordered together with all of the other characters in that
range.
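
A small sketch of such a permutation, in Python. The range boundaries, the script names, and the helper build_permutation are invented for the example; real primary-weight ranges come from the collation data, and this is not a description of ICU's code.

```python
# Hypothetical primary-weight ranges in their default order.
RANGES = {"Latn": (0x1000, 0x1FFF), "Cyrl": (0x2000, 0x2FFF)}


def build_permutation(new_order):
    """Return a function that remaps primaries so the ranges appear in new_order."""
    base = min(lo for lo, _ in RANGES.values())
    offsets = {}
    for name in new_order:
        lo, hi = RANGES[name]
        offsets[name] = base - lo  # shift that moves this range to its new slot
        base += hi - lo + 1

    def permute(primary):
        for name, (lo, hi) in RANGES.items():
            if lo <= primary <= hi:
                return primary + offsets[name]
        return primary  # weights outside the listed ranges are left alone

    return permute


permute = build_permutation(["Cyrl", "Latn"])

# Suppose a tailoring gave 'q' a primary inside the Cyrillic range (0x2080),
# between two Cyrillic letters at 0x2050 and 0x20A0.
print([hex(permute(p)) for p in (0x2050, 0x2080, 0x20A0, 0x1100)])
```

Because the permutation looks only at the primary weight, a 'q' tailored into the Cyrillic range keeps its place among its Cyrillic neighbours when that range moves.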

> > There is not as much to it as you seem to think. What
> > normalization=off does is turn off the first step of the UCA
> > algorithm, namely NFD normalization, and you should only do so if you
> > know or assume that your text is already normalized so step 1 would
> > be a no-op. Then it recommends that an implementation that offers
> > this get the correct results if text is in any form of FCD.
>
> I presume the UCA and the Unicode Locale Data Markup Language (LDML)
> are meant to be aligned. In the LDML definition
> (http://unicode.org/reports/tr35/#Collation_Elements), it says,
> "If 'on', then the normal [UCA] algorithm is used. If 'off', then all
> strings that are in [FCD] will sort correctly, but others will not
> necessarily sort correctly". 'Will' is stronger than 'should'.
>

We have bugs in ICU, and known limitations (but we are reviewing them).

> > The UCA conformance statement does not explicitly cover behavior under
> > these parameters, I believe, but if an implementation gets bad
> > results for input for which it purports to get good ones, then that's
> > a bug.
>
> The way I am now reading this is that if a collation is tailored with
> normalisation 'off', then it is the responsibility of the user to
> only use FCD strings, and if he does not he cannot rely on its
> definitions being honoured.

Right.
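
For a user who wants to verify that precondition, a minimal FCD test can be sketched in Python with the standard unicodedata module. The function names are made up for the example, and the check follows the usual description of FCD (the canonical decomposition of each character, concatenated without reordering, is already in canonical order); results depend on the Unicode version bundled with the Python build.

```python
import unicodedata


def lead_trail_ccc(ch):
    """Combining classes of the first and last characters of ch's NFD form."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])


def is_fcd(text):
    """True if decomposing each character in place needs no reordering."""
    prev_trail = 0
    for ch in text:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


# The three Tibetan sequences discussed in this thread.
samples = ["\u0fb2\u0f81\u0334", "\u0fb2\u0334\u0f81", "\u0fb2\u0334\u0f71\u0f80"]
for s in samples:
    print([f"{ord(c):04X}" for c in s], is_fcd(s))
```

All three samples have the same NFD form, but only the last two pass the test; the first one needs reordering because U+0334 (ccc 1) follows U+0F81, whose decomposition ends in a character with ccc 130.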

> How a tailorable implementation implements
> this is up to it - it might choose to ignore the optimisation
> opportunity and always perform the NFD normalisation, it might
> decompose but not reorder, or it might use some subtler technique such
> as decomposing Tibetan vowels and applying singleton decompositions
> (e.g. 212B ANGSTROM SIGN) and decomposing characters whose decomposition
> starts with 'A' or 'a' (for Danish sanity!). The mechanism chosen
> would have to depend on the tailorings applied.
>

Right. An implementation can also choose not to offer the option, or any
options.

> > It's not "an error in DUCET" because UCA step 1 is to apply NFD in
> > which case these will become the same string. And <U+0FB2 U+0F81
> > U+0334> does not pass the FCD test, so it will get at least reordered
> > (and maybe decomposed). The only problem is if we compare <U+0FB2
> > U+0334 U+0F81> with <U+0FB2 U+0334 U+0F71 U+0F80> where an FCD-based
> > implementation should find the contraction in the first string (if it
> > checks for "leading ccc" not ccc) but not in the second. The addition
> > of the two missing prefix contractions requested in L2/12-131R will
> > fix that.
>
> And the absence of those prefix contractions is the error. (The problem
> arises because 0F71 has non-zero ccc.)
>

Right. It's not strictly an error in the sense that the UCA Algorithm works
as written, with Step 1 applying NFD.

However, the missing prefix contractions make some FCD sequences not behave
in a canonically equivalent way, which goes against the intention.

> An irritating consequence of adding a contraction for <U+0FB2 U+0F71> is
> that we THEN also need the hitherto redundant contractions of that
> prefix with the short vowels U+0F72 TIBETAN VOWEL SIGN I and U+0F74
> TIBETAN VOWEL SIGN U to get subjoined RA plus non-consonantal long
> vowels to collate properly.
>

Oh, to avoid the "Danish problem" of a contraction overlapping with a
decomposition mapping. Yes, you are right.

Adding overlap contractions could be done automatically (they are based on
canonical equivalence), while missing prefix contractions are safer done
manually (their behavior should be reviewed and chosen explicitly).

> > I took another look at allkeys.txt. As far as I can tell, the
> > problematic characters (trailing parts 0F72, 0F74, 0F80 of the
> > one-higher composite vowels) occur only in contractions that
> > correspond to the composites themselves and in contractions like 0FB2
> > 0F80 which are ok: If we get input like 0FB2 0F81 we need not match
> > the second part of 0F81 because 0FB2 0F81 itself (and 0FB2 0F71 0F80)
> > is also a DUCET contraction.
> >
> > So probably the simplest way to deal with contractions that contain
> > 0F72, 0F74, 0F80 is to either forbid them in tailorings or to require
> > that there also be contractions that instead contain 0F73, 0F75, 0F81
> > respectively.
>
> I think you mean the other way round.

No.

The problem is a contraction x+0F72 and input text x+0F73 where the inner
0F71 should be skipped. We can avoid this by adding a contraction for
x+0F73 (and one for the equivalent x+0F71+0F72).

On the other hand, x+0F73 (together with x+0F71+0F72) is harmless; it does
not match the second half of anything else. Separately, we should have the
prefix contraction x+0F71 so that discontiguous contractions match as
expected, but we don't need x+0F72.
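
To make the x+0F72 case concrete, here is a simplified Python sketch of contraction matching with discontiguous extension, loosely modelled on UCA Step 2.1. The function match_at_start, the choice of x = U+0F40, and the tailoring sets are all hypothetical, and the blocking rule is reduced to "no skipped character with the same or a higher combining class".

```python
import unicodedata


def match_at_start(text, contractions):
    """Return (matched string, remaining text) for the start of text."""
    # Longest contiguous prefix that is a contraction, else the first character.
    matched, end = text[0], 1
    for i in range(len(text), 1, -1):
        if text[:i] in contractions:
            matched, end = text[:i], i
            break
    # Try to extend the match with following unblocked non-starters.
    skipped, max_skipped_ccc, pos = [], 0, end
    while pos < len(text):
        c = text[pos]
        ccc = unicodedata.combining(c)
        if ccc == 0:
            break  # a starter ends the run of combining marks
        if ccc > max_skipped_ccc and matched + c in contractions:
            matched += c
        else:
            skipped.append(c)
            max_skipped_ccc = max(max_skipped_ccc, ccc)
        pos += 1
    return matched, "".join(skipped) + text[pos:]


x = "\u0f40"                # hypothetical base character
tailoring = {x + "\u0f72"}  # a tailoring with only x+0F72
nfd = unicodedata.normalize("NFD", x + "\u0f73")  # x + 0F71 + 0F72

print(match_at_start(nfd, tailoring))            # 0F71 is skipped, x+0F72 matches
print(match_at_start(x + "\u0f73", tailoring))   # FCD text: 0F73 hides the 0F72

# Adding x+0F73 and the equivalent x+0F71+0F72 lets both forms match contiguously.
fixed = tailoring | {x + "\u0f73", x + "\u0f71\u0f72"}
print(match_at_start(x + "\u0f73", fixed))
print(match_at_start(nfd, fixed))
```

With only x+0F72 in the table, the NFD text and the FCD text are matched differently, which is the mismatch described above; with the two added contractions each form matches as a whole, provided both are mapped to the same collation elements.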

markus
Received on Wed May 16 2012 - 18:06:43 CDT
