Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 16 May 2012 22:54:31 +0100

On Wed, 16 May 2012 09:17:51 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Wed, May 16, 2012 at 1:24 AM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:
>
> > Section 5.1 of the UCA says that one may have a parametric
> > normalisation tailoring.

> Section 5.1 is about runtime parameters/attributes applied
> orthogonally to the specification of a Collation Element Table.

For the tailorings 'strength', 'alternate', 'backwards', 'caseLevel',
and 'variableTop', I agree. For 'numeric' I expect you're right, but I
can imagine complications.

The tailoring 'locale' is not orthogonal.

The tailoring 'caseFirst' rather reshuffles the tertiary weights. I am
not entirely convinced it is orthogonal, and I'm not sure how it should
interact with the ordering of Danish 'aa', 'AA', 'Aa', 'aA', 'å' and
'Å'. It makes sense if all tailorings start with the DUCET (which is
the only case that the UCA definition cares about) and it is applied
before any reorderings of characters, but I am not sure that it is
orthogonal. What if I choose to have A <<< a < b <<< B on top of a
fixed ordering for the other pairs? Is it still orthogonal?

Similar remarks apply to 'reorder'. What if I move 'Q' and 'q' into
the Cyrillic sequence? (I've a recollection that this letter is used
in Kurdish written in Cyrillic.) I have been wondering if U+0078 LATIN
SMALL LETTER X should be made common script because of its use for
displaying Lao vowels, but perhaps the principle of separation of
scripts should lead to LAO LETTER SMALL X.

I can conceive of complications for hiraganaQuaternary if one
individually tailors quaternary weights! (The tertiary equality of
some of the mathematical letters feels wrong to me, though there may be
better ways of sorting that anomaly out than playing with quaternary
weights.)

> There is not as much to it as you seem to think. What
> normalization=off does is turn off the first step of the UCA
> algorithm, namely NFD normalization, and you should only do so if you
> know or assume that your text is already normalized so step 1 would
> be a no-op. Then it recommends that an implementation that offers
> this get the correct results if text is in any form of FCD.

I presume the UCA and the Unicode Locale Data Markup Language (LDML)
are meant to be aligned. In the LDML definition
(http://unicode.org/reports/tr35/#Collation_Elements), it says,
"If 'on', then the normal [UCA] algorithm is used. If 'off', then all
strings that are in [FCD] will sort correctly, but others will not
necessarily sort correctly". 'Will' is stronger than 'should'.
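
To make "in [FCD]" concrete, here is a minimal sketch of the FCD test
using Python's unicodedata module. The function names are mine and this
is not how ICU or any other implementation does it; it simply checks,
character by character, that the leading ccc of each character's
canonical decomposition is either zero or not less than the trailing
ccc of the preceding character's decomposition.

```python
import unicodedata

def lead_trail_ccc(ch):
    """Return (leading ccc, trailing ccc) of ch's full canonical decomposition."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])

def is_fcd(s):
    """True if s passes the FCD test (hypothetical helper, not a library API)."""
    prev_trail = 0
    for ch in s:
        lead, trail = lead_trail_ccc(ch)
        # A non-zero leading ccc must not be lower than the previous trailing ccc.
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = trail
    return True

print(is_fcd("\u0fb2\u0334\u0f81"))  # True
print(is_fcd("\u0fb2\u0f81\u0334"))  # False
```

Note that a precomposed character such as U+00C5 passes on its own, so
FCD text need not be NFD text.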

> The UCA conformance statement does not explicitly cover behavior under
> these parameters, I believe, but if an implementation gets bad
> results for input for which it purports to get good ones, then that's
> a bug.

The way I am now reading this is that if a collation is tailored with
normalisation 'off', then it is the responsibility of the user to
only use FCD strings, and if he does not he cannot rely on its
definitions being honoured. How a tailorable implementation implements
this is up to it - it might choose to ignore the optimisation
opportunity and always perform the NFD normalisation, it might
decompose but not reorder, or it might use some subtler technique such
as decomposing Tibetan vowels and applying singleton decompositions
(e.g. 212B ANGSTROM SIGN) and decomposing characters whose decomposition
starts with 'A' or 'a' (for Danish sanity!). The mechanism chosen
would have to depend on the tailorings applied.

Is this interpretation correct? No-one has confirmed that a tailoring
of normalisation need not have any effect.
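
As an illustration of the middle option above, "decompose but not
reorder", here is a rough Python sketch. The helper name is my own and
the approach is only one way such an implementation might behave: each
character is replaced by its full canonical decomposition, but no
canonical reordering is done across character boundaries.

```python
import unicodedata

def decompose_no_reorder(s):
    """Decompose each character on its own; never reorder marks across characters."""
    return "".join(unicodedata.normalize("NFD", ch) for ch in s)

# The singleton U+212B ANGSTROM SIGN ends up as A + COMBINING RING ABOVE.
print([hex(ord(c)) for c in decompose_no_reorder("\u212b")])

# A non-FCD string is decomposed but left in non-canonical order,
# so the result differs from true NFD.
s = "\u0fb2\u0f81\u0334"
print(decompose_no_reorder(s) == unicodedata.normalize("NFD", s))  # False
```

For FCD input this gives a string that is canonically ordered except
possibly inside runs where FCD already guarantees the order; for
non-FCD input, as the second example shows, it does not.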

> (I don't dispute that solving 8052 is
> > likely to solve 9323.) However, it would surprise me if the
> > collation behaviour of <U+0FB2 U+0334 U+0F81> were changed. In so
> > far as it is linguistically meaningful, it is an error in DUCET
> > that it doesn't sort the same as <U+0FB2 U+0F81 U+0334>. (Of
> > course, Tibetan collation in DUCET is already very wrong for
> > Tibetan script languages.)

Correction: "would NOT surprise me".

> It's not "an error in DUCET" because UCA step 1 is to apply NFD in
> which case these will become the same string. And <U+0FB2 U+0F81
> U+0334> does not pass the FCD test, so it will get at least reordered
> (and maybe decomposed). The only problem is if we compare <U+0FB2
> U+0334 U+0F81> with <U+0FB2 U+0334 U+0F71 U+0F80> where an FCD-based
> implementation should find the contraction in the first string (if it
> checks for "leading ccc" not ccc) but not in the second. The addition
> of the two missing prefix contractions requested in L2/12-131R will
> fix that.

And the absence of those prefix contractions is the error. (The problem
arises because 0F71 has non-zero ccc.)
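
The point about "leading ccc" versus ccc can be checked directly. The
following is just a sketch with Python's unicodedata module (variable
names are mine): U+0F81 itself has ccc 0, but its decomposition starts
with U+0F71, whose ccc is non-zero, and all three strings discussed
above have the same NFD form.

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize("NFD", s)

a = "\u0fb2\u0334\u0f81"          # passes the FCD test
b = "\u0fb2\u0f81\u0334"          # fails the FCD test
c = "\u0fb2\u0334\u0f71\u0f80"    # already NFD

# All three are canonically equivalent.
print(nfd(a) == nfd(b) == c)  # True

# ccc of the composite vowel versus leading ccc of its decomposition.
ccc_composite = unicodedata.combining("\u0f81")
leading_ccc = unicodedata.combining(nfd("\u0f81")[0])
print(ccc_composite, leading_ccc)  # 0 129
```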

An irritating consequence of adding a contraction for <U+0FB2 U+0F71> is
that we THEN also need the hitherto redundant contractions of that
prefix with the short vowels U+0F72 TIBETAN VOWEL SIGN I and U+0F74
TIBETAN VOWEL SIGN U to get subjoined RA plus non-consonantal long
vowels to collate properly.

> I took another look at allkeys.txt. As far as I can tell, the
> problematic characters (trailing parts 0F72, 0F74, 0F80 of the
> one-higher composite vowels) occur only in contractions that
> correspond to the composites themselves and in contractions like 0FB2
> 0F80 which are ok: If we get input like 0FB2 0F81 we need not match
> the second part of 0F81 because 0FB2 0F81 itself (and 0FB2 0F71 0F80)
> is also a DUCET contraction.
>
> So probably the simplest way to deal with contractions that contain
> 0F72, 0F74, 0F80 is to either forbid them in tailorings or to require
> that there also be contractions that instead contain 0F73, 0F75, 0F81
> respectively.

I think you mean the other way round. And I would read a contraction
containing a Tibetan long vowel II, REVERSED II, UU, RR or LL as
containing the length mark U+0F71 and the corresponding short vowel.
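
For the three long vowels with canonical decompositions, that reading
is just what NFD produces. A small Python sketch (the table and helper
are mine, for illustration only; RR and LL are left out because, if I
remember rightly, their decompositions are compatibility ones and so
NFD leaves them alone):

```python
import unicodedata

# Long vowels with canonical decompositions (names abbreviated as in the text).
LONG_VOWELS = {
    "II": "\u0f73",
    "UU": "\u0f75",
    "REVERSED II": "\u0f81",
}

def length_mark_and_short_vowel(ch):
    """Split a long vowel into (length mark, short vowel) via NFD."""
    parts = unicodedata.normalize("NFD", ch)
    if len(parts) != 2 or parts[0] != "\u0f71":
        raise ValueError("not a long vowel with a canonical decomposition")
    return parts[0], parts[1]

for name, ch in LONG_VOWELS.items():
    mark, short = length_mark_and_short_vowel(ch)
    print(name, "U+%04X" % ord(mark), "U+%04X" % ord(short))
```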

Richard.
Received on Wed May 16 2012 - 16:59:51 CDT
