Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Tue, 15 May 2012 21:33:03 -0700

On Tue, May 15, 2012 at 4:42 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> I am puzzled as to how an implementation can compliantly implement the
> tailoring of normalisation in the UCA.
>

I think you mean something like "implement tailorings where contractions
overlap with decomposition mappings" rather than tailoring of normalization.

> Can an implementation be said to compliantly implement the tailoring of
> normalisation if nominally turning it off actually has no effect? If
> it can, my puzzlement goes away.
>
> Simply removing the normalisation step from the UCA leaves an
> implementation that does not correctly sort all FCD strings if there
> is no other normalisation, so the recommendation that implementations
> correctly sort FCD strings is disconcerting. It also makes me fear
> that describing a compliant tailoring in terms of DUCET version
> and 'defined' tailorings does not define the collation!
>

The definition of a tailoring is not the problem. It is supposed to work in
the expected way with a compliant implementation, regardless of how the
implementation achieves that.

> An example, which probably makes no more linguistic sense than many of
> the cases in the tests provided for the UCA, is given by the FCD string
> <U+0F71 TIBETAN VOWEL SIGN AA, U+0F73 TIBETAN VOWEL SIGN II,
> U+0F80 TIBETAN VOWEL SIGN REVERSED I>. The UCD and UCA give this the
> collation element sequence of CEs of U+0F73 followed by the collation
> element sequence of CEs of U+0F81 TIBETAN VOWEL SIGN REVERSED II, but
> the DUCET weight table allkeys.txt gives no contraction for 0F71 0F73
> 0F80. Is a compliant implementation of the tailoring of normalisation
> expected to detect the need for this contraction?
>
> There is currently an issue that a family of FCD sequences such as
> <U+0FB2 U+0334 U+0F81> might actually need an infinite number of
> contractions (just keep repeating U+0334!), but that may yet be
> dismissed as arising from some omissions in DUCET.
>
> Does anyone believe they have a compliant normalisation tailoring of
> DUCET? Does it work for FCD strings? Unless I'm very much mistaken,
> ICU doesn't (http://bugs.icu-project.org/trac/ticket/9323).
>
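
For illustration, the quoted Tibetan example can be checked with Python's
standard unicodedata module. This is only a sketch: the helper names lccc,
tccc, is_fcd and hexes are made up for this example and are not ICU API,
and the FCD test is the usual pairwise "previous trail ccc <= next lead
ccc" check.

```python
import unicodedata

def lccc(ch):
    # Lead combining class: ccc of the first character of ch's canonical decomposition.
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])

def tccc(ch):
    # Trail combining class: ccc of the last character of that decomposition.
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[-1])

def is_fcd(s):
    # FCD test: each character's lccc must be 0 or at least the previous tccc.
    prev = 0
    for ch in s:
        lead = lccc(ch)
        if lead != 0 and prev > lead:
            return False
        prev = tccc(ch)
    return True

def hexes(s):
    # Render a string as space-separated code points.
    return " ".join(f"{ord(c):04X}" for c in s)

fcd_string = "\u0f71\u0f73\u0f80"  # <U+0F71, U+0F73, U+0F80> from the quoted mail
composites = "\u0f73\u0f81"        # <U+0F73, U+0F81>

print(is_fcd(fcd_string))
print(hexes(unicodedata.normalize("NFD", fcd_string)))
print(hexes(unicodedata.normalize("NFD", composites)))
```

The first string passes the FCD test, and both strings have the same NFD
form, 0F71 0F71 0F72 0F80. That is why the unnormalized FCD string has to
sort like U+0F73 followed by U+0F81.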

I think this might be a duplicate of
http://bugs.icu-project.org/trac/ticket/8052

I know of at least two problems with ICU discontiguous-contraction
processing in this area (aside from not handling nested/interleaved
contractions):

1. It checks for ccc values, but the Tibetan composite vowels have ccc=0. I
believe this gets fixed by using the lccc value (ICU property, same as ccc
of the first character of the Decomposition_Mapping), like I do in my
prototype code. This is ticket #8052.

2. If we have a contraction a+b where b is the trailing vowel in the
decomposition of a Tibetan composite vowel (where the leading vowel has a
different ccc and can be skipped), then we can't find the discontiguous
contraction in text that contains the composite vowel. I have a TODO
question for this in my prototype code. I think the best way to handle this
might be to add a special test in the incremental FCD check for the Tibetan
composite vowels and force NFD normalization for the surrounding piece of
text even if it otherwise passes the FCD test. This would require turning
ICU's normalization mode on, but that is necessary anyway to get
canonically equivalent results when the input is not already in FCD.
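
Problem 2 can be reproduced with a small Python sketch of UCA-style
discontiguous matching. It is not ICU's code and not my prototype: the
function match_contraction and the contraction KA + VOWEL SIGN I
(U+0F40 U+0F72) are hypothetical, chosen only so that the trailing vowel
of U+0F73 takes part in a contraction. The contiguous step is a simple
greedy extension, which is enough for a one-entry table.

```python
import unicodedata

def match_contraction(text, start, contractions):
    """Return (matched string, indices consumed), starting at text[start]."""
    s = text[start]
    used = [start]
    i = start + 1
    # Contiguous part: greedily extend while the longer string is in the table.
    while i < len(text) and s + text[i] in contractions:
        s += text[i]
        used.append(i)
        i += 1
    # Discontiguous part: scan following non-starters, skipping unmatched ones.
    max_skipped_ccc = 0
    while i < len(text):
        ccc = unicodedata.combining(text[i])
        if ccc == 0:
            break  # a character with ccc=0 blocks everything after it
        # Unblocked only if every skipped character has a lower ccc.
        if ccc > max_skipped_ccc and s + text[i] in contractions:
            s += text[i]
            used.append(i)
        else:
            max_skipped_ccc = max(max_skipped_ccc, ccc)
        i += 1
    return s, used

# Hypothetical tailoring: KA (U+0F40) contracts with VOWEL SIGN I (U+0F72).
contractions = {"\u0f40\u0f72"}

fcd_text = "\u0f40\u0f73"                          # KA + composite vowel II
nfd_text = unicodedata.normalize("NFD", fcd_text)  # KA + U+0F71 + U+0F72

print(match_contraction(nfd_text, 0, contractions))
print(match_contraction(fcd_text, 0, contractions))
```

On the NFD text the match skips U+0F71 (ccc=129) and consumes indices 0
and 2. On the FCD text the scan stops at U+0F73 because its ccc is 0, so
only index 0 is consumed and the contraction is missed.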

Maybe I should even modify the ICU normalization FCD code (outside
collation) to always decompose the Tibetan composite vowels.

markus

-- 
Google Internationalization Engineering
Received on Tue May 15 2012 - 23:42:41 CDT
