Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Fri, 18 May 2012 09:21:27 +0100

On Thu, 17 May 2012 21:32:19 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Thu, May 17, 2012 at 4:29 PM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:

> > As I've already said, DUCET 6.1.0 omits a contraction for 0FB2+0F71,
> > and so CE(<0FB2, 0334, 0F71, 0F80>) = CE(0FB2+0F80).CE(0334).CE(0F71),
> > and a strictly non-normalising tailoring therefore needs a
> > contraction for 0FB2+0334+0F71+0F80 = 0FB2+0334+0F81 to (i) strip
> > the 0F80 from 0F81 and (ii) prevent the contraction 0FB2+0F81.

> Ok, but assuming we didn't add 0FB2+0F71, why can't we add the
> contraction 0FB2+0F81 and have the 0334 and any other non-starter be
> handled via discontiguous matching?

Because then we wouldn't have DUCET 6.1.0, but instead would probably
have DUCET 6.2.0.

> And assuming we do add 0FB2+0F71 as requested in L2/12-131R, do we
> need infinite overlap contractions? See this
> spreadsheet<https://docs.google.com/spreadsheet/pub?key=0Ag3w_MjvUEoRdFVabUR5elltX3pObXNYRnV5VWNiRGc&output=html>
> .

I've started the process of requesting the four 'overlap' contractions.

I believe we won't need an infinity of overlap contractions if we add
0FB2+0F71. But we're then talking about DUCET 6.2.0, which doesn't yet
exist.

> > lccc(0F73) = ccc(0F71) = 129
> > rccc(0F73) = ccc(0F72) = 130

> The DUCET has the contraction 0F71+0F72, and we should find a
> discontiguous match on <0F71, 0F71, 0F71, 0F72> skipping the two
> middle 0F71. That string is equivalent to the FCD-passing string
> <0F71, 0F71, 0F73> but there is no 0F72 in sight there to complete
> the match if we don't modify the string.
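
To make the matching behaviour above concrete, here is a minimal Python sketch of discontiguous contraction matching. It is a simplification of the UCA rules, not any implementation's actual code: the function name match_contraction and the representation of the contraction table as a plain set of strings are assumptions made for illustration, and only the blocking test on canonical combining classes is modelled.

```python
import unicodedata


def ccc(ch):
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)


def match_contraction(text, contractions):
    """Return (matched, leftover) for the match starting at text[0].

    `contractions` is a hypothetical stand-in for a collation table:
    a set of strings that have their own collation elements.
    """
    # Longest contiguous prefix in the table, else the first character alone.
    end = 1
    for n in range(len(text), 1, -1):
        if text[:n] in contractions:
            end = n
            break
    s = text[:end]
    skipped = []
    i = end
    # Only the run of non-starters after the match is examined.
    while i < len(text) and ccc(text[i]) != 0:
        c = text[i]
        # c is blocked if a skipped character has an equal or higher ccc.
        blocked = any(ccc(b) >= ccc(c) for b in skipped)
        if not blocked and s + c in contractions:
            s += c
        else:
            skipped.append(c)
        i += 1
    return s, "".join(skipped) + text[i:]


table = {"\u0f71\u0f72"}
# Decomposed string: 0F72 is reachable past the two skipped 0F71.
decomposed = match_contraction("\u0f71\u0f71\u0f71\u0f72", table)
# FCD string: 0F73 has ccc 0, so the scan stops before any 0F72 is seen.
composed = match_contraction("\u0f71\u0f71\u0f73", table)
```

With this sketch the decomposed string yields the contraction 0F71+0F72 with two 0F71 left over, while the string ending in 0F73 yields only a lone 0F71. The same function, given a table holding just 0FB2+0F80, matches 0FB2+0F80 in <0FB2, 0334, 0F71, 0F80> and leaves 0334 and 0F71 behind.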

But if we have the implementation-generated contractions for 0F71+0F73
and 0F71+0F73+0F72 (and the other pairs based on pairs of vowels from
0F72, 0F74 and 0F80), and 0F73 (and the other long vowels) are not
blocked by 0F71, we're OK for UCA 6.1.0 at least as far back as UCA
4.1.0. (A collation has to cite a UCA/DUCET version to be fully
specified!) Now, these contractions are for non-normalised
operation, so the lack of 0F71+0F71 is probably legal beyond UCA 6.1.0
- non-normalised collations have to work for FCD, they don't have to be
well-formed.
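
For anyone wanting to check the lccc/rccc values and the FCD condition referred to above, the following is a small Python sketch built on the standard unicodedata module. The helper names lccc, rccc and passes_fcd are mine, and the check is the usual per-character formulation (the leading class of each character must be zero or not less than the trailing class of the previous one), assumed here rather than taken from any particular implementation.

```python
import unicodedata


def lccc(ch):
    """ccc of the first character of ch's canonical decomposition."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])


def rccc(ch):
    """ccc of the last character of ch's canonical decomposition."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[-1])


def passes_fcd(text):
    """True if no character's leading ccc is non-zero and below the
    trailing ccc of the character before it."""
    prev = 0
    for ch in text:
        lead = lccc(ch)
        if lead != 0 and lead < prev:
            return False
        prev = rccc(ch)
    return True


fcd_string = "\u0f71\u0f71\u0f73"          # <0F71, 0F71, 0F73>
nfd_string = unicodedata.normalize("NFD", fcd_string)
```

Here fcd_string passes the check although it is not normalised, and nfd_string is <0F71, 0F71, 0F71, 0F72>. By contrast <0F72, 0F73> fails, since lccc(0F73) = 129 is below ccc(0F72) = 130.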

> If we cannot find a way to handle this with a finite (actually, small)
> amount of data, then we either have to decompose those three Tibetan
> composite vowels before they reach the core collation code, or,
> frankly, we just document a limitation for ICU and point to the fact
> that the use of these three characters is
> "discouraged"<http://unicode.org/charts/PDF/U0F00.pdf> and they don't
> occur in any normalized text (e.g., NFC).
>
> The more I think about these the more I believe I could live with
> such a limitation. If we could get our code to support all of UCA,
> provide a dozen runtime attributes, compare strings and return two
> kinds of sort keys, be fast, and deliver correct results on all FCD
> input except if these three characters are involved, I would be quite
> happy.

Solve the Danish blemish before dismissing Tibetan. The solution to
both might be to decompose certain (generally collation-dependent)
characters on FCD input. DUCET 6.2.0 will also need infinitely many
contractions if another combining character is added with CCC equal
to 129.

Richard.
Received on Fri May 18 2012 - 03:26:36 CDT
