Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Markus Scherer
Date: Thu, 17 May 2012 21:32:19 -0700

On Thu, May 17, 2012 at 4:29 PM, Richard Wordingham wrote:

> On Thu, 17 May 2012 15:42:37 -0700
> Markus Scherer wrote:
> > On Thu, May 17, 2012 at 3:00 PM, Richard Wordingham wrote:
> >> HOWEVER, you must *not* have the added contraction for 0F71+0F71.
> > If we don't have this prefix contraction, then we will miss a
> > discontiguous-contraction match on <0F71, 0334, 0F71, 0F72>.
> (a) <0F71, 0334, 0F71, 0F72> is not FCD.

Sorry, more coffee for me next time...

It's still possible to have FCD text that requires a discontiguous match
for the contraction 0F71+0F71+0F72. The text would add one more 0F71 at the
beginning which would have to be skipped, but the match fails if the prefix
contraction is missing.
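
To make the FCD claims above easy to check, here is a minimal Python sketch of an FCD test. It is only an illustration, not ICU code: the helper names lead_trail_ccc and is_fcd are made up, and it relies solely on the standard unicodedata module (so the results reflect the Unicode version bundled with the Python build).

```python
import unicodedata


def lead_trail_ccc(ch):
    """Return (lccc, tccc): the combining classes of the first and last
    code points of the canonical decomposition (NFD) of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    return (unicodedata.combining(decomposed[0]),
            unicodedata.combining(decomposed[-1]))


def is_fcd(text):
    """Hypothetical FCD check: every non-zero lead combining class must be
    at least as large as the trail combining class of the previous character."""
    prev_trail = 0
    for ch in text:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = trail
    return True


# <0F71, 0334, 0F71, 0F72>: 0334 (ccc 1) follows 0F71 (ccc 129).
not_fcd = is_fcd("\u0F71\u0334\u0F71\u0F72")
# <0F71, 0F71, 0F73>: 0F73 decomposes to <0F71, 0F72>.
passes_fcd = is_fcd("\u0F71\u0F71\u0F73")
```

With this sketch, not_fcd comes out False and passes_fcd comes out True, which is consistent with point (a) and with the FCD-passing string discussed further down.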

> (b) CE(<0F71, 0334, 0F71, 0F72>) = CE(0F71+0F72).CE(0334).CE(0F71).
> (c) Are you thinking of <0FB2, 0334, 0F71, 0F80>, with *REVERSED* I?

I wasn't specifically thinking of that...

> As I've already said, DUCET 6.1.0 omits a contraction for 0FB2+0F71, and
> so CE(<0FB2, 0334, 0F71, 0F80>) = CE(0FB2+0F80).CE(0334).CE(0F71), and a
> strictly non-normalising tailoring therefore needs a contraction
> for 0FB2+0334+0F71+0F80 = 0FB2+0334+0F81 to (i) strip the 0F80 from 0F81
> and (ii) prevent the contraction 0FB2+0F81.

Ok, but assuming we didn't add 0FB2+0F71, why can't we add the contraction
0FB2+0F81 and have the 0334 and any other non-starter be handled via
discontiguous matching?

And assuming we do add 0FB2+0F71 as requested in L2/12-131R, do we need
infinite overlap contractions? See this

> lccc(0F73) = ccc(0F71) = 129
> rccc(0F73) = ccc(0F72) = 130
> However, if we do not allow 0F71,0F71,0F71,0F73 to contract as
> 0F71+0F73,0F71,0F71, we need infinitely many contractions to handle
> pure (albeit highly dubious) Tibetan. We have to treat 0F73 as not
> being blocked by 0F71.

This is not clear to me, but I see an issue which might be what you are
trying to say.

The DUCET has the contraction 0F71+0F72, and we should find a discontiguous
match on <0F71, 0F71, 0F71, 0F72> skipping the two middle 0F71. That string
is equivalent to the FCD-passing string <0F71, 0F71, 0F73> but there is no
0F72 in sight there to complete the match if we don't modify the string.
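
The following Python sketch illustrates that problem. It is a simplified reading of the UCA discontiguous-matching step, not ICU's implementation: the one-entry CONTRACTIONS table and the function match_at_start are hypothetical, and the contiguous part only extends a match one code point at a time.

```python
import unicodedata

# Hypothetical mini table: only the contraction 0F71+0F72 discussed here.
CONTRACTIONS = {("\u0F71", "\u0F72")}


def match_at_start(text, table=CONTRACTIONS):
    """Return (matched code points, remaining text) for the start of text."""
    matched = [text[0]]
    i = 1
    # Contiguous extension (simplified: grow one code point at a time).
    while i < len(text) and tuple(matched + [text[i]]) in table:
        matched.append(text[i])
        i += 1
    # Discontiguous extension over the following non-starters.
    skipped = []
    j = i
    while j < len(text):
        c = text[j]
        c_ccc = unicodedata.combining(c)
        if c_ccc == 0:
            break  # a starter ends the search
        # c is blocked if a skipped character has an equal or higher ccc.
        blocked = any(unicodedata.combining(b) >= c_ccc for b in skipped)
        if not blocked and tuple(matched + [c]) in table:
            matched.append(c)
        else:
            skipped.append(c)
        j += 1
    return tuple(matched), "".join(skipped) + text[j:]


decomposed = match_at_start("\u0F71\u0F71\u0F71\u0F72")
composite = match_at_start("\u0F71\u0F71\u0F73")
renormalized = match_at_start(
    unicodedata.normalize("NFD", "\u0F71\u0F71\u0F73"))
```

In this sketch, decomposed and renormalized both match 0F71+0F72 and leave <0F71, 0F71> behind, while composite matches only the single 0F71, because 0F73 is a starter (ccc 0) and stops the search before any 0F72 appears.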

If we cannot find a way to handle this with a finite (actually, small)
amount of data, then we either have to decompose those three Tibetan
composite vowels before they reach the core collation code, or, frankly, we
just document a limitation for ICU and point to the fact that the use of
these three characters is "discouraged" and they don't occur in any
normalized text (e.g., NFC).

The more I think about these the more I believe I could live with such a
limitation. If we could get our code to support all of UCA, provide a dozen
runtime attributes, compare strings and return two kinds of sort keys, be
fast, and deliver correct results on all FCD input except if these three
characters are involved, I would be quite happy.

Maybe we could lobby to change these characters to be "strongly
discouraged" or "deprecated" or "too hard to implement"...


Google Internationalization Engineering
Received on Thu May 17 2012 - 23:38:47 CDT