Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Thu, 17 May 2012 21:32:19 -0700

On Thu, May 17, 2012 at 4:29 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> On Thu, 17 May 2012 15:42:37 -0700
> Markus Scherer <markus.icu_at_gmail.com> wrote:
>
> > On Thu, May 17, 2012 at 3:00 PM, Richard Wordingham <
> > richard.wordingham_at_ntlworld.com> wrote:
>
> >> HOWEVER, you must *not* have the added contraction for 0F71+0F71.
>
> > If we don't have this prefix contraction, then we will miss a
> > discontiguous-contraction match on <0F71, 0334, 0F71, 0F72>.
>
> (a) <0F71, 0334, 0F71, 0F72> is not FCD.
>

Sorry, more coffee for me next time...

It's still possible to have FCD text that requires a discontiguous match
for the contraction 0F71+0F71+0F72. The text would add one more 0F71 at the
beginning which would have to be skipped, but the match fails if the prefix
contraction is missing.
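
For anyone who wants to check the FCD claims in this thread, here is a
minimal sketch in Python using only the standard unicodedata module. The
helper names (lead_trail_ccc, is_fcd) are made up for illustration and are
not ICU API. The sketch assumes the usual FCD condition: for each adjacent
pair, the trailing ccc of the first character's decomposition must not
exceed a non-zero leading ccc of the next one's.

```python
import unicodedata

def lead_trail_ccc(ch):
    """Return (lccc, tccc): the combining classes of the first and last
    characters of ch's canonical (NFD) decomposition."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])

def is_fcd(text):
    """True if no adjacent pair has tccc(prev) > lccc(next) != 0."""
    prev_tccc = 0
    for ch in text:
        lccc, tccc = lead_trail_ccc(ch)
        if lccc != 0 and prev_tccc > lccc:
            return False
        prev_tccc = tccc
    return True

# <0F71, 0334, 0F71, 0F72>: ccc 129 followed by ccc 1, so the check fails.
not_fcd = is_fcd("\u0f71\u0334\u0f71\u0f72")
# <0F71, 0F71, 0F73>: lccc(0F73) = 129 follows tccc(0F71) = 129, so it passes.
passes_fcd = is_fcd("\u0f71\u0f71\u0f73")
```

With these inputs not_fcd is False and passes_fcd is True, even though the
second string's NFD form is <0F71, 0F71, 0F71, 0F72>.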

> (b) CE(<0F71, 0334, 0F71, 0F72>) = CE(0F71+0F72).CE(0334).CE(0F71).
>
> (c) Are you thinking of <0FB2, 0334, 0F71, 0F80>, with *REVERSED* I?
>

I wasn't specifically thinking of that...

> As I've already said, DUCET 6.1.0 omits a contraction for 0FB2+0F71, and
> so CE(<0FB2, 0334, 0F71, 0F80>) = CE(0FB2+0F80).CE(0334).CE(0F71), and a
> strictly non-normalising tailoring therefore needs a contraction
> for 0FB2+0334+0F71+0F80 = 0FB2+0334+0F81 to (i) strip the 0F80 from 0F81
> and (ii) prevent the contraction 0FB2+0F81.

Ok, but assuming we didn't add 0FB2+0F71, why can't we add the contraction
0FB2+0F81 and have the 0334 and any other non-starter be handled via
discontiguous matching?

And assuming we do add 0FB2+0F71 as requested in L2/12-131R, do we need
infinite overlap contractions? See this spreadsheet:
<https://docs.google.com/spreadsheet/pub?key=0Ag3w_MjvUEoRdFVabUR5elltX3pObXNYRnV5VWNiRGc&output=html>

> lccc(0F73) = ccc(0F71) = 129
> rccc(0F73) = ccc(0F72) = 130
>
> However, if we do not allow 0F71,0F71,0F71,0F73 to contract as
> 0F71+0F73,0F71,0F71, we need infinitely many contractions to handle
> pure (albeit highly dubious) Tibetan. We have to treat 0F73 as not
> being blocked by 0F71.
>

This is not clear to me, but I see an issue which might be what you are
trying to say.

The DUCET has the contraction 0F71+0F72, and we should find a discontiguous
match on <0F71, 0F71, 0F71, 0F72> skipping the two middle 0F71. That string
is equivalent to the FCD-passing string <0F71, 0F71, 0F73> but there is no
0F72 in sight there to complete the match if we don't modify the string.
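
To make the problem concrete, here is a rough Python sketch of
discontiguous matching in the spirit of UCA step S2.1. It is simplified and
is not ICU's implementation: match_discontiguous is a hypothetical helper,
CONTRACTIONS is a one-entry stand-in rather than real DUCET data, and a
non-starter is treated as blocked when an earlier skipped character has an
equal or higher ccc.

```python
import unicodedata

ccc = unicodedata.combining

def match_discontiguous(text, start, contractions):
    """Return (matched, skipped, next_index) for the match beginning at start."""
    # Longest contiguous match; a single character always matches itself.
    end = start + 1
    for stop in range(len(text), start + 1, -1):
        if text[start:stop] in contractions:
            end = stop
            break
    matched = text[start:end]
    skipped = []
    i = end
    # Walk the following non-starters; stop at the first starter (ccc == 0).
    while i < len(text) and ccc(text[i]) != 0:
        c = text[i]
        # c is blocked if a skipped character before it has ccc >= ccc(c).
        blocked = any(ccc(b) >= ccc(c) for b in skipped)
        if not blocked and matched + c in contractions:
            matched += c          # extend the match and consume c
        else:
            skipped.append(c)     # leave c to be processed on its own
        i += 1
    return matched, "".join(skipped), i

CONTRACTIONS = {"\u0f71\u0f72"}  # stand-in table, not real DUCET data

decomposed = match_discontiguous("\u0f71\u0f71\u0f71\u0f72", 0, CONTRACTIONS)
composite = match_discontiguous("\u0f71\u0f71\u0f73", 0, CONTRACTIONS)
```

On the decomposed string the sketch matches 0F71+0F72 and skips the two
middle 0F71. On <0F71, 0F71, 0F73> the scan stops at 0F73, whose own ccc is
0, so only the bare 0F71 is matched and the contraction is never found.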

If we cannot find a way to handle this with a finite (actually, small)
amount of data, then we either have to decompose those three Tibetan
composite vowels before they reach the core collation code, or, frankly, we
just document a limitation for ICU and point to the fact that the use of
these three characters is "discouraged"
<http://unicode.org/charts/PDF/U0F00.pdf> and they don't occur in any
normalized text (e.g., NFC).

The more I think about these the more I believe I could live with such a
limitation. If we could get our code to support all of UCA, provide a dozen
runtime attributes, compare strings and return two kinds of sort keys, be
fast, and deliver correct results on all FCD input except if these three
characters are involved, I would be quite happy.

Maybe we could lobby to change these characters to be "strongly
discouraged" or "deprecated" or "too hard to implement"...

markus

-- 
Google Internationalization Engineering
Received on Thu May 17 2012 - 23:38:47 CDT
