Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sun, 20 May 2012 16:15:24 +0100

On Sat, 19 May 2012 01:12:17 +0100
Richard Wordingham <richard.wordingham_at_ntlworld.com> wrote:

> Just in case you haven't already thought of it, one reasonable scheme
> would be to decompose input if and only if searching for contractions
> or the input character could *hide* the start of a contraction, e.g.
> one starting with a combining accent or the non-initial part of an
> Indic vowel.

You may think the suggestion about hiders is excessive, but a real
example of hiding occurs when subjecting the current Lithuanian
collation in CLDR, which has a humanly unreadable contraction making
0307+0301 collate the same as U+0301 so as to undo ill-effects of
soft-dottedness, to arbitrary FCD strings. U+0117 LATIN SMALL LETTER E
WITH DOT ABOVE is protected from this contraction because it is the
subject of yet another contraction. However, even with full
optimisation switched on, the ICU demonstrator sorts NFC & FCD string
<U+0227 LATIN SMALL LETTER A WITH DOT ABOVE, U+0301> differently to its
NFD equivalent <U+0061, U+0307, U+0301>, which, in accordance with
collation rules, sorts identically to U+00E1 LATIN SMALL LETTER A WITH
ACUTE. Toggling the normalisation setting has no effect on the ICU
outcome. I don't know if ICU needs another bug report.
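
The hiding itself can be illustrated without ICU. The following is only
a minimal sketch using Python's standard unicodedata module; it does no
collation at all, and the names CONTRACTION, nfc_form and nfd_form are
made up for the illustration. It shows that the code point sequence
0307+0301 is present in the NFD string but not visible in the NFC
string, because U+0307 is folded into U+0227.

```python
import unicodedata

# The contraction described above: U+0307 followed by U+0301.
CONTRACTION = "\u0307\u0301"

# <U+0227 LATIN SMALL LETTER A WITH DOT ABOVE, U+0301>
nfc_form = "\u0227\u0301"
# Its canonical decomposition: <U+0061, U+0307, U+0301>
nfd_form = unicodedata.normalize("NFD", nfc_form)


def code_points(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


print(code_points(nfc_form))
print(code_points(nfd_form))

# The two strings are canonically equivalent...
print(unicodedata.normalize("NFC", nfd_form) == nfc_form)

# ...but a naive scan for the contraction only finds it in the NFD form.
print(CONTRACTION in nfc_form)
print(CONTRACTION in nfd_form)
```

A matcher that looks for contractions in undecomposed input would need
either to decompose U+0227 first or to carry a derived contraction for
it.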

Formally, one could just handle it with 22 times 2 (case) times 3
(Lithuanian intonation accents) = 132 derived contractions as opposed
to tagging 46 (44 if clever) characters as needing decomposition.
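
One possible way to find candidates for such tagging is sketched below,
again with Python's unicodedata module. The criterion (canonical
decomposition ends in U+0307), the scanned range and the names
hides_dot_above and hiders are all assumptions made for the
illustration. The result depends on the Unicode version shipped with
Python, and the sketch is not claimed to reproduce the figures of 46 or
44 given above.

```python
import unicodedata

DOT_ABOVE = "\u0307"


def hides_dot_above(ch):
    """True if ch is precomposed and its NFD form ends in U+0307.

    Such a character can hide the start of a contraction that begins
    with U+0307 when the input is not decomposed.
    """
    nfd = unicodedata.normalize("NFD", ch)
    return len(nfd) > 1 and nfd.endswith(DOT_ABOVE)


# Assumption: scanning U+0080..U+1FFF is enough to cover the Latin
# blocks of interest, including Latin Extended Additional.
hiders = [chr(cp) for cp in range(0x0080, 0x2000) if hides_dot_above(chr(cp))]

for ch in hiders:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(len(hiders), "candidate characters")
```

Whether a character such as U+0117 then really needs tagging depends on
the tailoring, since it may already be covered by a contraction of its
own.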

For the general case, we ought to be able to express a rule such as
'ignore the countering of soft-dottedness', as in Lithuanian casing, but
I don't see any finite method of expressing it under the UCA, just as
handling Thai-style preposed vowels requires a great many contractions,
and handling Lao collation gets even worse - <P,C,T,V> needs to be
collated as though <C,T,P+V> (or, equivalently for well-formed text,
<C,P+V,T>). (We also need not just 'backwards' as an option for Level 2,
but a rule that a secondary difference before certain breaks takes
precedence over a primary difference after them.)

I spoke above of the ill-effects of soft-dottedness - I appreciate
that not having soft-dottedness causes its own problems.

Richard.
Received on Sun May 20 2012 - 10:19:29 CDT
