Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Richard Wordingham <>
Date: Sat, 19 May 2012 01:12:17 +0100

On Fri, 18 May 2012 09:51:34 -0700
Markus Scherer <> wrote:

> On inspection, we think we can do better (and want to), probably by
> adding overlap contractions. If we get into trouble with that, we
> will think of alternatives. One is to decompose more characters even
> in FCD input. Another is to keep documenting a limitation *when
> normalization is off*.

Just in case you haven't already thought of it, one reasonable scheme
would be to decompose an input character if and only if you are already
searching for a contraction or the character could *hide* the start of
one, e.g. a contraction starting with a combining accent or with the
non-initial part of an Indic vowel. By then one will already have left,
or be about to leave, the 'fast loop', and of course converting FCD to
NFD is easy, as no rearrangement is required. The contractions that
need to be added are merely the canonical closure of all the explicitly
defined contractions, reduced by requiring that each contraction
definition be in NFD after its first character. This will then work for
DUCET 6.1.0, work for Danish, and work for my mischievous contraction
of U+0302 COMBINING CIRCUMFLEX ACCENT + U+0067 LATIN SMALL LETTER G.
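
To make the 'hiding' test concrete, here is a rough Python sketch of the
second half of that condition only (a character whose decomposition
conceals the start of a contraction). It is not ICU's code; the function
names and the one-entry contraction table are made up for illustration,
and it assumes the input is already FCD, so that decomposing characters
one at a time needs no reordering.

```python
import unicodedata


def nfd(s):
    """Canonical decomposition (NFD) of a string."""
    return unicodedata.normalize("NFD", s)


def hides_contraction_start(ch, contraction_starts):
    """True if a non-initial code point of ch's NFD form starts a contraction.

    Hypothetical helper: the first code point is skipped because a
    contraction starting there is not hidden by the composed character.
    """
    return any(c in contraction_starts for c in nfd(ch)[1:])


def decompose_where_needed(text, contractions):
    """Decompose only those characters that could hide a contraction start.

    Assumes text is FCD, so per-character decomposition yields correctly
    ordered output without any rearrangement of combining marks.
    """
    starts = {c[0] for c in contractions}
    return "".join(
        nfd(ch) if hides_contraction_start(ch, starts) else ch
        for ch in text
    )


# Illustrative table: the 'mischievous' contraction U+0302 + U+0067.
contractions = {"\u0302g"}

# U+00EA (e with circumflex) hides U+0302, so it is decomposed;
# U+00E9 (e with acute) hides nothing relevant and is left alone.
result = decompose_where_needed("\u00eag \u00e9", contractions)
```

With this table the U+00EA is split so that the U+0302 + g contraction
becomes visible to the matcher, while other precomposed characters stay
on the fast path. The other half of the condition, decomposing whatever
is read while a contraction match is already in progress, is not shown.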

Received on Fri May 18 2012 - 19:17:36 CDT
