Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Fri, 18 May 2012 09:51:34 -0700

Back to first principles.

UCA conformance requires getting the same results as the Main Algorithm.
This can be done easily with NFD input text, or by implementing Step 1
which normalizes the input to NFD. Everything else is a performance
optimization, and there are trade-offs.

We also want collation to be fast, at least for most normal input.

One of the main performance optimizations is to skip the normalization step
but still get the correct results for most input.

We used to think, and write, that as long as input strings pass the FCD test,
we will get the correct results. However, at least for ICU, our User Guide
already documents an additional limitation: if contractions overlap with
decomposition mappings, we do not get the correct results.
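
For readers who have not seen the FCD test spelled out, the following is a
rough Python sketch of it, assuming the usual definition: decompose each
character on its own and check that the concatenation would already be in
canonical order. The names is_fcd and prepare_for_collation are
hypothetical, and this is not ICU's implementation. As noted above, passing
this check is not by itself enough for correct results when contractions
overlap with decomposition mappings.

```python
import unicodedata


def is_fcd(text: str) -> bool:
    """Return True if decomposing each character in place would yield
    canonically ordered text (the FCD condition)."""
    prev_trail_ccc = 0
    for ch in text:
        nfd = unicodedata.normalize("NFD", ch)
        lead_ccc = unicodedata.combining(nfd[0])
        # A non-starter that sorts before the previous trailing mark
        # means canonical reordering would be needed.
        if lead_ccc != 0 and lead_ccc < prev_trail_ccc:
            return False
        prev_trail_ccc = unicodedata.combining(nfd[-1])
    return True


def prepare_for_collation(text: str) -> str:
    """Hypothetical fast path: skip normalization when the FCD check passes."""
    return text if is_fcd(text) else unicodedata.normalize("NFD", text)
```

With this sketch, a precomposed "\u00e1" passes unchanged, while
"\u00e1\u0323" (a-acute followed by a dot below) fails the check and is
normalized, because the dot below (ccc 220) must sort before the acute
(ccc 230).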

On inspection, we think we can do better (and want to), probably by adding
overlap contractions. If we get into trouble with that, we will think of
alternatives. One is to decompose more characters even in FCD input.
Another is to keep documenting a limitation *when normalization is off*.

As for the Tibetan composite vowels, I think it is entirely reasonable to
do the following:

   - When normalization is *on*, we want to get the right results. This
   will require changes.
      - I think by far the simplest is to always decompose the composite
      vowels, even in FCD input.
      - (Futzing with the ccc value for "blocked" feels like too much of a
      hack.)
   - When normalization is *off*, we make an effort to do the best we can.
   Many sequences with the composite vowels will still come out right, but we
   just document that we get the correct results for most but not all FCD
   strings.
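
The "always decompose" option can be sketched in a few lines of Python. This
assumes that "composite vowels" means U+0F73, U+0F75 and U+0F81, which have
ccc=0 themselves but decompose into marks with non-zero ccc. The names
TIBETAN_COMPOSITE_VOWELS, decompose_composite_vowels and ccc_profile are
invented for illustration; this is a sketch of the idea, not what ICU does.

```python
import unicodedata

# Assumed set of composite vowels and their canonical decompositions.
TIBETAN_COMPOSITE_VOWELS = {
    "\u0f73": "\u0f71\u0f72",  # VOWEL SIGN II          -> AA + I
    "\u0f75": "\u0f71\u0f74",  # VOWEL SIGN UU          -> AA + U
    "\u0f81": "\u0f71\u0f80",  # VOWEL SIGN REVERSED II -> AA + REVERSED I
}
_TABLE = str.maketrans(TIBETAN_COMPOSITE_VOWELS)


def decompose_composite_vowels(text: str) -> str:
    """Replace each composite vowel with its decomposition, in place.
    No reordering is done, so the input is assumed to be FCD."""
    return text.translate(_TABLE)


def ccc_profile(ch: str):
    """Return (ccc of ch, list of ccc values of its NFD form)."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(ch), [unicodedata.combining(c) for c in nfd]
```

For example, ccc_profile("\u0f73") shows a ccc of 0 for the composite itself
against non-zero values for its parts, which is the mismatch that makes these
characters awkward when normalization is skipped.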

There is nothing that requires us to get correct results *without
normalization* for all FCD strings or any other particular input conditions
(except NFD input). We just want to be fast and correct for the large
majority of inputs, without our users having to think much about it.
(Normalization is on by default for some languages.)

Best regards,
markus

-- 
Google Internationalization Engineering
Received on Fri May 18 2012 - 11:55:29 CDT
