Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 11 Feb 2013 02:45:27 +0100

2013/2/10 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
> Order is a problem when one has collating elements composed of multiple
> characters of different non-zero canonical combining classes. In
> practice this could be solved by adding more collating elements, but
> in theory the number of combinations to be considered could be
> unbounded. The UCA defines the interpretation in terms of the NFD
> form, and occasionally it is necessary to reduce strings to NFD form to
> determine this interpretation. Only having to consider primary weights
> can reduce this problem, but it does not always remove the problem.

It's a good point, but this does not break the UCA algorithm itself,
which includes a step at which external preprocessing is possible.
Converting to NFD helps reduce the number of cases, provided that it
does not strip distinctions the preprocessing relies on, i.e. provided
that the conversion to NFD is applied *after* the preprocessing step.
In that case, however, the preprocessing itself has more cases to
handle, and implementing it may be more complex than expected for
some languages.

The term "pathological" could apply to these cases, where a "naive"
implementation may in fact break expectations. How, then, can a
collator be a "conforming" process if it has to differentiate
canonically equivalent input strings?
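
To make the ordering issue concrete, here is a minimal Python sketch. It
is not a UCA implementation: raw code point lists stand in for real
collation keys, and naive_key and nfd_key are hypothetical helper names
chosen for illustration. It only assumes the standard unicodedata module.
The two strings carry the same two combining marks, with different
non-zero canonical combining classes, in opposite orders.

```python
import unicodedata

# Two canonically equivalent spellings of "q" with a dot above and a dot below.
# U+0307 COMBINING DOT ABOVE has combining class 230,
# U+0323 COMBINING DOT BELOW has combining class 220.
a = "q\u0307\u0323"  # dot above first, then dot below
b = "q\u0323\u0307"  # dot below first, then dot above


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def naive_key(s):
    """Stand-in for a naive collator: keys built from raw code points."""
    return [ord(c) for c in s]


def nfd_key(s):
    """Same stand-in key, but computed on the NFD form of the input."""
    return [ord(c) for c in nfd(s)]


# Combining classes of each character in `a`, in stored order.
classes = [unicodedata.combining(c) for c in a]

print(classes)                         # combining classes as stored
print(naive_key(a) == naive_key(b))    # naive keys distinguish the two
print(nfd_key(a) == nfd_key(b))        # NFD keys treat them as equal
```

In NFD both strings become q, U+0323, U+0307, since canonical reordering
sorts the marks by combining class. A collating element defined on the
sequence q, U+0307 is therefore no longer contiguous in the normalized
text, which is the kind of case the discussion above is about.
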
Received on Sun Feb 10 2013 - 19:47:22 CST
