Re: PRI#203: UTS#10 (UCA) update : characters needed to avoid contractions or expansions

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 31 Aug 2011 19:38:19 +0200

2011/8/31 Mark Davis ☕ <mark_at_macchiato.com>:
> It is not so easy.
> Let's suppose that we had a special character to suppress expansions. We'd
> still have to be able to specify what the alternate collation order should
> be. DUCET has the following for æ. Suppressing the expansion would require
> an alternative weighting. What would that be? Like Danish? After 'a'? That
> would require extra structure to have the alternate value.
> 00E6 ; [.15A3.0020.0004.00E6][.0000.015F.0004.00E6][.15FF.0020.001F.00E6] #
> LATIN SMALL LETTER AE; QQKN
> Moreover, it is not even clear that it is a good idea.
> If the collation rules are uniform, then I can expect (say) for Danish to
> always find XæY to sort after XzY in a long list. If the text could contain
> these special characters and change the ordering, then I'd have to look in
> two places, and know what the alternative was, and when it should be used,
> and that the author of the text inserted the right characters, etc.
> Moreover, expansions are not fundamentally different than other cases where
> characters sort differently in different languages.
> So in my view, this is a fringe feature, that would make the algorithms and
> data structures more complex (and thus slower and possibly less robust).
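
For readers not used to the allkeys notation, the quoted entry for æ
can be read mechanically. Below is a minimal sketch in Python, assuming
the four-field element layout shown in the quoted line; the function
name and the tuple layout are my own for illustration, not part of any
library.

```python
import re

# One collation element looks like [.pppp.ssss.tttt.qqqq]; a "*" in place
# of the first "." marks a variable element. All weights are hexadecimal.
CE_PATTERN = re.compile(
    r"\[([.*])([0-9A-F]{4,6})\.([0-9A-F]{4,6})\.([0-9A-F]{4,6})\.([0-9A-F]{4,6})\]"
)

def parse_allkeys_line(line):
    """Hypothetical helper: split one entry into code points and elements."""
    entry = line.split("#", 1)[0]            # drop the trailing comment
    chars, elements = entry.split(";", 1)
    code_points = tuple(int(cp, 16) for cp in chars.split())
    ces = [
        (flag == "*", int(p, 16), int(s, 16), int(t, 16))
        for flag, p, s, t, _fourth in CE_PATTERN.findall(elements)
    ]
    return code_points, ces

line = ("00E6 ; [.15A3.0020.0004.00E6][.0000.015F.0004.00E6]"
        "[.15FF.0020.001F.00E6] # LATIN SMALL LETTER AE; QQKN")
code_points, ces = parse_allkeys_line(line)
for variable, primary, secondary, tertiary in ces:
    print(variable, hex(primary), hex(secondary), hex(tertiary))
```

The single code point maps to three collation elements, which is the
expansion being discussed; the middle element has a zero primary weight.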

My main concern is not about how the DUCET would be built, but how a
tailoring can be built on top of the DUCET (or from the CLDR "root"
which is different, and will remain different in some documented
cases...) when it already contains these contractions.

Note the effect of the ordering of collation rules: a contraction
changes the behavior of expansions specified in a later tailoring
rule.
You can see a more precise illustration of this effect in UTS#35
(LDML), section 5.14.7 (Expansions), when using the "sequence
expansion syntax".

Of course you can avoid this effect by using the "normal expansion
syntax", which explicitly separates the base collation element from
its contextual "extension" (this syntax is more complex and often
completely misunderstood, in addition to requiring more maintenance).
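
The ordering effect can be reproduced with a toy model. The sketch
below is in Python and assumes made-up primary weights, a plain
longest-match lookup, and no secondary or tertiary levels; it is not
ICU or LDML code, and the rule strings in the comments are only
illustrative.

```python
def collation_elements(text, table):
    """Map text to primary weights, always taking the longest match."""
    out = []
    i = 0
    max_len = max(len(key) for key in table)
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.extend(table[chunk])
                i += n
                break
        else:
            raise KeyError(text[i])
    return out

# Made-up weights, for illustration only.
base = {"c": [30], "d": [40], "h": [80], "k": [110]}

# Case A: no contraction. A rule like "& ch <<< k" (sequence expansion)
# resolves "ch" through the table as it stands: c followed by h.
plain = dict(base)
plain["k"] = collation_elements("ch", plain)

# Case B: an earlier rule already made "ch" a contraction with its own
# weight. The same sequence expansion now resolves to the contraction.
contracted = dict(base)
contracted["ch"] = [85]
contracted["k"] = collation_elements("ch", contracted)

# Case C: a rule like "& c <<< k / h" (normal expansion) names the base
# and the extension separately, so the contraction is not picked up.
explicit = dict(base)
explicit["ch"] = [85]
explicit["k"] = explicit["c"] + explicit["h"]

print(plain["k"], contracted["k"], explicit["k"])
```

In case B the letter k silently follows the contraction, while in
case C it keeps sorting as c followed by h, whatever rules came before.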

This has already caused me a lot of trouble when trying to create a
working tailoring based on the DUCET (or now based on the CLDR root).

In fact I have other, more general problems with tailorings, for which
I think I'll design an alternate generic solution. This involves the
tailoring of "variable" collation elements and the Hiragana-Katakana
exception (the two cases are exactly similar: the goal is to change
the level at which some collation elements are differentiated), and
I'm thinking about unifying it with the tailoring of case differences
as well (if one wants them to count less than diacritic/variant
differences, and also wants to move these secondary differences to
tertiary differences).
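
To make "changing the level at which elements are differentiated"
concrete, here is a minimal sketch in Python. It assumes a toy table
with three weights per letter (base letter, accent, case) and made-up
values; the level_order parameter is hypothetical and does not
correspond to any existing ICU or LDML setting.

```python
# (primary, accent, case) weights per letter; values are made up.
WEIGHTS = {
    "r": (1, 0, 0), "R": (1, 0, 1),
    "o": (2, 0, 0), "ô": (2, 1, 0),
    "l": (3, 0, 0), "e": (4, 0, 0),
}

def sort_key(word, level_order=(0, 1, 2)):
    """Build a multi-level key; level_order says which weight counts first."""
    ces = [WEIGHTS[ch] for ch in word]
    return tuple(tuple(ce[level] for ce in ces) for level in level_order)

words = ["rôle", "Role", "role"]

# Accent differences outrank case differences.
accents_first = sorted(words, key=sort_key)

# Same weights, but the case difference is compared before the accent one.
case_first = sorted(words, key=lambda w: sort_key(w, (0, 2, 1)))

print(accents_first)
print(case_first)
```

The weights themselves do not change between the two runs; only the
level at which each kind of difference is compared does.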

These are kinds of tailoring that are currently impossible to
perform with ICU (or with the current LDML specification,
independently of the XML or abbreviated syntax). I'm not much
concerned by the XML syntax; I intend to use the abbreviated syntax
exclusively (with some generic additions), simply because it
requires much less maintenance and is much more readable (the XML
syntax is best generated on the fly from the abbreviated syntax by a
simple bot, but most UCA implementations will use another, more
compact form based on lookup tables).

-- Philippe.
Received on Wed Aug 31 2011 - 12:40:21 CDT
