Re: PRI#203: UTS#10 (UCA) update : characters needed to avoid contractions or expansions

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Wed, 31 Aug 2011 10:16:36 -0700

It is not so easy.

Let's suppose that we had a special character to suppress expansions. We'd
still have to be able to specify what the alternate collation order should
be. DUCET has the entry below for æ, which expands to three collation
elements. Suppressing the expansion would require an alternative weighting.
What would that be? As in Danish? Right after 'a'? Either way, that would
require extra structure to hold the alternate value.

00E6 ; [.15A3.0020.0004.00E6][.0000.015F.0004.00E6][.15FF.0020.001F.00E6] # LATIN SMALL LETTER AE; QQKN
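
To make the structure of that entry concrete, here is a minimal Python sketch that splits it into its collation elements. The function name parse_ducet_line and the dictionary layout are my own, for illustration only; the sketch assumes the four-weight allkeys.txt line format shown above and is not taken from any UCA implementation.

```python
import re

# One collation element: "[" + "." or "*" + dot-separated hex weights + "]".
# "*" marks a variable element, "." a non-variable one.
CE_PATTERN = re.compile(r"\[([.*])([0-9A-F]{4,6}(?:\.[0-9A-F]{4,6})*)\]")

def parse_ducet_line(line):
    """Parse one allkeys.txt-style line into (code points, collation elements).

    Hypothetical helper: only handles the simple layout of the entry above.
    """
    data = line.split("#", 1)[0]              # drop the trailing comment
    code_points, elements = data.split(";", 1)
    cps = [int(cp, 16) for cp in code_points.split()]
    ces = []
    for marker, weights in CE_PATTERN.findall(elements):
        ces.append({
            "variable": marker == "*",
            "weights": tuple(int(w, 16) for w in weights.split(".")),
        })
    return cps, ces

line = ("00E6 ; [.15A3.0020.0004.00E6][.0000.015F.0004.00E6]"
        "[.15FF.0020.001F.00E6] # LATIN SMALL LETTER AE; QQKN")
cps, ces = parse_ducet_line(line)
for ce in ces:
    print(["%04X" % w for w in ce["weights"]])
```

The single code point maps to three elements: one with primary weight 15A3, one with a zero primary (so it only matters at the secondary level), and one with primary 15FF. Suppressing the expansion means replacing all three with some other weighting, which the entry itself has no room for.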

Moreover, it is not even clear that it is a good idea.

If the collation rules are uniform, then for Danish (say) I can always
expect XæY to sort after XzY in a long list. If the text could contain
these special characters that change the ordering, then I'd have to look in
two places, and know what the alternative was, when it should be used,
and whether the author of the text inserted the right characters, etc.
Moreover, expansions are not fundamentally different from other cases where
characters sort differently in different languages.
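
A toy illustration of the two orderings, as a hedged Python sketch: the weights are invented (letters are simply numbered 1 to 26), only the primary level is modelled, and the names key_expanding and key_danish_like are hypothetical. It is not DUCET or CLDR data, just enough to show where XæY lands under each rule.

```python
# Invented primary weights: a=1 ... z=26. Not real DUCET values.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BASE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

def key_expanding(s):
    """Primary key where 'æ' expands to the weights of 'a' then 'e'."""
    out = []
    for ch in s.lower():
        if ch == "æ":
            out += [BASE["a"], BASE["e"]]
        else:
            out.append(BASE[ch])
    return out

def key_danish_like(s):
    """Primary key where 'æ' is a single letter placed after 'z'."""
    weights = dict(BASE)
    weights["æ"] = BASE["z"] + 1
    return [weights[ch] for ch in s.lower()]

words = ["XzY", "XæY", "XafY", "XadY"]
expanding_order = sorted(words, key=key_expanding)
danish_order = sorted(words, key=key_danish_like)
print(expanding_order)  # ['XadY', 'XæY', 'XafY', 'XzY']
print(danish_order)     # ['XadY', 'XafY', 'XzY', 'XæY']
```

With one rule per collator, a reader knows which of the two lists to expect. A per-occurrence override in the text would let XæY appear in either position within the same list.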

So in my view, this is a fringe feature that would make the algorithms and
data structures more complex (and thus slower and possibly less robust).

Mark
*— Il meglio è l’inimico del bene —*

On Wed, Aug 31, 2011 at 10:03, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2011/8/31 Mark Davis ☕ <mark_at_macchiato.com>:
> >> Another interesting question is: how can we encode in texts the fact
> >> that a character usually considered as a ligature in a language (that
> >> collates it as separate letters, even if the ligature is orthographic
> >> and not just typographic), should still be collated as only one letter
> >> ? In other words, are there some controls (or variant selection, or
> >> other means) which would have the effect of disabling the default
> >> expansions performed in a correctly tailored collation (for example,
> >> in a French collator is there a way to disable the expansion of
> >> occurrences of "æ" into "ae")?
> >
> > The CLDR tailoring syntax allows DUCET expansions to be suppressed or
> > changed for a particular locale.
> > There is no mechanism in UCA to change expansions on a code-point basis.
> > E.g., in the same string "Cæsium Kværner", to have the first æ expand to
> > 'ae' but the second sort after 'Z', as in Norwegian.
>
> You just provided the perfect example: we still have no way to specify
> that one of the 'æ' occurrences should not be expanded, and the other
> one should be.
>
> I would expect an orthographic convention, such as adding an invisible
> control after one of the occurrences to change the default behavior
> *locally*, so that it could be detected by a UCA tailoring (using the
> rule of longer match). But which kind of invisible control? Maybe the
> occurrence that should expand could be encoded as (a,ZWJ,e), and in
> that case there is no more expansion for this substring, but just an
> ignorable character in the middle...
>
> -- Philippe.
>
Received on Wed Aug 31 2011 - 12:17:49 CDT