Re: UCA and Russian letter

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 23 Dec 2012 17:37:31 +0100

My opinion is that BOTH the UCA algorithm AND the LDML formal description of
collations are just "best known practices" to accommodate collation
(i.e. dictionary ordering AND string searches AND string comparisons).

But neither of them can accommodate all possible orderings or weak
comparison systems for all languages. Notably, they cannot directly
accommodate the contextual mutation of initial or medial letters in words.

But true linguistic dictionaries order their entries by grouping together,
in a single entry or in successive entries, all variations of a word,
including these initial mutations and derivations (such as conjugated
verbs, grammatical declensions, plurals, or genders).

The UCA still works provided that there is a prior preprocessing step that
infers (or looks up, for exceptions) another form of the words (or of
numbers), on which the multilevel algorithm that generates weights can
then work.
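
A minimal sketch of such a preprocessing step, in Python, could look like
the following. Everything in it is an assumption made for illustration: the
MUTATION_LOOKUP table (two Welsh-style soft mutations), the function names,
and the use of plain code point order as a stand-in for real UCA weights.
Neither UCA nor LDML defines any of this.

```python
import unicodedata

# Hypothetical exception table: mutated surface form -> dictionary headword.
# The two entries are illustrative Welsh-style soft mutations only.
MUTATION_LOOKUP = {
    "gath": "cath",  # "cath" (cat) after soft mutation
    "fara": "bara",  # "bara" (bread) after soft mutation
}

def preprocess(word):
    """Return the form that the weighting step should see."""
    normalized = unicodedata.normalize("NFD", word).casefold()
    return MUTATION_LOOKUP.get(normalized, normalized)

def sort_key(word):
    # Stand-in for UCA weights: code point order of the preprocessed form,
    # with the original spelling as a final tie-breaker.
    return (preprocess(word), word)

words = ["gath", "ci", "cath", "fara", "bara", "afal"]
print(sorted(words, key=sort_key))
# ['afal', 'bara', 'fara', 'cath', 'gath', 'ci']
```

The mutated forms end up next to their headwords, which a weight table
alone would not achieve without knowing which words are mutated.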

The UCA admits this, but LDML cannot describe these preprocessing rules
with rules that only assign weights to groups of characters,
independently of what the words could mean in the intended language. So the
LDML collation rules cannot be sufficient for all cases, and there must
also be another formal language for describing the preprocessing rules. For
now this does not exist, but nothing prevents it from appearing in the
future as additional data, and LDML could be extended to describe these
preprocessing transforms.

But some steps will remain: the initial normalization; the place where
letter case is assigned (or not) some higher collation weight; the place
where preprocessing can perform lookups using the simplified view produced
by the first steps, which do more than just a standard Unicode
normalization; and then the last steps, which output the weights level
by level.
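
As a rough illustration of that ordering of steps, here is a toy pipeline
in Python. It is not the UCA: the function name multilevel_key, the optional
lookup table, and the use of raw code points as weights are all assumptions
of this sketch, and the secondary level ignores the position of each accent.

```python
import unicodedata

def multilevel_key(text, lookup=None):
    # Step 1: initial normalization (canonical decomposition).
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: simplified view for lookups: combining marks and case removed.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    simplified = base.casefold()
    # Step 3: optional preprocessing lookup on the simplified view
    # (hypothetical table: simplified form -> form to be weighted).
    if lookup:
        simplified = lookup.get(simplified, simplified)
    # Step 4: output the weights level by level.
    primary = tuple(ord(ch) for ch in simplified)
    secondary = tuple(ord(ch) for ch in decomposed if unicodedata.combining(ch))
    tertiary = tuple(ch.isupper() for ch in base)
    return (primary, secondary, tertiary)

print(sorted(["Cote", "cote", "côte", "COTE"], key=multilevel_key))
# ['cote', 'Cote', 'COTE', 'côte']
```

Here case only matters after accents, because it is emitted as the last
level; assigning it a higher weight would mean moving it earlier in the
returned tuple.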

Collation is a very complex concept and it is not fully standardized for
interchange in LDML, the most notable gap being the preprocessing steps.

But note that even different dictionaries for the same language will vary
in how they perform this preprocessing. For example, a dictionary may
include and sort the derived terms separately, using a simpler rule that
does not require this preprocessing, so it will have MORE entries, even if
they link their actual definition to another entry. Most dictionaries do
not include separate entries for derivations like regular conjugations,
plurals or declensions, unless they are VERY irregular. And most
dictionaries for languages that have standard mutation rules for initial
letters will NOT list separate entries for these mutations: users know,
for example, that if they cannot find a word starting with one of these
mutable letters, they should look for the word starting with the unmutated
letter.

The same is true for languages that use agglutination: it is not possible
to list all possible agglutinations, and users need to know how to
recognize the morphemes. Here again, another thing is not described in
collation rules: the breaking rules that allow the separation of words or
morphemes. But collation may not work correctly without them, when
agglutination also implies mutations between two agglutinated morphemes.
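
A small sketch of what such a breaking step could do before weights are
computed, again in Python and again only under assumptions: the
STEM_ALLOMORPHS table, the greedy longest-match strategy and the function
name are invented for this example, and code point order stands in for
real collation weights. The data uses the Turkish stem "kitap" (book),
which softens to "kitab" before a vowel suffix.

```python
# Hypothetical allomorph table: surface stem -> canonical stem.
STEM_ALLOMORPHS = {
    "kitab": "kitap",  # softened form of "kitap" before a vowel suffix
    "kitap": "kitap",
    "kalem": "kalem",
}

def split_morphemes(word):
    """Greedy longest-match split into (canonical stem, remaining suffixes)."""
    for end in range(len(word), 0, -1):
        stem = STEM_ALLOMORPHS.get(word[:end])
        if stem is not None:
            return (stem, word[end:])
    return (word, "")  # unknown word: treat it as a single morpheme

words = ["kitaplar", "kitabı", "kalem", "kitap"]
print(sorted(words, key=split_morphemes))
# ['kalem', 'kitap', 'kitaplar', 'kitabı']
```

Without the split, "kitabı" would sort before "kitap" because of the
mutated consonant, even though both belong to the same stem.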

This is clearly a limitation of LDML, but not of UCA itself.

2012/12/23 Leo Broukhis <leob_at_mailcom.com>

> On Fri, Dec 21, 2012 at 1:49 PM, Whistler, Ken <ken.whistler_at_sap.com>
> wrote:
> > Leo Broukhis said:
> >
> >> Granted, not yet, but by itself the argument is invalid. Unicode
> >> collation rules are descriptive;
> >
> > I'm not sure what you mean by that. UTS #10 is a *specification* of an
> algorithm, with various options for tailoring and parameterization which
> make it possible to accommodate various needs for particular cases. It is
> not intended as a descriptive mechanism.
>
> What I meant is that the way its feature set which makes it flexible
> enough had been decided was descriptive, or, if you will, adaptive,
> following the pre-existing collation traditions in various languages
> and/or pre-existing standards.
>
> > Perhaps you are referring to LDML, which includes a formal mechanism for
> describing a particular collation in terms of the default table and
> tailoring options and parameterization options of the UCA.
>
> You're right, LDML may be a better application of the word "descriptive".
>
> >> if, for example, a language happens to sort accents backwards, this
> >> rule has to be - and is - accommodated despite its apparent
> >> illogicality;
> >
> > Backwards accent secondary weighting was actually included primarily
> because of prior art in collation standards, because of the need to be able
> to synchronize the UCA algorithm with ISO 14651, and because it makes it
> easier to explain how folks can implement versions of multi-level
> collation which can pass the conformance tests of the Canadian sorting
> standard, etc.
> >
> >> along the same lines, if a language happens to make a distinction
> >> discussed in this thread, it has to be accommodated just as well.
> >
> > No, I don't think so.
>
> My question can be construed as a hypothetical: had the described
> -collation been a prior art in collation standards by the time of
> development of LDML and the UCA, how different would they have been?
> I'm hoping for an answer "not at all" or "very little", and "here's
> how it could have been implemented: ...".
>
> > It is rather easy to come up with distinctions or collation requirements
> which simply cannot be accommodated within the intended bounds of the UCA.
> For example, sorting all numerical expressions mixed with text strictly by
> their numeric values, or sorting all (or some specified list) of
> abbreviations as if they were spelled out, and so forth.
>
> > Many lexicographical ordering rules cannot be fully accommodated within
> the context of the UCA algorithm, which is a multilevel *string comparison*
> specification, and not a dictionary ordering specification.
>
> That is true in general if rules happen to involve semantics, but
> we're discussing a formal rule here.
> Imagine that the backward accents feature was missing from LDML, e.g.
> because it was an emerging trend rather than a standard way of
> collation at the time of formalizing LDML, thus not included in LDML.
> Would you have said the same about it today if someone had asked about
> supporting it ? If not, why not?
>
> >>
> >> My question is as follows: does UCA have to be modified (e.g. by
> >> adding another bit flag "word-initial primary" next to the existing
> >> "backward secondary") to support the feature if it were to be
> >> implemented, or is there a way to achieve the "new Russian online
> >> collation" within the existing UCA without modifying the strings to
> >> be sorted before the application of the algorithm?
> >
> > I don't think there is any out-of-the-box way to use UCA so that an
> implementation would automatically recognize a word boundary context and
> weight characters conditionally based on that context. So no, I don't think
> you could get an implementation to do that without first marking up text
> with additional characters to indicate word boundaries and then tailoring
> the weight table to weight sequences including that markup accordingly.
>
> My question was narrower: assuming that the strings being compared are
> words, could it be supported without any markup?
> (NB that the "backward accents" feature is also, strictly speaking,
> word-based.)
>
> > But there is another possible sense of the question, "does UCA have to
> be modified... to support...", i.e. is the UTC somehow required to augment
> the algorithm to support some particular kind of behavior for a particular
> language's sorting rules, just because someone has turned up particular odd
> behavior. And I think the answer to that is clearly no. Oh, and by the way,
> I don't think LDML must (or should) be augmented to enable it to describe
> any and all lexicographical ordering practices, either. That isn't the
> function of LDML.
>
> In other words, after adoption, LDML became prescriptive in the sense
> "don't even think of inventing any sorting rules that cannot be
> described by LDML as it stands; we're not going to augment it". The
> Quebecois were very lucky, then.
>
> Leo
>
>
>
Received on Sun Dec 23 2012 - 10:40:06 CST
