RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jul 12 2010 - 06:28:11 CDT

  • Next message: John Burger: "Re: Bengali Script"

    "Kenneth Whistler" <kenw@sybase.com> wrote:

    > [ snipping all the word breaking discussion, which I am not going
    > to comment on ... ]
    >
    > CE Whitehead said:
    >
    > > I collate as follows (note that i' is equivalent to i with accent grave):
    > >
    > > (EXAMPLE 1 -- my sort)
    > > di Silva, Fred
    > > di Silva, John
    > > di Si'lva, Fred
    > > di Si'lva, John
    > > Disilva, Fred
    > > Disilva, John

    That's exactly how I would sort it in French, word by word, so all the
    names with the separated particles would come before. This will work
    only if level 2 backword reordering of collation weight is limited to
    word spans.

    Currently, the UCA algorithm completely ignores this, or does not
    include any place for introducing a word-breaker to limit the
    backwards reordering, so effectively, using the French collation with
    the current algorithm, it would first group them on the primary level
    (considering only word separation as space, and ignoring case) as:

    * Group 1 (DI SILVA FRED)
    > di Si'lva, Fred
    > di Silva, Fred
    * Group 2 (DI SILVA JOHN)
    > di Silva, John
    > di Si'lva, John
    * Group 3 (DISILVA FRED)
    > Disilva, Fred
    * Group 4 (DISILVA JOHN)
    > Disilva, John

    Then the secundary level would split and reorder these groups using
    the accent differences:

    * Group 1 (DI SILVA FRED)
      * di Silva, Fred
      * di Si'lva, Fred
    * Group 2 (DI SILVA JOHN)
      * di Silva, John
      * di Si'lva, John
    * Group 3 (DISILVA FRED)
      * Disilva, Fred
    * Group 4 (DISILVA JOHN)
      * Disilva, John

    (The case differences at the ternary level, and ignorable differences
    sorted in binary order in the final level have no effect here).

    The case becomes even more tricky when sorting correctly the following
    list (note the difference of accents on the second 'e' in groups 1 and
    5, as the most significant secondary difference is still on the third
    'e', before considering the relative order of accents on the second
    'e' ; in this example, I only use lowercase because this only plays a
    role at the forwards ternary level):

    1.
       * délègue
       * délégué (the final 'é' sorts after the final 'e')
    2.
       * déléguée
    3.
       * déléguées
    4.
       * déléguer
    5.
       * délègues
       * délégués (the final 'é' sorts after the final 'e')
    6.
       * délèguent

    This is a case where the backwards differences plays a role in French,
    but this only occurs on a word-by-word base, just like in your example
    above.

    But if the same list of words is included at end of similar sentences
    (identical at the primary level) where there the first words would
    only differ by their accents, these first words should really take
    precédence to the relative order of the final words listed above.

    If we don't limit the backwards reordering, then all accents in the
    full sentences will be reordered, so this is the final word that will
    drive the order. not only this is incorrect, but having to fully
    reorder the secondary weights in the full sentence has a significant
    memory and performance impact, when we could just limit ourself to
    comparing the sentences word by word, starting from the beginning of
    each sentence.

    Such case will often occur when sorting article names (e.g. in
    Wikipédia), book and movie titles... And when there will be list of
    people names, where the first (given) name is placed after the last
    (birth) name, separated by a comma, or when sorting titles with an
    additional precision at end between parenthèses (such as a country
    name to disambiguate city names, e.g. in Wikipédia) :

    These final precisions (added in separate words) clearly have a low
    priority in the sort order, so the backwards reordering should really
    not place them with a higher priority in the secondary collation
    level. If we use a word breaker, not only the sort order will be
    correct, but we will also gain performance (due to shorter buffering
    of collation weights, we don't necessarily have to process all the
    strings completely, but just the beginning most of the time).

    So when you snipped completely the discussion about word-breaking
    within collation, you completely forgot the main point of my message:
    that the UCA algorithm does not include any entry points for insering
    a word breaker within the algorithm (or any custom breaker that a
    tailored collation would need, for example in a complex tailoring that
    would recognize space-separated particles in proper names, as if this
    space had a lower difference than a true word-separation)

    Yes using a breaker in UCA collation is clearly optional (and not
    absolutely needed if all levels are forewards in some language, as it
    will have no visible effect on the collation order except for
    increased performance), but it's absolutely essential if we use any
    backwards reordering at any level.

    The absence of the word-breaker makes the current UCA algorithm
    unusable for French in practice (and also very costly in terms of
    space, and unnecessarily slow), when sorting lists of items longer
    than a single word : it can only work in practice when sorting basic
    dictionary entries (for example it works extremely well when sorting
    French Wiktionary categories, but it fails miserably when sorting
    French Wikipedia categories, for which the expected backwards ordering
    of level 2 has to become forewards, even if this gives non-optimal
    sorts within some small groups of article titles).

    And that's why I request comments about the effect of the backwards
    ordering, the way it is currently specified.

    But also because it can significantly improve the performance of
    collation, even for English or Korean when comparing sentences (using
    a syntaxic/lexical/morphologic/syllabic breaker, you just need to
    compute the collation-strings for each word/syllable/morpheme
    separately, starting from the beginning, and the breaker can also be
    used to discard non significant parts of sentence, such as included
    parts between parentheses).

    This also generalizes the concept of sorts with compound keys (such as
    SQL selects with ORDER BY clauses), that implicitly use a column
    breaker within a result row.

    -- Philippe.



    This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 06:32:14 CDT