Re: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Mark Davis ☕ (mark@macchiato.com)
Date: Mon Jul 12 2010 - 10:31:20 CDT

  • Next message: John H. Jenkins: "Re: Status of Unihan"

    A few comments.

    A tailoring that sorts word-by-word would certainly be possible, and is
    certainly allowed by the UCA. As to whether it is necessary or not, that is
    another matter. Sorting is about matching user expectations, and of all of
    the French that I have ever asked, none except for internationalization
    mavens have ever even noticed or heard that French sorts accents backwards
    at all, let alone that it is word-by-word. And the number of times that
    occurs in actual practice is exceedingly small -- a reason for nobody to
    actually know about it!

    And when we are talking about multiple words, it is even more obscure.
    Remember that for it to make a difference whether done word-by-word or not,
    every single base letter in the two phrases has to be identical, *and* two
    words in different positions have to have accents on different base
    characters. In constructed examples, certainly possible; in real life, rare.

    (Frankly, sometimes I regret our adding French sorting to ICU at all. It
    complicates the code and slows down French sorting significantly, for
    vanishingly small value. However, if you or someone else wanted to add
    word-by-word sorting on top of ICU's implementation, I think the API may be
    sufficient to do that.)

    Mark

    [And, by the way, if you say anything in a reply that you care for me to
    read, please put it in the first couple of paragraphs; I rarely have the
    time to read beyond that in your emails.]

    — Il meglio è l’inimico del bene —

    On Mon, Jul 12, 2010 at 04:28, Philippe Verdy <verdy_p@wanadoo.fr> wrote:

    > "Kenneth Whistler" <kenw@sybase.com> wrote:
    >
    > > [ snipping all the word breaking discussion, which I am not going
    > > to comment on ... ]
    > >
    > > CE Whitehead said:
    > >
    > > > I collate as follows (note that i' is equivalent to i with accent
    > grave):
    > > >
    > > > (EXAMPLE 1 -- my sort)
    > > > di Silva, Fred
    > > > di Silva, John
    > > > di Si'lva, Fred
    > > > di Si'lva, John
    > > > Disilva, Fred
    > > > Disilva, John
    >
    > That's exactly how I would sort it in French, word by word, so all the
    > names with the separated particles would come before. This will work
    > only if level 2 backword reordering of collation weight is limited to
    > word spans.
    >
    > Currently, the UCA algorithm completely ignores this, or does not
    > include any place for introducing a word-breaker to limit the
    > backwards reordering, so effectively, using the French collation with
    > the current algorithm, it would first group them on the primary level
    > (considering only word separation as space, and ignoring case) as:
    >
    > * Group 1 (DI SILVA FRED)
    > > di Si'lva, Fred
    > > di Silva, Fred
    > * Group 2 (DI SILVA JOHN)
    > > di Silva, John
    > > di Si'lva, John
    > * Group 3 (DISILVA FRED)
    > > Disilva, Fred
    > * Group 4 (DISILVA JOHN)
    > > Disilva, John
    >
    > Then the secundary level would split and reorder these groups using
    > the accent differences:
    >
    > * Group 1 (DI SILVA FRED)
    > * di Silva, Fred
    > * di Si'lva, Fred
    > * Group 2 (DI SILVA JOHN)
    > * di Silva, John
    > * di Si'lva, John
    > * Group 3 (DISILVA FRED)
    > * Disilva, Fred
    > * Group 4 (DISILVA JOHN)
    > * Disilva, John
    >
    > (The case differences at the ternary level, and ignorable differences
    > sorted in binary order in the final level have no effect here).
    >
    > The case becomes even more tricky when sorting correctly the following
    > list (note the difference of accents on the second 'e' in groups 1 and
    > 5, as the most significant secondary difference is still on the third
    > 'e', before considering the relative order of accents on the second
    > 'e' ; in this example, I only use lowercase because this only plays a
    > role at the forwards ternary level):
    >
    > 1.
    > * délègue
    > * délégué (the final 'é' sorts after the final 'e')
    > 2.
    > * déléguée
    > 3.
    > * déléguées
    > 4.
    > * déléguer
    > 5.
    > * délègues
    > * délégués (the final 'é' sorts after the final 'e')
    > 6.
    > * délèguent
    >
    > This is a case where the backwards differences plays a role in French,
    > but this only occurs on a word-by-word base, just like in your example
    > above.
    >
    > But if the same list of words is included at end of similar sentences
    > (identical at the primary level) where there the first words would
    > only differ by their accents, these first words should really take
    > precédence to the relative order of the final words listed above.
    >
    > If we don't limit the backwards reordering, then all accents in the
    > full sentences will be reordered, so this is the final word that will
    > drive the order. not only this is incorrect, but having to fully
    > reorder the secondary weights in the full sentence has a significant
    > memory and performance impact, when we could just limit ourself to
    > comparing the sentences word by word, starting from the beginning of
    > each sentence.
    >
    > Such case will often occur when sorting article names (e.g. in
    > Wikipédia), book and movie titles... And when there will be list of
    > people names, where the first (given) name is placed after the last
    > (birth) name, separated by a comma, or when sorting titles with an
    > additional precision at end between parenthèses (such as a country
    > name to disambiguate city names, e.g. in Wikipédia) :
    >
    > These final precisions (added in separate words) clearly have a low
    > priority in the sort order, so the backwards reordering should really
    > not place them with a higher priority in the secondary collation
    > level. If we use a word breaker, not only the sort order will be
    > correct, but we will also gain performance (due to shorter buffering
    > of collation weights, we don't necessarily have to process all the
    > strings completely, but just the beginning most of the time).
    >
    > So when you snipped completely the discussion about word-breaking
    > within collation, you completely forgot the main point of my message:
    > that the UCA algorithm does not include any entry points for insering
    > a word breaker within the algorithm (or any custom breaker that a
    > tailored collation would need, for example in a complex tailoring that
    > would recognize space-separated particles in proper names, as if this
    > space had a lower difference than a true word-separation)
    >
    > Yes using a breaker in UCA collation is clearly optional (and not
    > absolutely needed if all levels are forewards in some language, as it
    > will have no visible effect on the collation order except for
    > increased performance), but it's absolutely essential if we use any
    > backwards reordering at any level.
    >
    > The absence of the word-breaker makes the current UCA algorithm
    > unusable for French in practice (and also very costly in terms of
    > space, and unnecessarily slow), when sorting lists of items longer
    > than a single word : it can only work in practice when sorting basic
    > dictionary entries (for example it works extremely well when sorting
    > French Wiktionary categories, but it fails miserably when sorting
    > French Wikipedia categories, for which the expected backwards ordering
    > of level 2 has to become forewards, even if this gives non-optimal
    > sorts within some small groups of article titles).
    >
    > And that's why I request comments about the effect of the backwards
    > ordering, the way it is currently specified.
    >
    > But also because it can significantly improve the performance of
    > collation, even for English or Korean when comparing sentences (using
    > a syntaxic/lexical/morphologic/syllabic breaker, you just need to
    > compute the collation-strings for each word/syllable/morpheme
    > separately, starting from the beginning, and the breaker can also be
    > used to discard non significant parts of sentence, such as included
    > parts between parentheses).
    >
    > This also generalizes the concept of sorts with compound keys (such as
    > SQL selects with ORDER BY clauses), that implicitly use a column
    > breaker within a result row.
    >
    > -- Philippe.
    >
    >
    >



    This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 10:33:16 CDT