Re: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jul 12 2010 - 16:50:27 CDT

  • Next message: Philippe Verdy: "RE: UTS#10 (collation) : French backwards level 2, and word-breakers."

    "Mark Davis ☕" <mark@macchiato.com>

    > A few comments.
    >
    > A tailoring that sorts word-by-word would certainly be possible, and is
    > certainly allowed by the UCA. As to whether it is necessary or not, that is
    > another matter. Sorting is about matching user expectations, and of all of
    > the French that I have ever asked, none except for internationalization
    > mavens have ever even noticed or heard that French sorts accents backwards
    > at all, let alone that it is word-by-word. And the number of times that
    > occurs in actual practice is exceedingly small -- a reason for nobody to
    > actually know about it!
    >
    > And when we are talking about multiple words, it is even more obscure.
    > Remember that for it to make a difference whether done word-by-word or not,
    > every single base letter in the two phrases has to be identical, *and* two
    > words in different positions have to have accents on different base
    > characters. In constructed examples, certainly possible; in real life, rare.

    And such case does occur in large lists (such as when sorting
    categories in Wikipedia as I said)

    > (Frankly, sometimes I regret our adding French sorting to ICU at all. It
    > complicates the code and slows down French sorting significantly, for
    > vanishingly small value. However, if you or someone else wanted to add
    > word-by-word sorting on top of ICU's implementation, I think the API may be
    > sufficient to do that.)

    A basic word-breaker using inly the space separator would marvelously
    improve the speed of French sorting even if backwards ordering occurs,
    just because it would significantly improve the data locality in the
    implementation and would considerably reduce the reallocations and use
    of large buffers.

    A word breaker anyway has a very tiny impact on performance, compared
    to the gain you can have by using smaller data buffers: most of the
    cost of UCA falls within the huge cost of normalizations and complex
    analysis of M-to-N mappings of collation elements, and in
    interpreting/locating the DUCET table entries.

    Remember that the longest word in standard French,
    "anticonstitutionnellement", has no accents and includes only 25
    letters creating, so the effect of backward ordering will be limited
    to a very small buffer for 25 collation weights, instead of possibly
    thousands or millions when sorting large texts (having to sort lists
    of phrases that are longer than 25 characters is definitely not an
    exceptional situation).

    With a word breaker, you can significantly improve the speed of almost
    all the steps of the UCA algorithm (including normalizations and
    looking for M-to-N collation elements), just because you don't have to
    perform at once with the entire text : you can to that on the fly, and
    can stop processing the end of the text very rapidly in almost all
    practical cases (including in English).

    Also remember that collation strings will likely never be stored
    entirely for long texts. They will be truncated to a reasonnable
    maximum size, and if there's still an equality between a few adjascent
    list entries remaining in equality groups, you can still compare them
    more precisely using a UCA compare (because the collation string will
    only contain the leading part of the primary weights), by computing
    the collation levels one by one and progressively.

    To be able to compute truncated collation strings, you just need to
    get sure that you won't break in the middle of a grapheme cluster. A
    basic word-breaker can ensure it very simply, if it just breaks on
    whitespaces or on a few punctuations, even before applying the complex
    normalizations.

    Having to sort very long multi-field texts is also extremely common in
    database queries using SQL's ORDER BY clause, which will benefit a lot
    of being able to pack several leading fields at once in the same
    collation string. Ordered SQL requests are used extremely frequently,
    and the need for performance is not superficial.

    This need for sorting multi-field texts (whose source is already
    broken into separate fields without even using a word breaker) and is
    in fact so frequent in so many applications, that it should be
    described in the standard algorithm : adding a standard entry for
    field separators in the DUCET is very worth the value

    And it would avoid a confusion made by various people, including on
    English Wikipedia when sorting tables where I demonstrated that the
    separator they used, SPACE+ZWNBSP <0020,FEFF>, was definitely broken :
    I insisted to demonstrate that SPACE+EXCLAMATION was a far better
    separator (in fact it is the lowest possible according to
    implementation constraints where controls are forbidden, and a single
    SPACE alone was still not enough in sort keys of very tables tables
    containing people names or toponyms) for multi-field entries occuring
    when sorting Wikipedia tables, even if only a level-1 collation was
    performed, and this was effectively changed after my request).

    Is it really to much to request the inclusion or reservation of a
    standard "empty" collation element for field separators in the DUCET,
    with the lowest primary weight possible, and even lower than the
    primary weight assigned to ignored characters ? (That's why I also
    suggested reserving the negative weight -1, because ignored collation
    elements get a 0 primary weight).

    At least this would effectively document that, when a field separator
    needs to be used used in collation strings or sort keys, it should
    have the lowest weight supported by the implementation and NOT the
    highest weight or a very high one (as it was incorrectly assumed in
    English Wikipedia), even if a full 4-level UCA implementation is not
    used.

    Philippe.



    This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 16:51:39 CDT