Re: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Kenneth Whistler (
Date: Mon Jul 12 2010 - 20:16:00 CDT

  • Next message: Tulasi: "Re: Bengali Script"

    Philippe Verdy said:

    > A basic word-breaker using inly the space separator would marvelously
    > improve the speed of French sorting even if backwards ordering occurs,
    > just because it would significantly improve the data locality in the
    > implementation and would considerably reduce the reallocations and use
    > of large buffers.

    I disagree. Any implementation of an incremental comparison based on
    UCA which is making significant use of reallocations and large buffers
    isn't optimized in the first place.

    An optimized incremental comparison doesn't consist of breaking
    out a chunk of string 1, computing a complete sort key for it,
    breaking out a chunk of string 2, computing a complete sort key for
    it, and then comparing the keys.

    An optimized incremental comparison walks down both string 1 and
    string 2, pulling up the relevant comparable collation elements
    and tracking state for primary, secondary, and tertiary weights
    as it goes, bailing with an answer at the first available point
    that the comparison becomes definitively non-equal. That can be
    done without *any* buffer allocation or reallocation, using
    relatively tiny buffers allocated on the stack, and definitely
    has better "data locality" than first trying to chunk strings at
    word boundaries.

    > A word breaker anyway has a very tiny impact on performance, compared
    > to the gain you can have by using smaller data buffers: most of the
    > cost of UCA falls within the huge cost of normalizations and complex
    > analysis of M-to-N mappings of collation elements, and in
    > interpreting/locating the DUCET table entries.

    That much I *do* agree with, however. If you have to normalize
    entire strings before starting comparison, that is a significant
    processing cost. The answer to that is architectural -- if you
    have to repeatedly normalize strings before comparing them,
    then store the normalized strings and compare *those* instead.
    Also, even in the case where dealing with "wild" data input that
    cannot be assumed to be pre-normalized, a test to check whether
    a string is in NFC is *much* quicker than actually doing the
    normalization, and for most data the answer is yes, anyway.

    As for the M-to-N mappings and the collation element lookup,
    that is unavoidable, however else you build the comparison, so
    you just do the best job to make the lookup of collation
    elements fast.

    But your argument that a word breaker "has a very tiny impact
    on performance" compared to normalization or collation element
    lookup is moot, if the addition of the word breaker isn't
    going to improve the incremental comparison, anyway -- which
    is my argument. And unless your word breaker really is
    a trivial one (i.e. break on spaces), it isn't correct to
    maintain that it is a "tiny impact" compared to normalization,
    for example.

    > Remember that the longest word in standard French,
    > "anticonstitutionnellement", has no accents and includes only 25
    > letters creating, so the effect of backward ordering will be limited
    > to a very small buffer for 25 collation weights, instead of possibly
    > thousands or millions when sorting large texts ...

    Or *zero* bytes for extra buffers, with a properly designed
    incremental comparison algorithm -- even if the strings to compare
    are megabytes long.

    [ The remainder of this email has been truncated, as the string
    was too long. ;-) ]


    This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 20:19:08 CDT