RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 12 2010 - 18:50:47 CDT

    Philippe Verdy wrote:

    > "Kenneth Whistler" <kenw@sybase.com> wrote:
    > > Huh? That is just preprocessing to delete portions of strings
    > > before calculating keys. If you want to do so, be my guest,
    > > but building in arbitrary rules of content suppression into
    > > the UCA algorithm itself is a non-starter.
    >
    > I have definitely not asked for adding such preprocessing.

    No, effectively you just did, by asking for special weighting for
    things like parentheticals at the end of strings.

    > My examples
    > just gave practical cases where the least significant parts of the
    > strings are almost always at the end of the text.

    And the point of that is what? If the most significant part of a
    string is at the front, that is the part that will get processed
    first by incremental comparison, anyway.

    > And
    > there's no reason that a backwards processing will consider them more
    > important than the beginning of text.

    And you are completely missing the point that in order to handle
    French accent backwards weighting correctly, you don't need to
    process the entire string backwards. You can do this correctly
    using an incremental comparison from the *front* (by your own
    assessment, more significant) part of the string, if you build
    the incremental comparison algorithm carefully.
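    To make the point concrete, here is a toy sketch (written for this reply;
    it is not the UCA algorithm itself, and the element() weights below are
    invented stand-ins for real collation elements). It walks both strings
    once, from the front: a primary difference decides immediately, and for
    the French secondary level it merely *remembers* the rightmost secondary
    difference seen so far, which is exactly what backwards-at-level-2
    ordering needs once the strings turn out to be primary-equal. For
    simplicity it assumes one collation element per character and equally
    long secondary sequences.

        from functools import cmp_to_key

        # Toy model: hypothetical (primary, secondary) weights, not real UCA data.
        ACCENTS = {"é": ("e", 2), "è": ("e", 3), "ê": ("e", 4),
                   "à": ("a", 2), "â": ("a", 3), "ô": ("o", 2)}

        def element(ch):
            """Return a (primary, secondary) weight pair for one character."""
            base, sec = ACCENTS.get(ch, (ch, 1))
            return (ord(base), sec)

        def french_compare(a, b):
            """Single forward pass. Primaries decide as soon as they differ;
            secondary differences are only remembered, and the last one
            recorded wins, which reproduces French backwards ordering for
            primary-equal strings. Neither string is rescanned or read
            from its end."""
            pending_secondary = 0
            for ca, cb in zip(a, b):
                pa, sa = element(ca)
                pb, sb = element(cb)
                if pa != pb:                      # primary difference: decided now
                    return -1 if pa < pb else 1
                if sa != sb:                      # note it, but keep going
                    pending_secondary = -1 if sa < sb else 1
            if len(a) != len(b):                  # shorter string sorts first at level 1
                return -1 if len(a) < len(b) else 1
            return pending_secondary              # rightmost secondary difference wins

        # The classic French ordering falls out: cote < côte < coté < côté
        print(sorted(["côté", "cote", "coté", "côte"], key=cmp_to_key(french_compare)))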

    > I have thought about this sufficiently: 2-3 years ago I wanted to fix
    > the definitely broken sorts in French Wiktionary categories (notably
    > the categories containing full lists of words per language, where
    > items were apparently missing just because they were sorted many pages
    > away).

    O.k. so somebody was goofing up on the sorting, and it is a good
    thing that you fixed it.

    > I also added that *storing* additional collation strings introduces a
    > significant cost. This cost cannot be ignored, so in practice only the
    > leading part of the sort keys / collation strings will effectively be
    > stored.

    Yes, but what does a decision about storing truncated sort keys have
    to do with the issue of French accent processing?

    > Wikipedia, for example, truncates ALL its sort keys used in categories
    > to a limit of about 64 bytes, even though page names can be up
    > to 255 UTF-8 bytes, so that not even all the primary differences will
    > be stored ...

    In which case, for those longer strings, you are going to have to
    do disambiguating compares to recover any secondary differences --
    and it doesn't matter whether you are talking about German accented
    characters or French accented characters.

    > Then, during the effective sort (using QuickSort or similar), you'll
    > end up comparing many pairs of strings multiple times in case of
    > equality, without depending on collation strings, and YES of course,
    > you'll process the string pairs level by level, and the case where
    > you'll find long strings sorting as equal at the primary level is not
    > exceptional, so this means that you'll have to reparse both strings a
    > second time from their beginning for the next level,

    A requirement that is forced by deciding to store truncated sort keys,
    *regardless* of how you are handling the secondary weights.

    > ... even if you
    > could have detected very early that there was a secondary difference
    > on the first word, without even fully processing the first level on
    > the two long strings.

    And my point continues to be (which you don't seem to understand)
    that a properly designed incremental comparison will do exactly
    that.
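    In terms of the toy french_compare() above (still just an illustration,
    not a claim about any particular implementation): an accent difference in
    the first word of two long, otherwise identical strings is recorded
    during the same forward pass that verifies the primaries, so once the
    strings turn out to be primary-equal there is nothing left to reparse.

        # Two long strings differing only by an accent in the first word.
        a = "côte " + "x" * 1000
        b = "cote " + "x" * 1000

        # One forward pass per string; the secondary difference was noted on
        # the way through, so no second scan from the beginning is needed.
        assert french_compare(a, b) == 1     # "côte ..." sorts after "cote ..."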

    > I've run experiments, and performance counting (counting the
    > number of pairs compared, the number of times they need to be
    > reread from the backing store, and cumulating their relative distance)
    > clearly shows that comparing word by word (all levels for each word
    > performed at once before going to the next word) gives a significant
    > and measurable advantage due to data locality, but also because you'll
    > effectively scan shorter sequences to detect a secondary difference in
    > the first words (such as accents) or tertiary differences (such as
    > lettercase).

    And all those calculations apparently depend on mistaken assumptions
    about how incremental multi-level collation comparisons can be
    constructed.

    > And it gives much more consistent results once you are
    > using truncated collation strings (and only those truncated strings,
    > without additional compares of pairs in further sort passes).

    Truncated stored sort keys?

    Or truncated stored strings?

    If you truncate the sort keys, then you access the original strings
    and compare them (incrementally) to do the tie-breaking.

    If you truncate the stored strings, then you are out of luck if they
    compare equal, because you've truncated off the parts that might
    distinguish them.
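    As a sketch of the first case (hypothetical helper names, reusing the toy
    element() and french_compare() above): the stored key is cut at some byte
    budget, equality of two truncated keys is treated as "undecided within
    the stored bytes", and the tie is broken by re-comparing the original
    strings. Had the original *strings* been truncated instead, this fallback
    would have nothing left to look at.

        KEY_LIMIT = 64   # bytes actually stored, as in the Wikipedia case quoted above

        def toy_sort_key(s):
            """Hypothetical multi-level key: all primaries, a separator, then
            all secondaries (a stand-in for a real key builder)."""
            primaries = bytes(element(c)[0] & 0xFF for c in s)
            secondaries = bytes(element(c)[1] for c in s)
            return primaries + b"\x00" + secondaries

        def make_record(s):
            return (toy_sort_key(s)[:KEY_LIMIT], s)   # truncated key + original string

        def compare_records(rec_a, rec_b):
            """Fast path on the truncated keys; a tie only means the stored
            bytes could not decide, so fall back to an incremental compare
            of the original strings."""
            key_a, text_a = rec_a
            key_b, text_b = rec_b
            if key_a != key_b:
                return -1 if key_a < key_b else 1
            return french_compare(text_a, text_b)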

    > The case where strings compare equal on the first level is in
    > fact EXTREMELY frequent in SQL requests performing master-detail
    > joins, ordered by a master key then by detail keys, or when computing
    > aggregates (such as SELECT SUM()... GROUP BY master key, or using a
    > clause like HAVING column1 = aggregate(column2)). This occurs in almost
    > all statistical/financial reports.

    Yes. But are you claiming that has some relevance to your
    Wikipedia case? If the structure of topics in Wikipedia is such
    that too many of the truncated stored keys are comparing equal,
    then somebody at Wikipedia is making the wrong tradeoffs between
    structuring topic strings and truncating the stored keys for them.

    > And given that you work for Sybase,
    > you should already know that sorting/grouping is the most costly
    > operation performed by the SQL engine, where all optimisations are
    > expected: multi-field collation/sorting is also a very basic operation
    > found in nearly all requests (including inserts or unsorted/ungrouped
    > selects using unique indexes or primary/foreign key indexes, because
    > indexes always have to maintain their collation in an accurate but fast
    > way).

    Yes. Hence the need for highly optimized incremental string compares
    which otherwise fit the optimization strategies needed for data joins.

    You certainly don't get that by requiring a processing step that
    scans the entire comparand string from the *end* in order to handle
    accents correctly. Nor would it be helped, IMO, by putting word
    boundary processing into the string compares: that would only
    introduce further complication, and all kinds of ambiguity about the
    edge cases where word boundaries might or might not line up with
    fields for the joins.

    The engineering requirement is for the fastest possible comparison
    of two strings A and B, to determine whether they are less than,
    equal to, or greater than, and additionally, in the case of
    A less than B, whether A is a *prefix* of B (a fact which is used
    in optimization strategies to deal with your cases like
    statistical/financial reports, with long identical prefixes on
    compared strings). In my opinion, that can be done in the context
    of UCA as currently defined -- even accounting for French accents
    for a French tailoring -- quite efficiently, and would not be
    improved by putting in place word parsing of strings as part of
    the collation.
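    Pictured as code (again only a toy, reusing the element() weights above,
    and not a proposal for any particular API), that contract is a single
    call that returns both facts from one forward pass:

        def compare_with_prefix(a, b):
            """Return (order, a_is_prefix_of_b). 'order' is <0, 0 or >0 as in
            a normal three-way compare; the flag reports that A is a strict
            prefix of B at the collation-element level, which the caller can
            exploit when consecutive comparands share long identical prefixes."""
            pending_secondary = 0
            overlap_equal = True                  # no difference at any level so far
            for ca, cb in zip(a, b):
                pa, sa = element(ca)
                pb, sb = element(cb)
                if pa != pb:
                    return (-1 if pa < pb else 1), False
                if sa != sb:
                    overlap_equal = False
                    pending_secondary = -1 if sa < sb else 1
            if len(a) != len(b):
                a_is_shorter = len(a) < len(b)
                # The shorter string sorts first; it is a prefix of the longer
                # one only if the overlapping part matched at every level.
                return (-1 if a_is_shorter else 1), (a_is_shorter and overlap_equal)
            return pending_secondary, False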

    --Ken
