RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jul 12 2010 - 17:31:00 CDT


    "Kenneth Whistler" <kenw@sybase.com> wrote:
    > Huh? That is just preprocessing to delete portions of strings
    > before calculating keys. If you want to do so, be my guest,
    > but building in arbitrary rules of content suppression into
    > the UCA algorithm itself is a non-starter.

    I have definitely not asked for such preprocessing. My point was to
    give practical examples where the least significant parts of the
    strings are almost always at the end of the text, and there is no
    reason why a backwards pass should consider them more important than
    the beginning of the text.

    I thought about this at length 2-3 years ago, when I wanted to fix
    the badly broken sorts in French Wiktionary categories (notably the
    categories containing full lists of words per language, where items
    appeared to be missing simply because they had been sorted many
    pages away).

    I also noted that *storing* additional collation strings introduces a
    significant cost. This cost cannot be ignored, so in practice only
    the leading part of the sort keys / collation strings will
    effectively be stored.

    Wikipedia, for example, truncates ALL the sort keys used in its
    categories to a limit of about 64 bytes, even though page names can
    be up to 255 UTF-8 bytes; as a result, NOT even ALL the primary
    differences will be stored. (The truncated part is replaced by an
    arbitrary unique numeric id, used only to ensure sort stability and
    to avoid bugs when navigating heavily populated categories, even if
    this means that small groups of entries will not be in the optimal
    order.) The linguistic sort is then effectively partial, but this is
    a practical situation that does occur when sorting long lists.
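    That truncation scheme can be sketched as follows. This is only a toy
    illustration, not MediaWiki's actual code: the 64-byte budget matches
    the figure quoted above, but the key layout and the `stored_key`
    helper are invented for the example.

```python
# Sketch of the truncation scheme described above: the stored sort key
# is cut at a fixed byte budget, and a unique numeric id is appended
# purely as a stability tiebreaker (invented layout, for illustration).
MAX_KEY_BYTES = 64  # the approximate budget mentioned above

def stored_key(full_sort_key: bytes, page_id: int) -> tuple:
    """Truncated collation key plus a unique id for stable ordering."""
    return (full_sort_key[:MAX_KEY_BYTES], page_id)

# Keys that differ within the budget still sort correctly; keys equal up
# to the budget fall back to the arbitrary id (stable, but not
# necessarily the optimal linguistic order).
entries = [
    (stored_key(b"apple" + b"\x00" * 70, 2), "page A"),
    (stored_key(b"apple" + b"\x00" * 70, 1), "page B"),
    (stored_key(b"banana", 3), "page C"),
]
entries.sort()
# The two "apple" keys tie on the truncated key and are ordered by id.
```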

    Then, during the effective sort (using QuickSort or similar), you
    will end up comparing many pairs of strings multiple times in case of
    equality, without relying on stored collation strings. And yes, of
    course, you will process the string pairs level by level; the case
    where two long strings compare equal at the primary level is not
    exceptional. This means you will have to reparse both strings from
    their beginning for the next level, even though you could have
    detected very early that there was a secondary difference on the
    first word, without even fully processing the first level of the two
    long strings.
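    For illustration, here is a toy sketch of that whole-string,
    level-by-level comparison. The weights are invented (base letters at
    the primary level, combining accents at the secondary level), not
    real UCA tables; the sketch also counts how many characters are
    scanned before a decision, to make the rescanning cost visible.

```python
# Toy sketch of whole-string, level-by-level comparison (invented
# weights, not real UCA tables). When the primary pass over the full
# strings ends in equality, both strings are rescanned from the start
# for the secondary pass, even if an accent differed on the first word.
import unicodedata

def weights(s: str, level: int) -> list:
    out = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            if level == 2:
                out.append(ord(ch))       # secondary weight: the accent
        else:
            out.append(ord(ch) if level == 1 else 0)
    return out

def compare_by_levels(a: str, b: str) -> tuple:
    """Return (comparison sign, characters scanned over all passes)."""
    scanned = 0
    for level in (1, 2):
        scanned += len(a) + len(b)        # a full rescan per level
        wa, wb = weights(a, level), weights(b, level)
        if wa != wb:
            return ((wa > wb) - (wa < wb), scanned)
    return (0, scanned)

# "été" vs "ete" tie at the primary level, so both strings are scanned
# twice in full before the secondary difference is found.
```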

    I have run experiments, and performance counting (counting the number
    of pairs compared and the number of times they need to be reread from
    the backing store, and accumulating their relative distance) clearly
    shows that comparing word by word (all levels for each word processed
    at once before moving to the next word) gives a significant and
    measurable advantage due to data locality, but also because you
    effectively scan shorter sequences to detect a secondary difference
    (such as accents) or a tertiary difference (such as letter case) in
    the first words. It also gives much more consistent results once you
    are using truncated collation strings (and only those truncated
    strings, without additional pair comparisons in extra sort passes).
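    A matching toy sketch of that word-by-word strategy, with the same
    kind of invented weights standing in for real UCA tables (and naive
    whitespace word-breaking), shows how a secondary difference in the
    first word stops the scan immediately:

```python
# Toy sketch of word-by-word comparison: all levels of one word are
# compared before moving on to the next word, so an accent difference in
# the first word is decisive without scanning the rest of either string.
# Weights are invented (base letters = primary, accents = secondary);
# words are split naively on whitespace.
import unicodedata

def word_weights(word: str) -> tuple:
    """(primary, secondary) weight lists for a single word."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        (accents if unicodedata.combining(ch) else base).append(ord(ch))
    return (base, accents)

def compare_word_by_word(a: str, b: str) -> tuple:
    """Return (comparison sign, characters scanned before a decision)."""
    scanned = 0
    for wa, wb in zip(a.split(), b.split()):
        scanned += len(wa) + len(wb)
        ka, kb = word_weights(wa), word_weights(wb)
        if ka != kb:                  # decided on this word: stop early
            return ((ka > kb) - (ka < kb), scanned)
    la, lb = len(a.split()), len(b.split())
    return ((la > lb) - (la < lb), scanned)

# "été très long" vs "ete très long": decided after scanning only the
# first word pair, instead of two full passes over both strings.
```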

    The cases where strings compare equal on the first level are in fact
    EXTREMELY frequent in SQL requests performing master-detail joins,
    ordered by a master key and then by detail keys, or when computing
    aggregates (such as SELECT SUM()... GROUP BY master key, or a clause
    like HAVING column1 = aggregate(column2)). This occurs in almost all
    statistical/financial reports. And given that you work for Sybase,
    you should already know that sorting/grouping is the most costly
    operation performed by the SQL engine, where every optimisation is
    expected: multi-field collation/sorting is also a very basic
    operation found in nearly all requests (including inserts and
    unsorted/ungrouped selects using unique indexes or primary/foreign
    key indexes, because indexes must always maintain their collation in
    an accurate but fast way).
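    How common those first-level ties are can be seen in a toy
    master-detail example (the table contents and column roles are
    invented for the illustration):

```python
# Toy illustration of why primary ties are frequent in master-detail
# sorts: within one master group, every pair comparison is equal on the
# master key and is decided only by the detail key.
from itertools import combinations

rows = [("acme", d) for d in ("west", "east", "north")] + \
       [("zenith", d) for d in ("south", "west")]

pairs = list(combinations(rows, 2))
ties = sum(1 for (m1, _), (m2, _) in pairs if m1 == m2)
# 4 of the 10 possible pairs tie on the master key and fall through to
# the detail key; larger groups make the ratio even higher.
```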



    This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 17:34:26 CDT