RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 12 2010 - 16:06:37 CDT


    Philippe Verdy said:

    > If we don't limit the backwards reordering, then all accents in the
    > full sentences will be reordered, so it is the final word that will
    > drive the order. Not only is this incorrect,

    I understand that you think that the ordering should be done
    word-by-word, with the French accent consideration being done
    backward only within the scope of the individual word comparisons, but...

    > ... but having to fully
    > reorder the secondary weights in the full sentence has a significant
    > memory and performance impact, when we could just limit ourselves to
    > comparing the sentences word by word, starting from the beginning of
    > each sentence.

    That claim is simply wrong.

    A. If you are *storing* keys, then the memory size of the resultant
    keys is no larger for a (well-designed) backwards French accent
    collation than for forward accent collation. And the comparison
    time for binary comparison of the fully formed keys is then also
    no different, on average.
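    The key-size point can be seen in a toy sketch (Python). The collation
    elements and weight values below are invented for illustration, not real
    DUCET data: reversing the secondary weights merely permutes bytes within
    the key, so the key is the same length either way.

```python
# Toy model of UCA collation elements: (primary, secondary) weight pairs.
# These weight values are invented for illustration, not real DUCET data.
ELEMENTS = {
    "c": (0x20, 0x20), "o": (0x40, 0x20), "ô": (0x40, 0x32),
    "t": (0x50, 0x20), "e": (0x30, 0x20), "é": (0x30, 0x32),
}

def sort_key(s, french=False):
    """Build a two-level sort key; `french` reverses the secondary level."""
    primaries = [ELEMENTS[c][0] for c in s]
    secondaries = [ELEMENTS[c][1] for c in s]
    if french:                  # backwards level 2: accent weights taken
        secondaries.reverse()   # from the end of the string first
    # Levels separated by a 0x00 byte, as in UCA sort keys.
    return bytes(primaries) + b"\x00" + bytes(secondaries)

words = ["côté", "cote", "coté", "côte"]
# Identical key length either way: reversal permutes weights, never adds any.
assert all(len(sort_key(w)) == len(sort_key(w, french=True)) for w in words)
# Only the order of accent-differing strings changes:
assert sorted(words, key=lambda w: sort_key(w, french=True)) == \
       ["cote", "côte", "coté", "côté"]
```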

    B. If you are doing *incremental* comparisons, then it would be
    foolish to construct entire keys and scan all the way to the
    end of long strings looking for accents to reverse, before
    making comparison decisions based on primary weights at the
    beginning of strings. Any well-designed incremental UCA string
    comparison should be checking primary values distinctly from
    secondary values *incrementally*, so that a difference, say,
    between "d" and "e" in the first character does not wait for
    determination of the presence or absence of an accent at the
    end of a string -- perhaps 250 positions down the string or more --
    before deciding the order relation between the two strings.

    Yes, in the incremental comparison case, there *is* an additional
    processing cost for backwards accents, but it tends to be
    significant only in the relatively unusual case of comparing
    strings which are all equal at the primary level and differ only
    by accents.
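    A sketch of such an incremental comparison, again using toy
    (primary, secondary) weights invented for illustration: the primary
    pass settles the order as early as it can, and the backwards
    secondary pass runs only when every primary matched.

```python
# Toy per-character (primary, secondary) weights; the values are
# invented for illustration, not real DUCET data.
W = {
    "c": (0x20, 0x20), "d": (0x25, 0x20), "e": (0x30, 0x20),
    "é": (0x30, 0x32), "o": (0x40, 0x20), "ô": (0x40, 0x32),
    "t": (0x50, 0x20), "x": (0x55, 0x20),
}

def compare(a, b):
    """Incremental UCA-style compare with French backwards secondaries."""
    # Level 1: walk primaries in lockstep. A difference here settles the
    # order at once -- accents further down the strings are never read.
    for ca, cb in zip(a, b):
        pa, pb = W[ca][0], W[cb][0]
        if pa != pb:
            return -1 if pa < pb else 1
    if len(a) != len(b):
        return -1 if len(a) < len(b) else 1
    # Level 2, reached only when all primaries are equal: the French rule
    # compares secondary (accent) weights from the *end* of the string.
    for ca, cb in zip(reversed(a), reversed(b)):
        sa, sb = W[ca][1], W[cb][1]
        if sa != sb:
            return -1 if sa < sb else 1
    return 0

# "d..." vs "e..." is decided at the first character, however long
# the strings are:
assert compare("d" + "x" * 250, "e" + "x" * 250) == -1
# The backwards rule matters only for all-primary-equal pairs:
assert compare("côte", "coté") == -1   # French order: côte < coté
```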

    And doing string compares word by word introduces its own level
    of complication into comparison operations, so it is not at
    all obvious that processing that way would improve performance
    over an incremental comparison without word boundaries.

    > Such cases will often occur when sorting article names (e.g. on
    > Wikipédia), book and movie titles... And when there are lists of
    > people's names, where the first (given) name is placed after the
    > last (birth) name, separated by a comma, or when sorting titles
    > with an additional precision at the end between parentheses (such
    > as a country name to disambiguate city names, e.g. on Wikipédia):

    Names of people in comma-delimited lists, possibly.

    But I challenge you to provide some real-world examples of
    your last case -- where you have some pair of city names
    that differ *only* by accent placement, disambiguated by
    a following parenthetical country (or region, or whatever, I
    don't care) name, where the country names *also* differ
    only by accent placement. And if you come up with such
    an example, explain how likely it is that anyone using such
    an entry, amongst thousands of others like it in a list, would
    notice that there was a backwards accent weighting
    issue in the order of the two entries.

    > These final precisions (added in separate words) clearly have a low
    > priority in the sort order, so the backwards reordering should really
    > not place them with a higher priority in the secondary collation
    > level.

    Theoretically, perhaps. But practically this will make no
    difference, I predict.

    > If we use a word breaker, not only will the sort order be
    > correct, but we will also gain performance (due to shorter buffering
    > of collation weights, we don't necessarily have to process all the
    > strings completely, but just the beginning most of the time).

    Not true. See the discussion above.

    > So when you snipped completely the discussion about word-breaking
    > within collation, you completely forgot the main point of my message:

    I will remind you that I was not responding to *you*, Philippe, when
    I responded to CE Whitehead. So I wasn't forgetting your main point
    at all -- just not responding to it.

    > that the UCA algorithm does not include any entry points for insering
    > a word breaker within the algorithm (or any custom breaker that a
    > tailored collation would need, for example in a complex tailoring that
    > would recognize space-separated particles in proper names, as if this
    > space had a lower difference than a true word-separation)

    Yes, we can take it as stipulated that the UCA algorithm
    "does not include any entry points for inser[t]ing
    a word breaker within the algorithm." That was by design,
    as the algorithm is designed for collation of strings --
    not for word-by-word collation of delimited, structured text.

    > Yes using a breaker in UCA collation is clearly optional

    Yes, anyone who wants to enhance UCA collation to implement
    collation that takes into account word-by-word behavior can
    clearly do so.

    > (and not
    > absolutely needed if all levels are forwards in some language, as it
    > will have no visible effect on the collation order except for
    > increased performance), but it's absolutely essential if we use any
    > backwards reordering at any level.

    I disagree with that assessment.

    > The absence of the word-breaker makes the current UCA algorithm
    > unusable for French in practice (and also very costly in terms of
    > space, and unnecessarily slow), when sorting lists of items longer
    > than a single word

    Figures, please, to back up that claim.

    > : it can only work in practice when sorting basic
    > dictionary entries (for example it works extremely well when sorting
    > French Wiktionary categories, but it fails miserably when sorting
    > French Wikipedia categories, for which the expected backwards ordering
    > of level 2 has to become forwards, even if this gives non-optimal
    > sorts within some small groups of article titles).

    Quantify "fails miserably", when the correctness issue you
    cite would apply only in comparisons of strings which
    otherwise have no primary differences at all.

    > And that's why I request comments about the effect of the backwards
    > ordering, the way it is currently specified.

    What comments are you requesting, exactly? If they are along the lines
    of what you have just claimed in this note, then I can predict
    you are going to find considerable disagreement.
     
    > But also because it can significantly improve the performance of
    > collation, even for English or Korean when comparing sentences (using
    > a syntactic/lexical/morphological/syllabic breaker, you just need to
    > compute the collation-strings for each word/syllable/morpheme
    > separately, starting from the beginning,

    Again, I don't know where you are getting these claims. If you
    are computing whole keys for storage, there are no significant
    differences.

    If you are doing *incremental* string comparison, such an algorithm
    *already* is optimized to make order determinations as early in
    a string as possible.

    If you force word-by-word comparison, you are likely to actually
    *slow down* string comparison, as compared to an optimized
    incremental string comparison, as you then need to parse off
    each unit (word or whatever, by whatever criterion), then
    compute a key for each pair, and compare the keys. That is
    significantly *more* processing than a well-optimized incremental
    string comparison. If you think otherwise, then I'd suggest
    you provide an actual algorithm for doing the kind of processing
    you envision here, and let people analyze the algorithm to see
    if it could actually out-perform incremental string comparisons
    for the same input.
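    For concreteness, the word-by-word scheme under discussion looks
    something like the following sketch (the per-word `key` function is
    a placeholder for any sort-key builder): every aligned pair of words
    costs a tokenization step plus full key construction, work that a
    single incremental pass over the strings avoids.

```python
def compare_word_by_word(a, b, key):
    # Hypothetical word-by-word comparison: split each string into
    # words, then build and compare a full sort key for each aligned
    # pair of words, stopping at the first differing pair.
    wa, wb = a.split(), b.split()
    for x, y in zip(wa, wb):
        kx, ky = key(x), key(y)
        if kx != ky:
            return -1 if kx < ky else 1
    # All shared word positions were equal; the shorter list sorts first.
    return (len(wa) > len(wb)) - (len(wa) < len(wb))

# With plain code point order as the per-word key:
assert compare_word_by_word("cote basse", "cote haute", str) == -1
```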

    I'm not denying that there are cases where word-by-word (or
    other kinds of unit) comparisons are actually desirable -- I
    just don't buy your argument that this is a route for improved
    comparison speed over UCA, even with French rules for backwards
    accent comparison.

    > and the breaker can also be
    > used to discard non significant parts of sentence, such as included
    > parts between parentheses).

    Huh? That is just preprocessing to delete portions of strings
    before calculating keys. If you want to do so, be my guest,
    but building in arbitrary rules of content suppression into
    the UCA algorithm itself is a non-starter.
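    Such suppression is easy enough to do outside the algorithm; a
    minimal sketch follows, with an invented example title (the regex is
    illustrative and handles only unnested parentheses):

```python
import re

def strip_parentheticals(s):
    # Drop " (…)" disambiguators before computing a sort key. This is
    # ordinary preprocessing of the input string; the collation
    # algorithm itself never sees the suppressed text.
    return re.sub(r"\s*\([^)]*\)", "", s)

assert strip_parentheticals("Valence (Espagne)") == "Valence"
```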

    --Ken



    This archive was generated by hypermail 2.1.5 : Mon Jul 12 2010 - 16:08:16 CDT