RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 07 2010 - 19:34:58 CDT

    [ snipping all the word breaking discussion, which I am not going
      to comment on ... ]
      
    CE Whitehead said:

    > I collate as follows (note that i' is equivalent to i with accent grave):
    >
    > (EXAMPLE 1 -- my sort)
    > di Silva, Fred
    > di Silva, John
    > di Si'lva, Fred
    > di Si'lva, John
    > Disilva, Fred
    > Disilva, John

    Which means that you prefer a field-by-field collation for names,
    rather than a merged field collation. But this collation departs
    from the default UCA ordering in other ways. To get these results
    you would have to be treating the presence of the accents as
    a tertiary difference (comparable to the casing differences),
    *and* have a special rule that orders strings with spaces
    ahead of strings with identical primary weighting but without
    spaces.
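    As a rough illustration of that kind of tailoring, here is a minimal
    Python sketch. It is not the UCA algorithm and uses no DUCET weights.
    The function name tailored_key, the placement of the "has a space" flag
    directly after the primary level, and the choice to compare case before
    accents inside the tertiary level are all assumptions made for the
    example.

```python
import unicodedata

def tailored_key(s):
    """Hypothetical key: letters, then 'contains a space' first, then case plus accents."""
    primary, tertiary = [], []
    has_space = False
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # Fold the accent into the tertiary weight of the preceding letter.
            if tertiary:
                case, marks = tertiary[-1]
                tertiary[-1] = (case, marks + (ord(ch),))
        elif ch.isspace():
            has_space = True  # remembered, but given no primary weight
        elif ch.isalnum():
            primary.append(ch.lower())
            tertiary.append((1 if ch.isupper() else 0, ()))
    # 0 sorts before 1, so strings that contain a space come first.
    return tuple(primary), 0 if has_space else 1, tuple(tertiary)

records = [
    ("Disilva", "John"), ("di S\u00eclva", "Fred"), ("di Silva", "John"),
    ("Disilva", "Fred"), ("di S\u00eclva", "John"), ("di Silva", "Fred"),
]

# Field-by-field: the whole last-name key is compared before the first name.
ordered = sorted(records, key=lambda r: (tailored_key(r[0]), tailored_key(r[1])))
for last, first in ordered:
    print(f"{last}, {first}")
```

    Under those assumptions the records come out in the order of
    EXAMPLE 1 above.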

    The default UCA ordering (with shifted variable weighting) for
    those particular records, if done on a field-by-field basis,
    would instead be:

    di Silva, Fred
    di Silva, John
    Disilva, Fred
    Disilva, John
    di Si'lva, Fred
    di Si'lva, John
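    To make the level-by-level reasoning concrete, here is a minimal
    Python sketch of a three-level key compared field by field. It is a
    simplification, not the UCA: the function name levels is hypothetical,
    primary weights are just lowercased letters, accents are the only
    secondary weights, case is the only tertiary weight, and anything that
    is not a letter or digit is ignored at all three levels (a crude
    stand-in for shifted variable weighting, with no fourth level).

```python
import unicodedata

def levels(s):
    """Simplified (primary, secondary, tertiary) weights for one field."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # An accent is a secondary difference on the preceding letter.
            if secondary:
                secondary[-1] += (ord(ch),)
        elif ch.isalnum():
            primary.append(ch.lower())
            secondary.append(())                      # no accent yet
            tertiary.append(1 if ch.isupper() else 0)  # lowercase first
        # Spaces and punctuation fall through: ignored at levels 1 to 3.
    return tuple(primary), tuple(secondary), tuple(tertiary)

records = [
    ("Disilva", "John"), ("di S\u00eclva", "Fred"), ("di Silva", "John"),
    ("Disilva", "Fred"), ("di S\u00eclva", "John"), ("di Silva", "Fred"),
]

# Field-by-field: every level of last_name is exhausted before
# first_name is consulted at all.
field_by_field = sorted(records, key=lambda r: (levels(r[0]), levels(r[1])))
for last, first in field_by_field:
    print(f"{last}, {first}")
```

    Because the accent outranks case, "di Sìlva" lands after "Disilva",
    which reproduces the list above.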

     
    > and [CE Whitehead does] not [prefer]:
    >
    > (EXAMPLE 2: sort from UTS #10 samples)
    > di Silva, Fred
    > di Si'lva, Fred
    > Disilva, Fred
    > di Silva, John
    > di Si'lva, John
    > Disilva, John

    which is not actually the example shown. If corrected to weight
    the accent as a secondary difference and to use merged columns,
    the default UCA ordering would be:

    di Silva, Fred
    Disilva, Fred
    di Si'lva, Fred
    di Silva, John
    Disilva, John
    di Si'lva, John
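    The merged-column behavior can be sketched the same way. In this
    minimal Python example the key is built level by level across both
    fields, so the primary weights of first_name are compared before any
    accent or case difference in last_name. The names levels and merged_key
    are hypothetical, the weights are the same simplified stand-ins as
    before rather than real UCA weights, and the empty-string field
    separator is an assumption chosen because it sorts below every letter.

```python
import unicodedata

def levels(s):
    """Simplified (primary, secondary, tertiary) weights for one field."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += (ord(ch),)
        elif ch.isalnum():
            primary.append(ch.lower())
            secondary.append(())
            tertiary.append(1 if ch.isupper() else 0)
    return tuple(primary), tuple(secondary), tuple(tertiary)

def merged_key(fields):
    """All primaries of all fields, then all secondaries, then all tertiaries."""
    parts = [levels(f) for f in fields]
    primary = []
    for i, part in enumerate(parts):
        if i:
            primary.append("")  # field separator, lower than any letter
        primary.extend(part[0])
    # Equal primaries imply equal field lengths here, so the lower levels
    # stay aligned without their own separators.
    secondary = tuple(w for part in parts for w in part[1])
    tertiary = tuple(w for part in parts for w in part[2])
    return tuple(primary), secondary, tertiary

records = [
    ("Disilva", "John"), ("di S\u00eclva", "Fred"), ("di Silva", "John"),
    ("Disilva", "Fred"), ("di S\u00eclva", "John"), ("di Silva", "Fred"),
]

merged = sorted(records, key=merged_key)
for last, first in merged:
    print(f"{last}, {first}")
```

    All the Freds now precede all the Johns, and the accent and case
    differences only break ties within each group.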

    But I suppose if you don't like merged column sorting, this
    correction would be equally shocking. ;-)

     
    > (As an 'aside' or unrelated note: I am kind of shocked by the
    > second ordering -- taken from the example in Table 6, section 1.6,
    > of UTS #10,

    Well, not "taken from", but rather adapted from.

    > because that is not how I sort; I suppose the example's purpose
    > is to show across-word-boundary collation,

    Not exactly. The point is to demonstrate multiple-field sorting
    principles for databases. These are not *word* boundaries in
    this case, but distinct columns in a database.

    If one is doing the equivalent of a select like the following:

    select last_name, first_name from clients
    where last_name = 'disilva'
    order by last_name, first_name

    one expects the first set of results (the field-by-field
    ordering) -- if, of course, a multilevel collation (with a
    strength set to ignore casing and accents for matching) is
    defined for the table involved.

    If one is instead doing the equivalent of a select like the
    following (assuming a concatenate operation is defined):

    select last_name, first_name from clients
    where last_name = 'disilva'
    order by concatenate(last_name, first_name)

    then one expects the second set of results (the merged-field
    ordering).
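    For anyone who wants to try the two statements, here is a minimal
    sketch using Python's sqlite3 module with user-defined collations.
    Everything specific in it is an assumption for illustration: the
    collation names sketch_full and sketch_primary, the simplified levels
    function standing in for real UCA weights, SQLite's || operator playing
    the role of concatenate, and the extra "Smith" row added to show the
    where clause filtering.

```python
import sqlite3
import unicodedata

def levels(s):
    """Simplified (primary, secondary, tertiary) weights; not real UCA."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += (ord(ch),)
        elif ch.isalnum():
            primary.append(ch.lower())
            secondary.append(())
            tertiary.append(1 if ch.isupper() else 0)
    return tuple(primary), tuple(secondary), tuple(tertiary)

def compare(a, b):
    """Three-way comparison in the form SQLite collations expect."""
    return (a > b) - (a < b)

conn = sqlite3.connect(":memory:")
# Full multilevel comparison, used for ordering.
conn.create_collation("sketch_full", lambda a, b: compare(levels(a), levels(b)))
# Primary strength only, used for matching (ignores case, accents, spaces).
conn.create_collation("sketch_primary",
                      lambda a, b: compare(levels(a)[0], levels(b)[0]))

conn.execute("CREATE TABLE clients (last_name TEXT, first_name TEXT)")
conn.executemany("INSERT INTO clients VALUES (?, ?)", [
    ("Disilva", "John"), ("di S\u00eclva", "Fred"), ("di Silva", "John"),
    ("Smith", "Anna"),
    ("Disilva", "Fred"), ("di S\u00eclva", "John"), ("di Silva", "Fred"),
])

by_field = conn.execute(
    "SELECT last_name, first_name FROM clients "
    "WHERE last_name = 'disilva' COLLATE sketch_primary "
    "ORDER BY last_name COLLATE sketch_full, first_name COLLATE sketch_full"
).fetchall()

merged = conn.execute(
    "SELECT last_name, first_name FROM clients "
    "WHERE last_name = 'disilva' COLLATE sketch_primary "
    "ORDER BY (last_name || ' ' || first_name) COLLATE sketch_full"
).fetchall()

print(by_field)
print(merged)
```

    The where clause is identical in both queries; only the order by
    clause decides whether the rows are grouped by last name or by
    first name.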

    This doesn't mean that one or the other is necessarily correct.
    They are different, and one might prefer one or the other
    for various reasons.

    > but I am still trying to get used to the example; I am
    > apparently what one would call in English a "narrow" person
    > when it comes to collation and sorting.

    You just may not be used to applications that would do the more
    complex operation suggested by that second select statement,
    rather than the easier-to-implement first select statement.

    As for me, if I were trying to find all the "Fred Disilva" records
    in my database, I would certainly prefer the second ordering
    over the first, as it would make the results more immediately
    usable for me.

    > I gather however that the second option is how search engines
    > collate as search engines may treat hyphens as being the same
    > as white space, and two-word and one-word variants of the otherwise
    > same string may be equated too -- just to get more matches in hopes
    > of getting the best one -- which is good because we make mistakes

    That is an entirely separate issue. Search engines tend to
    suppress space and punctuation in matching search *strings*.
    You are talking there about *matching* behavior, not *ordering*,
    and the question really has nothing to do with word boundaries,
    let alone distinct fields in a database.
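    The distinction can be shown with a small Python sketch of a
    matching key. The function name match_key and the sample names are
    hypothetical, and real search engines do considerably more than
    this. The point is only that a key which drops spaces, punctuation,
    case, and accents defines which strings are treated as equal, and
    says nothing about the order in which the matches are presented.

```python
import unicodedata

def match_key(s):
    """Hypothetical matching key: base letters and digits only, lowercased."""
    return "".join(
        ch.lower()
        for ch in unicodedata.normalize("NFD", s)
        if ch.isalnum()  # drops spaces, punctuation, and combining accents
    )

names = ["di Silva", "Di-Silva", "di S\u00eclva", "Disilva", "Dasilva"]

# Matching: which names are "the same" as the query string?
matches = [n for n in names if match_key(n) == match_key("disilva")]
print(matches)
```

    The matches come back in whatever order the input happened to be in;
    ordering them is a separate step that needs a collation.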

    > -- but I still cannot accept the sort in Table 6)

    To each his own, I suppose. :-)

    --Ken


