RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 07 2010 - 19:34:58 CDT

Next message: CE Whitehead: "RE: UTS#10 (collation) : French backwards level 2, and word-breakers."

Previous message: Asmus Freytag: "Re: charset parameter in Google Groups"
Maybe in reply to: Philippe Verdy: "UTS#10 (collation) : French backwards level 2, and word-breakers."
Next in thread: CE Whitehead: "RE: UTS#10 (collation) : French backwards level 2, and word-breakers."
Reply: CE Whitehead: "RE: UTS#10 (collation) : French backwards level 2, and word-breakers."

[ snipping all the word breaking discussion, which I am not going
to comment on ... ]

CE Whitehead said:

> I collate as follows (note that i' is equivalent to i with accent grave):
>
> (EXAMPLE 1 -- my sort)
> di Silva, Fred,
> di Silva, John
> di Si'lva, Fred
> di Si'lva, John
> Disilva, Fred
> Disilva, John

Which means that you prefer a field-by-field collation for names,
rather than a merged field collation. But this collation departs
from the default UCA ordering in other ways. To get these results
you would have to be treating the presence of the accents as
a tertiary difference (comparable to the casing differences),
*and* have a special rule that orders strings with spaces
ahead of strings with identical primary weighting but without
spaces.

The default UCA ordering (with shifted variable weighting) for
those particular records, if done on a field-by-field basis,
would instead be:

di Silva, Fred
di Silva, John
Disilva, Fred
Disilva, John
di Si'lva, Fred
di Si'lva, John
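
As a rough illustration of that field-by-field ordering, here is a
minimal Python sketch. It does not use the real UCA weight tables: the
levels() helper is a hypothetical stand-in that assumes spaces are
ignored at levels 1-3 (a crude model of "shifted"), accents count only
at level 2, and case counts only at level 3. The "i'" of the lists
above is written as the actual character i-grave.

```python
import unicodedata

def levels(text):
    """Toy (primary, secondary, tertiary) weights; NOT the real UCA tables."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if ch.isspace():
            continue                      # assumed "shifted": ignored at levels 1-3
        if unicodedata.combining(ch):
            secondary.append(ord(ch))     # an accent is a level-2 difference
            tertiary.append(0)
        else:
            primary.append(ch.lower())    # base letter, case removed
            secondary.append(0)
            tertiary.append(1 if ch.isupper() else 0)  # lowercase before uppercase
    return tuple(primary), tuple(secondary), tuple(tertiary)

def field_by_field_key(record):
    # Compare the whole last name (all levels) before looking at the first name.
    last, first = record
    return levels(last), levels(first)

records = [
    ("Disilva", "John"), ("di S\u00eclva", "Fred"), ("di Silva", "John"),
    ("Disilva", "Fred"), ("di S\u00eclva", "John"), ("di Silva", "Fred"),
]

for last, first in sorted(records, key=field_by_field_key):
    print(f"{last}, {first}")
```

Under these assumptions the unaccented last names come first, the
lowercase-initial form precedes the capitalized one, and the accented
form comes last, with Fred before John inside each last name.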

> and [CE Whitehead does] not [prefer]:
>
> (EXAMPLE 2: sort from UAX 10 samples)
> di Silva, Fred
> di Si'lva, Fred
> Disilva, Fred
> di Silva, John
> di Si'lva, John
> Disilva, John

That is not actually the example shown in the table. Corrected to
weight the accent as a secondary difference, and using merged
columns, the default UCA ordering would be:

di Silva, Fred
Disilva, Fred
di Si'lva, Fred
di Silva, John
Disilva, John
di Si'lva, John
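
The merged-column ordering can be sketched in Python as follows. This
is only an illustration under stated assumptions: levels() is a
hypothetical toy weighting (spaces ignored, accents at level 2, case
at level 3), not the real UCA tables, and merged_key() models merging
by joining each level across the fields with a separator that sorts
below every real weight.

```python
import unicodedata

def levels(text):
    """Toy (primary, secondary, tertiary) weights; NOT the real UCA tables."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if ch.isspace():
            continue                      # assumed "shifted": ignored at levels 1-3
        if unicodedata.combining(ch):
            secondary.append(ord(ch))
            tertiary.append(0)
        else:
            primary.append(ch.lower())
            secondary.append(0)
            tertiary.append(1 if ch.isupper() else 0)
    return primary, secondary, tertiary

def merged_key(record):
    # All primaries of every field are compared before any secondary,
    # and all secondaries before any tertiary.
    per_field = [levels(field) for field in record]
    key = []
    for level, separator in enumerate(("", -1, -1)):  # separators sort lowest
        weights = []
        for index, field_levels in enumerate(per_field):
            if index:
                weights.append(separator)
            weights.extend(field_levels[level])
        key.append(tuple(weights))
    return tuple(key)

records = [
    ("Disilva", "John"), ("di S\u00eclva", "Fred"), ("di Silva", "John"),
    ("Disilva", "Fred"), ("di S\u00eclva", "John"), ("di Silva", "Fred"),
]

for last, first in sorted(records, key=merged_key):
    print(f"{last}, {first}")
```

Because the first names differ at the primary level, all the Fred
records sort ahead of all the John records here; the accent and case
differences in the last name only break ties within each group.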

But I suppose if you don't like merged column sorting, this
correction would be equally shocking. ;-)

> (As an 'aside' or unrelated note: I am kind of shocked by the
> second ordering -- taken from the example in Table 6, section 1.6,
> of UAX 10,

Well, not "taken from", but rather adapted from.

> because that is not how I sort; I suppose the example's purpose
> is to show across-word-boundary collation,

Not exactly. The point is to demonstrate multiple-field sorting
principles for databases. These are not *word* boundaries in
this case, but distinct columns in a database.

If one is doing the equivalent of a select like the following:

select last_name, first_name from clients
where last_name='disilva'
order by last_name, first_name

one expects the first set of results (the field-by-field ordering) --
if, of course, a multilevel collation (with a strength set to ignore
casing and accents for matching) is defined for the table involved.

If one is instead doing the equivalent of a select like the
following (assuming a concatenate operation is defined):

select last_name, first_name from clients
where last_name='disilva'
order by concatenate(last_name, first_name)

then one expects the second set of results (the merged-field ordering).

This doesn't mean that one or the other is necessarily correct.
They are different, and one might prefer one or the other
for various reasons.

> but I am still trying to get used to the example; I am
> apparently what one would call in English a "narrow" person
> when it comes to collation and sorting.

You just may not be used to applications that would do the more
complex operation suggested by that second select statement,
rather than the easier-to-implement first select statement.

As for me, if I were trying to find all the "Fred Disilva" records
in my database, I would certainly prefer the second ordering
over the first, as it would make the results more immediately
usable for me.

> I gather however that the second option is how search engines
> collate as search engines may treat hyphens as being the same
> as white space, and two-word and one-word variants of the otherwise
> same string may be equated too -- just to get more matches in hopes
> of getting the best one -- which is good because we make mistakes

That is an entirely separate issue. Search engines tend to
suppress space and punctuation in matching search *strings*.
You are talking there about *matching* behavior, not *ordering*,
and the question really has nothing to do with word boundaries,
let alone distinct fields in a database.
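
To make the matching/ordering distinction concrete, here is a small
Python sketch. The primary_key() helper is hypothetical: it assumes
that matching compares only base letters, dropping spaces,
punctuation, accents and case. A real search engine or collation
library will have its own rules.

```python
import unicodedata

def primary_key(text):
    """Toy primary-strength key: lowercased letters and digits only.

    Assumption for illustration: spaces, punctuation and combining
    accents are simply dropped, which is not the real UCA behavior.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch.lower() for ch in decomposed if ch.isalnum())

def matches(query, candidate):
    # Matching asks only "are these equal at the chosen strength?"
    return primary_key(query) == primary_key(candidate)

names = ["di Silva", "Di-Silva", "di S\u00eclva", "Disilva", "Dasilva"]
hits = [name for name in names if matches("disilva", name)]
print(hits)
```

All of the variants except "Dasilva" match the query, yet nothing in
this code says in what order the hits should be presented. That is a
separate decision, made by whatever sort key is applied afterwards.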

> -- but I still cannot accept the sort in Table 6)

To each his own, I suppose. :-)

--Ken

