Re: Unicode collation algorithm - interpretation

From: J M Sykes (mike.sykes@acm.org)
Date: Sun Feb 11 2001 - 12:05:37 EST


Jim,

Thanks for the reply, which Hugh had indeed alerted me to expect. See
interpolations below.

> I particularly want to respond to the statement that you made:
>
> >It has been suggested that SQL <collation name> should instead identify
> >both collation element table and maximum level.
>
> I believe that the "maximum level" is built into the
> collation element table inseparably.

I think you misunderstand me. The "maximum level" I was referring to is that
mentioned in UTR#10, section 4, "Main algorithm", 4.3 "Form a sort key for
each string", para 2, which reads:

<quote>
An implementation may allow the maximum level to be set to a smaller level
than the available levels in the collation element array. For example, if
the maximum level is set to 2, then level 3 and higher weights (including
the normalized Unicode string) are not appended to the sort key. Thus any
differences at levels 3 and higher will be ignored, leveling any such
differences in string comparison.
</quote>

There is, of course, an upper limit to the number of levels provided for in
the Collation Element Table and 14651 requires that "The number of levels
that the process supports ... shall be at least three." So I think it's fair
to say we are discussing whether, and if so how, these levels should be made
visible to the SQL user.
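
To make the quoted paragraph concrete, here is a minimal sketch in Python of forming a sort key with a settable maximum level. The table, its weights and the names TOY_TABLE and sort_key are invented for illustration; they are not taken from the Default Unicode Collation Element Table or the 14651 Common Template Table. The sketch also omits normalization, variable weighting and tailoring.

```python
# Toy collation element table: character -> (primary, secondary, tertiary).
# All weights are invented for illustration only.
TOY_TABLE = {
    "a": (1, 1, 1),
    "A": (1, 1, 2),   # differs from "a" only at level 3 (case)
    "á": (1, 2, 1),   # differs from "a" at level 2 (accent)
    "b": (2, 1, 1),
    "B": (2, 1, 2),
}

def sort_key(text, max_level=3):
    """Form a sort key using only the weights up to max_level."""
    elements = [TOY_TABLE[ch] for ch in text]
    key = []
    for level in range(max_level):
        if level > 0:
            key.append(0)  # level separator, lower than any real weight
        # Append the non-zero weights of this level for every element.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)

# With the maximum level set to 2, case differences are levelled out,
# while accent differences still count.
print(sort_key("ab", 2) == sort_key("AB", 2))
print(sort_key("ab", 2) == sort_key("áb", 2))
print(sort_key("ab", 3) == sort_key("AB", 3))
```

In this sketch the table is identical for every call; only the max_level argument changes, which is the sense of "maximum level" I had in mind.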

We can safely assume that at least some users will require sometimes exact,
sometimes inexact comparisons (at least for pseudo-equality, to a lesser
extent for sorting).

We can also safely assume that users will wish to get the performance
benefit of some preprocessing.

It is clearly possible to preprocess as far as the end of step 2 of the
Unicode collation algorithm without committing to a level. I understand you
to say that several implementors have concluded that this level of
preprocessing is not cost-effective, in comparison to going all the way to
the sort key. I am in no position to dispute that conclusion.
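
For what it is worth, the level-independent preprocessing I have in mind could be sketched as follows, again in Python and again with an invented table and hypothetical names (collation_elements, key_from_elements, equal_at). The stored form is the array of collation elements, roughly what exists at the end of step 2, and the level is chosen only when a comparison is made. This says nothing about whether the approach is cost-effective.

```python
# Toy collation element table: character -> (primary, secondary, tertiary).
# All weights are invented for illustration only.
TOY_TABLE = {
    "a": (1, 1, 1),
    "A": (1, 1, 2),
    "á": (1, 2, 1),
    "b": (2, 1, 1),
    "B": (2, 1, 2),
}

def collation_elements(text):
    """Preprocess once: map each character to its collation element."""
    return tuple(TOY_TABLE[ch] for ch in text)

def key_from_elements(elements, max_level):
    """Form a sort key from stored elements at the requested level."""
    key = []
    for level in range(max_level):
        if level > 0:
            key.append(0)  # level separator
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)

# Stored once, for instance alongside a column value (an assumption).
stored = {s: collation_elements(s) for s in ("ab", "AB", "áb")}

def equal_at(a, b, level):
    """Compare two stored strings at a level chosen per comparison."""
    return (key_from_elements(stored[a], level)
            == key_from_elements(stored[b], level))

print(equal_at("ab", "AB", 2))  # case-blind comparison
print(equal_at("ab", "AB", 3))  # exact comparison
```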

> I monitored the email discussions rather a lot during the development of
> ISO 14651 and it seemed awfully likely as a result of the
> discussions (plus conversations that I've had with implementors in
> at least 3 companies) that
> a specific collation would be built by constructing the collation
> element table (as you mentioned in your note) and then "compiling"
> it into the code that actually does the collation.
> That code would *inherently* have built
> into it the levels that were specified in the collation table that was
> constructed. It's not like the code can pick and choose which of the
> levels it wishes to honor.

I'm afraid I don't understand what this is saying. I've seen both the 14651
"Common Template Table" and the Unicode "Default Unicode Collation Element
Table", and assume them to be equivalent, but have not verified that they
are. Neither of them looks particularly "compilable" to me but, in view of
your quotes, I'm not at all clear what you mean by '"compiling" it into the
code that actually does the collation.'

I'm also unclear what an SQL-implementor is likely to supply as "a
collation", though I imagine (only!) that it might be a part only of the
CTT/CET appropriate to the script used by a particular culture, and with
appropriate tailoring. But I have no reason to expect the executable
("compiled"?) code that implements the algorithm to vary depending on the
collation, or on the level (case-blind &c) specified by the user for a
particular comparison.

I find it easier to imagine differences in code depending on whether a
<collate clause> is in a <column definition> or in, say,
WHERE C1 = C2 COLLATE <collation name>.

> Of course, if you really want to specify an SQL collation name that
> somehow identifies 2 or 3 or 4 (or more) collations built in
> conformance with ISO
> 14651 and then use an additional parameter to choose between them, I guess
> that's possible (but not, IMHO, desirable).

Unless you mean for performance reasons, I'd be interested to know why not
desirable.

> However, it would be very
> difficult to enforce a rule that says that the collection of collations so
> identified are "the same" except for the level chosen. One could be
> oriented towards, say, French, and the other towards German or Thai and it
> would be very hard for the SQL engine to know that it was being misled.

I can see a problem in ensuring that COLLATE (Collate_Fr_Fr, 2) bears the
same relation to COLLATE (Collate_Fr_Fr, 1) as COLLATE (Collate_Thai, 2)
bears to COLLATE (Collate_Thai, 1), but I honestly don't know how
significant that is, or even what "the same" ought to mean if Thai has no
cases or diacritics anyway.

This seems almost to be questioning the usefulness of levels. Perhaps they
have values for some cultures but not others. If that's the case, I don't
see that my suggestion is completely invalidated, though its value might be
so seriously reduced as to make it negligible.

Mike.


