Fw: [Fwd: Unicode collation algorithm - interpretation]

From: J M Sykes (mike.sykes@acm.org)
Date: Sat Feb 10 2001 - 13:26:38 EST


Because I have not received a copy of the following via the Unicode List, I
have assumed the sender (who is probably well known to at least some as
editor of the SQL standard) may not currently be a member of the list. Since
he clearly intended this message to go to the list, and because it is
relevant to a question I posted earlier, I hope to be forgiven for taking
the liberty of forwarding it.

Mike.

----- Original Message -----
From: "Jim Melton" <jim.melton@acm.org>
To: "J M Sykes" <mike.sykes@acm.org>; "Unicode List" <unicode@unicode.org>
Cc: "Fred Zemke" <fred.zemke@oracle.com>; <michael.yau@oracle.com>;
<rjenkins@us.oracle.com>; <jim.melton@acm.org>
Sent: Saturday, February 10, 2001 12:34 AM
Subject: [Fwd: Unicode collation algorithm - interpretation]

Mike,

In a message that you sent to the Unicode list on 8 February, you addressed
the question of parameterized invocations of collations:

Date: Thu, 8 Feb 2001 04:49:21 -0800 (GMT-0800)
>In the proposal for better accommodating UCS in SQL, we assumed that a
>comparison performed according to UTR#10, "Unicode Technical Standard #10
>Unicode Collation Algorithm", would require four parameters, viz.
>
> Two strings to be compared
>
> A collation element table
>
> A maximum level as mentioned in UTR#10, section 4.3
> "Form a sort key for each string", which specifies Step 3.
>
>SQL already uses the term 'collation', each of which is identified by a
><collation name>, but does not accommodate the notion that the same
>collation element table can be applied at different levels.
>
>In our proposal, we have assumed that <collation name> identifies a
>collation element table, and have extended SQL syntax to allow the user
>to specify the fourth parameter (or leave it to be defaulted).
>
>It has been suggested that SQL <collation name> should instead identify
>both collation element table and maximum level.
>
>Perhaps the second approach might be useful in the case where, for
>reasons of performance, sort keys are constructed in advance of being
>needed, for example to be stored as 'shadow columns' in SQL base tables,
>or in indexes.
>
>On the other hand, the first approach seems to be more user-friendly in
>the case where at least two collation element tables are available,
>provided their levels correspond (i.e. provided level 2 means
>'case-blind' in both cases).
>
>Would anyone care to comment?

Indeed, I would.

I think you probably were told by Hugh Darwen that he had spoken to me and
that I stated that I thought it unlikely that the code written to implement
the Unicode collation algorithm (more particularly, code written to
implement ISO 14651, the collating standard) would be parameterized to
allow specification of different levels.

I particularly want to respond to the statement that you made:

>It has been suggested that SQL <collation name> should instead identify
>both collation element table and maximum level.

In this statement, your wording makes it appear that the suggestion was
based on some matter of personal taste or something similarly refutable. I
did not respond to this aspect of your draft proposal on the basis of any
whimsy, but on the basis that I do not believe that it is technically
appropriate, even if we can somehow coerce bits of technology into making
this happen. In fact, I believe that the "maximum level" is built into the
collation element table inseparably.

I monitored the email discussions rather a lot during the development of
ISO 14651 and it seemed awfully likely as a result of the discussions (plus
conversations that I've had with implementors in at least 3 companies) that
a specific collation would be built by constructing the collation element
table (as you mentioned in your note) and then "compiling" it into the code
that actually does the collation. That code would *inherently* have built
into it the levels that were specified in the collation table that was
constructed. It's not like the code can pick and choose which of the
levels it wishes to honor.
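
To make the two positions concrete, here is a minimal sketch, assuming a
tiny invented collation element table with three levels. Nothing in it
comes from a real UCA or ISO 14651 table, and the names TOY_TABLE,
sort_key, compare and build_collator are hypothetical. The first half
treats the maximum level as a run-time parameter, as in the quoted
proposal; build_collator then mimics the "compiled" case, where table
and level are fixed once and the caller can no longer choose.

```python
# Toy collation element table: character -> (primary, secondary, tertiary).
# All weights are invented for illustration only.
TOY_TABLE = {
    "a": (1, 1, 1),
    "A": (1, 1, 2),  # differs from "a" only at level 3
    "á": (1, 2, 1),  # differs from "a" only at level 2
    "b": (2, 1, 1),
    "B": (2, 1, 2),
}


def sort_key(s, table, max_level):
    """Form a sort key: all level-1 weights, then level-2, and so on."""
    key = []
    for level in range(max_level):
        if level > 0:
            key.append(0)  # level separator, lower than any real weight
        key.extend(table[ch][level] for ch in s)
    return tuple(key)


def compare(a, b, table, max_level):
    """Four-parameter comparison: two strings, a table, a maximum level."""
    ka = sort_key(a, table, max_level)
    kb = sort_key(b, table, max_level)
    return (ka > kb) - (ka < kb)


def build_collator(table, max_level):
    """'Compile' a collation: table and level are fixed at build time."""
    def key(s):
        return sort_key(s, table, max_level)
    return key


# A collator with the level baked in; callers cannot pick another level.
toy_level2 = build_collator(TOY_TABLE, 2)
```

With this toy table, compare("ab", "AB", TOY_TABLE, 2) returns 0 and
compare("ab", "AB", TOY_TABLE, 3) returns -1, while toy_level2 can only
ever produce the two-level key.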

Of course, if you really want to specify an SQL collation name that somehow
identifies 2 or 3 or 4 (or more) collations built in conformance with ISO
14651 and then use an additional parameter to choose between them, I guess
that's possible (but not, IMHO, desirable). However, it would be very
difficult to enforce a rule that says that the collection of collations so
identified are "the same" except for the level chosen. One could be
oriented towards, say, French, and the other towards German or Thai and it
would be very hard for the SQL engine to know that it was being misled.
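
The enforcement problem can be sketched as follows, assuming a
hypothetical registry in which one collation name maps to several
independently built collators selected by a level parameter. The
tables, the name "DEMO" and the helpers make_collator and sort_with are
all invented for illustration. Nothing in the registry checks that the
entries were built from the same table, so the "level" parameter ends
up changing the primary ordering as well.

```python
def make_collator(table, max_level):
    """Return a sort-key function with table and level fixed at build time."""
    def key(s):
        out = []
        for level in range(max_level):
            if level > 0:
                out.append(0)  # level separator, lower than any real weight
            out.extend(table[ch][level] for ch in s)
        return tuple(out)
    return key


# Two invented two-level tables that disagree on the primary order of a and b.
TABLE_X = {"a": (1, 1), "A": (1, 2), "b": (2, 1), "B": (2, 2)}
TABLE_Y = {"b": (1, 1), "B": (1, 2), "a": (2, 1), "A": (2, 2)}

# Hypothetical registry: one name, with the level picked by an extra parameter.
# The two entries come from different tables, and nothing here detects that.
REGISTRY = {
    "DEMO": {
        1: make_collator(TABLE_X, 1),
        2: make_collator(TABLE_Y, 2),
    }
}


def sort_with(name, level, values):
    """Sort values using the collator registered for (name, level)."""
    return sorted(values, key=REGISTRY[name][level])


words = ["b", "a", "B", "A"]
level1 = sort_with("DEMO", 1, words)  # a-forms before b-forms
level2 = sort_with("DEMO", 2, words)  # b-forms before a-forms
```

Here level1 is ["a", "A", "b", "B"] and level2 is ["b", "B", "a", "A"]:
asking for a finer level silently swapped the primary order, which is
the kind of mismatch the engine has no means of noticing.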

I hope that this allays your apparent assumption that my suggestion was
somehow based on some aspect of personal preference.

Thanks,
    Jim
========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144
Oracle Corporation Oracle Email: mailto:jim.melton@oracle.com
1930 Viscounti Drive Standards email: mailto:jim.melton@acm.org
Sandy, UT 84093-1063 Personal email: mailto:jim.melton@acm.org
USA Fax : +1.801.942.3345
========================================================================
= Facts are facts. However, any opinions expressed are the opinions =
= only of myself and may or may not reflect the opinions of anybody =
= else with whom I may or may not have discussed the issues at hand. =
========================================================================


