Re: Unicode collation algorithm - interpretation]

From: Tex Texin (texin@progress.com)
Date: Sun Feb 11 2001 - 15:03:10 EST


Mike, Jim,

I am confused by this thread so I will offer my perspective.

The collation algorithm is small and can be written to work
flexibly with different levels of sorting.

It is easy to have a parameterized table format so that
tables can have different levels.

I find I need to have the ability to do the following:

1) Users have different requirements, and being able to choose
case-insensitivity, accent-insensitivity, etc. is important.
Therefore having an API for comparison that allows "strength" variations
such as these is important. Users do want to change their comparisons
or sorting dynamically, so it should be built into the api not by
having separate tables. Yes, the code picks and chooses the
number of levels.

Personally, I don't like to build selection of the strength
into the table names, but as there is an equivalency between a
multicomponent name and a name coupled with additional options,
the choice is discretionary.

2) I doubt many applications compile the tables into their
code. Most applications want to have the flexibility to change
tables for different languages. Therefore the tables are externalized
and probably parameterized for efficiency. This also allows the
tables to be field-upgradable and easily improved or modified.

Certainly tables are compiled into binary formats for efficiency.

It is possible, that an application that does not need strong
precision in its sorting might limit its capabilities to fewer
levels. Although there might be some minimal savings in code
and processing for the algorithm, probably the true benefit is
smaller memory or disk footprints of the tables. If developers
said they had compiled in limitations in their programs I would
guess they were referring to memory and disk limitations they
imposed on theirselves.

3) I don't see a reason to build separate tables that are the
same except for different number of levels, for use by the same
software.

I would just have the maximum level table needed by the software
and let the algorithm choose the number of levels to use.

Having different tables for different software does make sense
since the software might have a different max number of levels
and could benefit from lower space requirements, etc.

Also, as the TR points out, whether or not the software
supports combining marks and whether a normalization pass
is made first might impact the tables, so these things vary
from application to application.

hth
tex

J M Sykes wrote:
>
> Jim,
>
> Thanks for the reply, which Hugh had indeed alerted me to expect. See
> interpolations below.
>
> > I particularly want to respond to the statement that you made:
> >
> > >It has been suggested that SQL <collation name> should instead identify
> > >both collation element table and maximum level.
> >
> > I believe that the "maximum level" is built into the
> > collation element table inseparably.
>
> I think you misunderstand me. The "maximum level" I was referring to is that
> mentioned in UTR#10, section 4, "Main algorithm", 4.3 "Form a sort key for
> each string", para 2, which reads:
>
> <quote>
> An implementation may allow the maximum level to be set to a smaller level
> than the available levels in the collation element array. For example, if
> the maximum level is set to 2, then level 3 and higher weights (including
> the normalized Unicode string) are not appended to the sort key. Thus any
> differences at levels 3 and higher will be ignored, leveling any such
> differences in string comparison.
> </quote>
>
> There is, of course, an upper limit to the number of levels provided for in
> the Collation Element Table and 14651 requires that "The number of levels
> that the process supports ... shall be at least three." So I think it's fair
> to say we are discussing whether, and if so how, these levels should be made
> visible to the SQL user.
>
> We can safely assume that at least some users will require sometimes exact,
> sometimes inexact comparisons (at least for pseudo-equality, to a lesser
> extent for sorting).
>
> We can also safely assume that users will wish to get the performance
> benefit of some preprocessing.
>
> It is clearly possible to preprocess as far as the end of step 2 of the
> Unicode collation algorithm without committing to a level. I understand you
> to say that several implementors have concluded that this level of
> preprocessing is not cost-effective, in comparison to going all the way to
> the sort key. I am in no position to dispute that conclusion.
>
> > I monitored the email discussions rather a lot during the development of
> > ISO 14651 and it seemed awfully likely as a result of the
> > discussions (plus conversations that I've had with implementors in
> > at least 3 companies) that
> > a specific collation would be built by constructing the collation
> > element table (as you mentioned in your note) and then "compiling"
> > it into the code that actually does the collation.
> > That code would *inherently* have built
> > into it the levels that were specified in the collation table that was
> > constructed. It's not like the code can pick and choose which of the
> > levels it wishes to honor.
>
> I'm afraid I don't understand what this is saying. I've seen both the 104651
> "Common Template Table" and the Unicode "Default Unicode Collation Element
> Table", and assume them to be equivalent, but have not verified that they
> are. Neither of them looks particularly "compilable" to me but, in view of
> your quotes, I'm not at all clear what you mean by '"compiling" it into the
> code that actually does the collation.'
>
> I'm also unclear what an SQL-implementor is likely to supply as "a
> collation", though I imagine (only!) that it might be a part only of the
> CTT/CET appropriate to the script used by a particular culture, and with
> appropriate tailoring. But I have no reason to expect the executable
> ("compiled"?) code the implements the algorithm to vary depending on the
> collation, or on the level (case-blind &c) specified by the user for a
> particular comparison.
>
> I find it easier to imagine differences in code depending on whether a
> <collate clause> is in a <column definition> or in, say,
> WHERE C1 = C2 COLLATE <collation name>.
>
> > Of course, if you really want to specify an SQL collation name that
> > somehow identifies 2 or 3 or 4 (or more) collations built in
> > conformance with ISO
> > 14651 and then use an additional parameter to choose between them, I guess
> > that's possible (but not, IMHO, desirable).
>
> Unless you mean for performance reasons, I'd be interested to know why not
> desirable.
>
> > However, it would be very
> > difficult to enforce a rule that says that the collection of collations so
> > identified are "the same" except for the level chosen. One could be
> > oriented towards, say, French, and the other towards German or Thai and it
> > would be very hard for the SQL engine to know that it was being misled.
>
> I can see a problem in ensuring that COLLATE (Collate_Fr_Fr, 2) bears the
> same relation to COLLATE (Collate_Fr_Fr, 1) as COLLATE (Collate_Thai, 2)
> bears to COLLATE (Collate_Thai, 1), but I honestly don't know how
> significant that is, or even what "the same" ought to mean if Thai has no
> cases or diacritics anyway.
>
> This seems almost to be questioning the usefulness of levels. Perhaps they
> have values for some cultures but not others. If that's the case, I don't
> see that my suggestion is completely invalidated, though it's value might be
> so seriously reduced as to make it negligible.
>
> Mike.

-- 
According to Murphy, nothing goes according to Hoyle.
--------------------------------------------------------------------------
Tex Texin                      Director, International Business
mailto:Texin@Progress.com      +1-781-280-4271 Fax:+1-781-280-4655
Progress Software Corp.        14 Oak Park, Bedford, MA 01730

http://www.Progress.com #1 Embedded Database

Globalization Program http://www.Progress.com/partners/globalization.htm ---------------------------------------------------------------------------



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT