Re: Unicode collation algorithm - interpretation]

From: Mark Davis (
Date: Sun Feb 11 2001 - 16:26:37 EST

I agree with Tex that the algorithm is small, if implemented in the
straightforward way. I also agree with his #1, #2, and #3. I will add two

1. Where performance is important, and where people start adding options
(e.g. uppercase < lowercase vs. the reverse), the implemenation of collation
becomes rather tricky. Not unexpectedly -- doing any programming task under
memory and performance constraints is what separates the adults from the

People interested in collation may want to take a look at the open-source
ICU design currently being implemented. The document is posted on the ICU
site; I have a copy of the latest version on Feedback is welcome,
of course.

2. There seems to be a confusion between the datatables used to support
collation, and the sort keys -- say for an index -- that are generated from
those tables. As Tex said, in the datatables you would go ahead and include
the level data all the time. Not much value to producing different tables,
except for very limited environments.

On the other hand, it may well be useful to generate sort key indices with
different levels, since that can reduce the size of your index where the
less significant levels are not important. When searching an index using
sort keys, you definitely need to use the same parameters in generating your
query sort key as you uses when generating the sort key index; certain
combinations of parameters will have incomparable sort keys.

If you used a 3-level sort with SHIFTED alternates in your index, for
example, then you need to reproduce that in your query. (BTW, L3_SHIFTED is
probably the most useful combination to use in general; for the vast
majority of applications more than 3 levels simply bloats the index with
little value to end users.)

A loose match can use a query sort key with fewer levels than the sort index
used, but this needs a small piece of logic to generate the upper and lower
bounds for the loose match.


----- Original Message -----
From: "Tex Texin" <>
To: "Unicode List" <>
Cc: "Unicode List" <>; "Fred Zemke"
Sent: Sunday, February 11, 2001 11:42
Subject: Re: Unicode collation algorithm - interpretation]

> Mike, Jim,
> I am confused by this thread so I will offer my perspective.
> The collation algorithm is small and can be written to work
> flexibly with different levels of sorting.
> It is easy to have a parameterized table format so that
> tables can have different levels.
> I find I need to have the ability to do the following:
> 1) Users have different requirements, and being able to choose
> case-insensitivity, accent-insensitivity, etc. is important.
> Therefore having an API for comparison that allows "strength" variations
> such as these is important. Users do want to change their comparisons
> or sorting dynamically, so it should be built into the api not by
> having separate tables. Yes, the code picks and chooses the
> number of levels.
> Personally, I don't like to build selection of the strength
> into the table names, but as there is an equivalency between a
> multicomponent name and a name coupled with additional options,
> the choice is discretionary.
> 2) I doubt many applications compile the tables into their
> code. Most applications want to have the flexibility to change
> tables for different languages. Therefore the tables are externalized
> and probably parameterized for efficiency. This also allows the
> tables to be field-upgradable and easily improved or modified.
> Certainly tables are compiled into binary formats for efficiency.
> It is possible, that an application that does not need strong
> precision in its sorting might limit its capabilities to fewer
> levels. Although there might be some minimal savings in code
> and processing for the algorithm, probably the true benefit is
> smaller memory or disk footprints of the tables. If developers
> said they had compiled in limitations in their programs I would
> guess they were referring to memory and disk limitations they
> imposed on theirselves.
> 3) I don't see a reason to build separate tables that are the
> same except for different number of levels, for use by the same
> software.
> I would just have the maximum level table needed by the software
> and let the algorithm choose the number of levels to use.
> Having different tables for different software does make sense
> since the software might have a different max number of levels
> and could benefit from lower space requirements, etc.
> Also, as the TR points out, whether or not the software
> supports combining marks and whether a normalization pass
> is made first might impact the tables, so these things vary
> from application to application.
> hth
> tex
> J M Sykes wrote:
> >
> > Jim,
> >
> > Thanks for the reply, which Hugh had indeed alerted me to expect. See
> > interpolations below.
> >
> > > I particularly want to respond to the statement that you made:
> > >
> > > >It has been suggested that SQL <collation name> should instead
> > > >both collation element table and maximum level.
> > >
> > > I believe that the "maximum level" is built into the
> > > collation element table inseparably.
> >
> > I think you misunderstand me. The "maximum level" I was referring to is
> > mentioned in UTR#10, section 4, "Main algorithm", 4.3 "Form a sort key
> > each string", para 2, which reads:
> >
> > <quote>
> > An implementation may allow the maximum level to be set to a smaller
> > than the available levels in the collation element array. For example,
> > the maximum level is set to 2, then level 3 and higher weights
> > the normalized Unicode string) are not appended to the sort key. Thus
> > differences at levels 3 and higher will be ignored, leveling any such
> > differences in string comparison.
> > </quote>
> >
> > There is, of course, an upper limit to the number of levels provided for
> > the Collation Element Table and 14651 requires that "The number of
> > that the process supports ... shall be at least three." So I think it's
> > to say we are discussing whether, and if so how, these levels should be
> > visible to the SQL user.
> >
> > We can safely assume that at least some users will require sometimes
> > sometimes inexact comparisons (at least for pseudo-equality, to a lesser
> > extent for sorting).
> >
> > We can also safely assume that users will wish to get the performance
> > benefit of some preprocessing.
> >
> > It is clearly possible to preprocess as far as the end of step 2 of the
> > Unicode collation algorithm without committing to a level. I understand
> > to say that several implementors have concluded that this level of
> > preprocessing is not cost-effective, in comparison to going all the way
> > the sort key. I am in no position to dispute that conclusion.
> >
> > > I monitored the email discussions rather a lot during the development
> > > ISO 14651 and it seemed awfully likely as a result of the
> > > discussions (plus conversations that I've had with implementors in
> > > at least 3 companies) that
> > > a specific collation would be built by constructing the collation
> > > element table (as you mentioned in your note) and then "compiling"
> > > it into the code that actually does the collation.
> > > That code would *inherently* have built
> > > into it the levels that were specified in the collation table that was
> > > constructed. It's not like the code can pick and choose which of the
> > > levels it wishes to honor.
> >
> > I'm afraid I don't understand what this is saying. I've seen both the
> > "Common Template Table" and the Unicode "Default Unicode Collation
> > Table", and assume them to be equivalent, but have not verified that
> > are. Neither of them looks particularly "compilable" to me but, in view
> > your quotes, I'm not at all clear what you mean by '"compiling" it into
> > code that actually does the collation.'
> >
> > I'm also unclear what an SQL-implementor is likely to supply as "a
> > collation", though I imagine (only!) that it might be a part only of the
> > CTT/CET appropriate to the script used by a particular culture, and with
> > appropriate tailoring. But I have no reason to expect the executable
> > ("compiled"?) code the implements the algorithm to vary depending on the
> > collation, or on the level (case-blind &c) specified by the user for a
> > particular comparison.
> >
> > I find it easier to imagine differences in code depending on whether a
> > <collate clause> is in a <column definition> or in, say,
> > WHERE C1 = C2 COLLATE <collation name>.
> >
> > > Of course, if you really want to specify an SQL collation name that
> > > somehow identifies 2 or 3 or 4 (or more) collations built in
> > > conformance with ISO
> > > 14651 and then use an additional parameter to choose between them, I
> > > that's possible (but not, IMHO, desirable).
> >
> > Unless you mean for performance reasons, I'd be interested to know why
> > desirable.
> >
> > > However, it would be very
> > > difficult to enforce a rule that says that the collection of
collations so
> > > identified are "the same" except for the level chosen. One could be
> > > oriented towards, say, French, and the other towards German or Thai
and it
> > > would be very hard for the SQL engine to know that it was being
> >
> > I can see a problem in ensuring that COLLATE (Collate_Fr_Fr, 2) bears
> > same relation to COLLATE (Collate_Fr_Fr, 1) as COLLATE (Collate_Thai, 2)
> > bears to COLLATE (Collate_Thai, 1), but I honestly don't know how
> > significant that is, or even what "the same" ought to mean if Thai has
> > cases or diacritics anyway.
> >
> > This seems almost to be questioning the usefulness of levels. Perhaps
> > have values for some cultures but not others. If that's the case, I
> > see that my suggestion is completely invalidated, though it's value
might be
> > so seriously reduced as to make it negligible.
> >
> > Mike.
> --
> According to Murphy, nothing goes according to Hoyle.
> --------------------------------------------------------------------------
> Tex Texin Director, International Business
> +1-781-280-4271 Fax:+1-781-280-4655
> Progress Software Corp. 14 Oak Park, Bedford, MA 01730
> #1 Embedded Database
> Globalization Program
> --------------------------------------------------------------------------

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT