Re: Unicode collation algorithm - interpretation]

From: Mark Davis (markdavis34@home.com)
Date: Sun Feb 11 2001 - 16:26:37 EST

Next message: Jungshik Shin: "Re: [OT] RE: FW: extracting words"
Previous message: Mark Davis: "Re: extracting words"
Maybe in reply to: J M Sykes: "Re: Unicode collation algorithm - interpretation]"
Next in thread: Jim Melton: "Re: Unicode collation algorithm - interpretation]"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

I agree with Tex that the algorithm is small, if implemented in the
straightforward way. I also agree with his #1, #2, and #3. I will add two
things:

1. Where performance is important, and where people start adding options
(e.g. uppercase < lowercase vs. the reverse), the implemenation of collation
becomes rather tricky. Not unexpectedly -- doing any programming task under
memory and performance constraints is what separates the adults from the
children.

People interested in collation may want to take a look at the open-source
ICU design currently being implemented. The document is posted on the ICU
site; I have a copy of the latest version on
http://www.macchiato.com/uca/ICU_collation_design.htm. Feedback is welcome,
of course.

2. There seems to be a confusion between the datatables used to support
collation, and the sort keys -- say for an index -- that are generated from
those tables. As Tex said, in the datatables you would go ahead and include
the level data all the time. Not much value to producing different tables,
except for very limited environments.

On the other hand, it may well be useful to generate sort key indices with
different levels, since that can reduce the size of your index where the
less significant levels are not important. When searching an index using
sort keys, you definitely need to use the same parameters in generating your
query sort key as you uses when generating the sort key index; certain
combinations of parameters will have incomparable sort keys.

If you used a 3-level sort with SHIFTED alternates in your index, for
example, then you need to reproduce that in your query. (BTW, L3_SHIFTED is
probably the most useful combination to use in general; for the vast
majority of applications more than 3 levels simply bloats the index with
little value to end users.)

A loose match can use a query sort key with fewer levels than the sort index
used, but this needs a small piece of logic to generate the upper and lower
bounds for the loose match.

Mark

----- Original Message -----
From: "Tex Texin" <texin@progress.com>
To: "Unicode List" <unicode@unicode.org>
Cc: "Unicode List" <unicode@unicode.org>; "Fred Zemke"
<fred.zemke@oracle.com>
Sent: Sunday, February 11, 2001 11:42
Subject: Re: Unicode collation algorithm - interpretation]

> Mike, Jim,
>
> I am confused by this thread so I will offer my perspective.
>
> The collation algorithm is small and can be written to work
> flexibly with different levels of sorting.
>
> It is easy to have a parameterized table format so that
> tables can have different levels.
>
> I find I need to have the ability to do the following:
>
> 1) Users have different requirements, and being able to choose
> case-insensitivity, accent-insensitivity, etc. is important.
> Therefore having an API for comparison that allows "strength" variations
> such as these is important. Users do want to change their comparisons
> or sorting dynamically, so it should be built into the api not by
> having separate tables. Yes, the code picks and chooses the
> number of levels.
>
> Personally, I don't like to build selection of the strength
> into the table names, but as there is an equivalency between a
> multicomponent name and a name coupled with additional options,
> the choice is discretionary.
>
> 2) I doubt many applications compile the tables into their
> code. Most applications want to have the flexibility to change
> tables for different languages. Therefore the tables are externalized
> and probably parameterized for efficiency. This also allows the
> tables to be field-upgradable and easily improved or modified.
>
> Certainly tables are compiled into binary formats for efficiency.
>
> It is possible, that an application that does not need strong
> precision in its sorting might limit its capabilities to fewer
> levels. Although there might be some minimal savings in code
> and processing for the algorithm, probably the true benefit is
> smaller memory or disk footprints of the tables. If developers
> said they had compiled in limitations in their programs I would
> guess they were referring to memory and disk limitations they
> imposed on theirselves.
>
> 3) I don't see a reason to build separate tables that are the
> same except for different number of levels, for use by the same
> software.
>
> I would just have the maximum level table needed by the software
> and let the algorithm choose the number of levels to use.
>
> Having different tables for different software does make sense
> since the software might have a different max number of levels
> and could benefit from lower space requirements, etc.
>
> Also, as the TR points out, whether or not the software
> supports combining marks and whether a normalization pass
> is made first might impact the tables, so these things vary
> from application to application.
>
> hth
> tex
>
>
> J M Sykes wrote:
> >
> > Jim,
> >
> > Thanks for the reply, which Hugh had indeed alerted me to expect. See
> > interpolations below.
> >
> > > I particularly want to respond to the statement that you made:
> > >
> > > >It has been suggested that SQL <collation name> should instead
identify
> > > >both collation element table and maximum level.
> > >
> > > I believe that the "maximum level" is built into the
> > > collation element table inseparably.
> >
> > I think you misunderstand me. The "maximum level" I was referring to is
that
> > mentioned in UTR#10, section 4, "Main algorithm", 4.3 "Form a sort key
for
> > each string", para 2, which reads:
> >
> > <quote>
> > An implementation may allow the maximum level to be set to a smaller
level
> > than the available levels in the collation element array. For example,
if
> > the maximum level is set to 2, then level 3 and higher weights
(including
> > the normalized Unicode string) are not appended to the sort key. Thus
any
> > differences at levels 3 and higher will be ignored, leveling any such
> > differences in string comparison.
> > </quote>
> >
> > There is, of course, an upper limit to the number of levels provided for
in
> > the Collation Element Table and 14651 requires that "The number of
levels
> > that the process supports ... shall be at least three." So I think it's
fair
> > to say we are discussing whether, and if so how, these levels should be
made
> > visible to the SQL user.
> >
> > We can safely assume that at least some users will require sometimes
exact,
> > sometimes inexact comparisons (at least for pseudo-equality, to a lesser
> > extent for sorting).
> >
> > We can also safely assume that users will wish to get the performance
> > benefit of some preprocessing.
> >
> > It is clearly possible to preprocess as far as the end of step 2 of the
> > Unicode collation algorithm without committing to a level. I understand
you
> > to say that several implementors have concluded that this level of
> > preprocessing is not cost-effective, in comparison to going all the way
to
> > the sort key. I am in no position to dispute that conclusion.
> >
> > > I monitored the email discussions rather a lot during the development
of
> > > ISO 14651 and it seemed awfully likely as a result of the
> > > discussions (plus conversations that I've had with implementors in
> > > at least 3 companies) that
> > > a specific collation would be built by constructing the collation
> > > element table (as you mentioned in your note) and then "compiling"
> > > it into the code that actually does the collation.
> > > That code would *inherently* have built
> > > into it the levels that were specified in the collation table that was
> > > constructed. It's not like the code can pick and choose which of the
> > > levels it wishes to honor.
> >
> > I'm afraid I don't understand what this is saying. I've seen both the
104651
> > "Common Template Table" and the Unicode "Default Unicode Collation
Element
> > Table", and assume them to be equivalent, but have not verified that
they
> > are. Neither of them looks particularly "compilable" to me but, in view
of
> > your quotes, I'm not at all clear what you mean by '"compiling" it into
the
> > code that actually does the collation.'
> >
> > I'm also unclear what an SQL-implementor is likely to supply as "a
> > collation", though I imagine (only!) that it might be a part only of the
> > CTT/CET appropriate to the script used by a particular culture, and with
> > appropriate tailoring. But I have no reason to expect the executable
> > ("compiled"?) code the implements the algorithm to vary depending on the
> > collation, or on the level (case-blind &c) specified by the user for a
> > particular comparison.
> >
> > I find it easier to imagine differences in code depending on whether a
> > <collate clause> is in a <column definition> or in, say,
> > WHERE C1 = C2 COLLATE <collation name>.
> >
> > > Of course, if you really want to specify an SQL collation name that
> > > somehow identifies 2 or 3 or 4 (or more) collations built in
> > > conformance with ISO
> > > 14651 and then use an additional parameter to choose between them, I
guess
> > > that's possible (but not, IMHO, desirable).
> >
> > Unless you mean for performance reasons, I'd be interested to know why
not
> > desirable.
> >
> > > However, it would be very
> > > difficult to enforce a rule that says that the collection of
collations so
> > > identified are "the same" except for the level chosen. One could be
> > > oriented towards, say, French, and the other towards German or Thai
and it
> > > would be very hard for the SQL engine to know that it was being
misled.
> >
> > I can see a problem in ensuring that COLLATE (Collate_Fr_Fr, 2) bears
the
> > same relation to COLLATE (Collate_Fr_Fr, 1) as COLLATE (Collate_Thai, 2)
> > bears to COLLATE (Collate_Thai, 1), but I honestly don't know how
> > significant that is, or even what "the same" ought to mean if Thai has
no
> > cases or diacritics anyway.
> >
> > This seems almost to be questioning the usefulness of levels. Perhaps
they
> > have values for some cultures but not others. If that's the case, I
don't
> > see that my suggestion is completely invalidated, though it's value
might be
> > so seriously reduced as to make it negligible.
> >
> > Mike.
>
> --
> According to Murphy, nothing goes according to Hoyle.
> --------------------------------------------------------------------------
> Tex Texin Director, International Business
> mailto:Texin@Progress.com +1-781-280-4271 Fax:+1-781-280-4655
> Progress Software Corp. 14 Oak Park, Bedford, MA 01730
>
> http://www.Progress.com #1 Embedded Database
>
> Globalization Program
> http://www.Progress.com/partners/globalization.htm
> --------------------------------------------------------------------------
-

Next message: Jungshik Shin: "Re: [OT] RE: FW: extracting words"
Previous message: Mark Davis: "Re: extracting words"
Maybe in reply to: J M Sykes: "Re: Unicode collation algorithm - interpretation]"
Next in thread: Jim Melton: "Re: Unicode collation algorithm - interpretation]"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT