Re: Case blind comparison

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Aug 11 1997 - 13:25:54 EDT


Keld responded to Gary Roberts:

>
> > Thanks for the information, but I don't understand why this is
> > important for a `case-folded' `loose comparison'.
> >
> > From a user standpoint, they are asking for a case blind comparison.
> > What characters do they want to be equal?
>
> Please also look at the new ISO standard in preparation, ISO 14651.
> It has a function, that does case-folded "loose comparison",
> and it has a template table for all of UCS.
>
> The spec is available at the www.dkuug.dk/jtc1/sc22/wg20/prot page
>
> Keld
>

Well, um, yes but...

ISO/IEC CD 14651 - "International String Ordering - Method for comparing
Character Strings and Description of a Default Tailorable Ordering"
does in fact contain the description of a proposed standard API (in
the Posix model) for doing string comparison. The comparison operation
API consists of "three subprogrammes called COMPCAR, COMPBIN and
CARABIN..." Removing the standards gobbledygook, COMPCAR does comparison
directly on strings, CARABIN generates binary key structures based
on the multilevel tables to enable stored, precomputed keys, and
COMPBIN does comparison based on the binary key structures.
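The division of labor among the three subprograms can be sketched as follows. A minimal illustration, assuming nothing about the CD's actual (Posix-model) C signatures: the names COMPCAR, CARABIN, and COMPBIN come from the CD, but the toy two-level key format here is purely my own placeholder for the multilevel weighting.

```python
def carabin(s):
    """Transform a string into an implementation-private binary sort key
    (cf. strxfrm).  Toy two-level key: primary weights (case-folded
    letters), a level separator, then case weights."""
    primary = s.lower().encode()
    caselvl = bytes(1 if c.isupper() else 0 for c in s)
    return primary + b"\x00" + caselvl

def compbin(k1, k2):
    """Compare two precomputed binary keys bytewise (cf. memcmp)."""
    return (k1 > k2) - (k1 < k2)

def compcar(s1, s2):
    """Compare two strings directly (cf. strcoll)."""
    return compbin(carabin(s1), carabin(s2))

print(compcar("alpha", "alpha"))  # 0
print(compcar("Alpha", "alpha"))  # nonzero: case decided at the lower level
```

The point of the CARABIN/COMPBIN split is that a database can store the precomputed keys and sort them with a plain bytewise comparison, without re-deriving the multilevel weights on every compare.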

While CD 14651 spells out the multilevel sorting model in great
detail (based largely on the Canadian standard), the comparison
operation API leaves much to be desired. In particular, CARABIN
is modeled on strxfrm(), so that low levels of conformance can
simply use strxfrm() and claim conformance with the standard.
But CARABIN does not specify the binary structure required for
higher levels of conformance. Basically it leaves that up to
the implementer as an implementation detail. Essentially the
binary string is a private structure only consumable by the
corresponding implementation of COMPBIN. This is feasible, I suppose,
but it does mean that the proposed international standard is not
proposing a transmissible *data* standard, but instead an API
with multiple levels of conformance and multiple options that
will leave typical Unix implementations with subtle mismatches
and incompatibilities.
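The strxfrm() heritage is easy to see in the Posix C library itself: strxfrm() produces a key whose ordering agrees with strcoll(), but whose byte format is private to the implementation and locale that produced it. A minimal sketch using Python's locale module, which wraps those same C functions (the "C" locale is chosen only so the example runs everywhere):

```python
import locale

# Use the portable "C" locale; the key format produced is private to
# this particular C library and locale setting.
locale.setlocale(locale.LC_COLLATE, "C")

key_lower = locale.strxfrm("apple")
key_upper = locale.strxfrm("Apple")

# Within one implementation and locale, comparing keys agrees with
# comparing the strings directly...
assert (key_lower > key_upper) == (locale.strcoll("apple", "Apple") > 0)
# ...but nothing in the key format is specified for interchange: a key
# computed by another C library (or under another locale) is just bytes
# with no defined relationship to these.
```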

What about COMPCAR for loose comparison? The problem is that
COMPCAR only introduces the *possibility* of loose comparison
at level 5 conformance, the highest level with the most stringent
requirements on implementation. In particular, the level "parameter
is mandatory only for conformance level 5. When it is not present,
the assumed value of this parameter is zero, which implies that the
comparison is done up to the last available level." But comparison
up to the last available level implies the full use of the
multilevel algorithm, including ignorables (as is appropriate
for deterministic sorting algorithms), and doesn't get you loose
comparison. Loose comparison, as for Gary's "case blind comparison",
requires specific omission of one or more levels in the generic
algorithm, and in particular requires truly ignoring ignorables,
instead of using them for tie-breaking. But ISO 14651 is going
to be irrelevant for loose comparison except for those Posix-
compliant platforms which choose to implement COMPCAR at the
highest level of conformance.
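To make the distinction concrete, here is a toy multilevel model (illustrative weights of my own, not the CD's template table) in which a hyphen plays the role of an ignorable. A deterministic comparison uses every level, keeping the ignorables as a final tie-breaker; a case-blind loose comparison must omit the case level and truly drop the ignorables:

```python
def key(s, loose=False):
    """Toy multilevel sort key.  Primary level: case-folded letters with
    the toy ignorable '-' dropped.  Case level: upper/lower bits.  Final
    level: positions of the ignorables, kept only as a deterministic
    tie-breaker.  A loose (case-blind) comparison omits the last two."""
    chars = [c for c in s if c != "-"]
    primary = tuple(c.lower() for c in chars)
    if loose:
        return (primary,)          # truly ignore case *and* ignorables
    caselvl = tuple(c.isupper() for c in chars)
    tiebrk = tuple(i for i, c in enumerate(s) if c == "-")
    return (primary, caselvl, tiebrk)

def compare(a, b, loose=False):
    ka, kb = key(a, loose), key(b, loose)
    return (ka > kb) - (ka < kb)

print(compare("co-op", "COOP", loose=True))  # 0: case-blind match
print(compare("co-op", "coop"))              # nonzero: '-' breaks the tie
```

Note that "comparison up to the last available level" corresponds to the `loose=False` path: the ignorable still decides the outcome, which is exactly what a case-blind equality test must not allow.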

> and it has a template table for all of UCS.

I must take issue with this statement, as well. The template table
does make a serious effort to cover Latin, Greek, Cyrillic,
Armenian, Hebrew, and Arabic. The basic problem is that it treats
all other scripts as consisting of ignorables, which clearly
produces incorrect results for a default collation. So the
coverage can hardly claim to be for "all" of UCS, except in a
very defective sense. Furthermore, there are inconsistencies
in the treatment of accents between the scripts which are
covered. Combining marks are not covered in any way which could
be considered consistent with Unicode, which would result in
erratic and inconsistent results if the comparison APIs are
applied to Unicode data which includes combining marks. And
the specification of the table template itself follows the Posix
charset model, resulting in a table whose significance for ordering
of characters cannot be determined by inspection outside the
context of an actual implementation of the weighting scheme
implied. These and other defects have been noted in the U.S.
comments on the CD 14651 document.
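The combining-mark point is easy to illustrate with data a Unicode application will actually encounter: the same text can arrive precomposed or decomposed, and a weight table keyed naively on code points will treat the two spellings differently. (The unicodedata module below is merely a convenient illustration of the equivalence; it is not part of the CD.)

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Code point for code point, the two spellings differ, so a table that
# does not handle combining marks coherently will order them apart:
print(precomposed == decomposed)                               # False

# A Unicode-consistent comparison must treat them as the same text:
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```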

So before heaving a sigh of relief that an answer is just
around the corner, and it will be an ISO standard, to boot,
you might want to cast a critical eye on the actual document
that Keld is promoting.

--Ken Whistler



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:36 EDT