Re: Merging combining classes, was: New contribution N2676

From: Philippe Verdy
Date: Sun Oct 26 2003 - 17:09:37 CST

From: "Peter Kirk"
> I see the point, but I would think there was something seriously wrong
> with a database setup which could change its ordering algorithm without
> somehow declaring all existing indexes invalid.

Why would such a SQL engine do so, if what has changed is an external
utility library (for example, one provided by the OS)? The designer may
have assumed, after reading the Unicode standard, that the NF
normalizations would be stable across all earlier and later versions of
Unicode, and that a previously tested full compliance could be reached by
using this external implementation instead of reimplementing it inside
the engine itself.

Such a change of policy in Unicode would reduce the interoperability of
existing compliant systems. What is worse, distributed systems that
exchange data normalized through a common service at one point in time
could later find that their normalizations no longer agree.
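
To make the failure mode concrete, here is a minimal sketch in Python. It
is a toy model only: canonical_reorder, OLD_CCC and MERGED_CCC are
hypothetical names, the function covers only the canonical ordering step of
normalization, and the "merged" table is an invented change used for
illustration. The two class values in OLD_CCC (230 for U+0307, 220 for
U+0323) are the real ones.

```python
def canonical_reorder(text, ccc):
    """Toy canonical ordering: stable-sort each run of combining marks
    by combining class. Characters missing from `ccc` count as class 0."""
    out, run = [], []
    for ch in text:
        if ccc.get(ch, 0) == 0:
            run.sort(key=lambda c: ccc[c])  # list.sort is stable
            out.extend(run)
            run = []
            out.append(ch)
        else:
            run.append(ch)
    run.sort(key=lambda c: ccc[c])
    out.extend(run)
    return "".join(out)

# Real classes: U+0307 DOT ABOVE = 230, U+0323 DOT BELOW = 220.
OLD_CCC = {"\u0307": 230, "\u0323": 220}
# Hypothetical change: both marks merged into a single class.
MERGED_CCC = {"\u0307": 230, "\u0323": 230}

# An index built while the old table was in force.
stored = "q\u0307\u0323"
index = {canonical_reorder(stored, OLD_CCC): "row 1"}

# The same text, looked up before and after the library changes.
query = "q\u0307\u0323"
found_old = canonical_reorder(query, OLD_CCC) in index
found_new = canonical_reorder(query, MERGED_CCC) in index

print(found_old, found_new)
```

Under the old table the lookup succeeds; under the merged table the query
is no longer reordered, so a binary lookup misses the stored key even though
nothing in the engine itself changed.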

When I was speaking about full-text search capabilities, for example, I
meant that the main role of combining classes is not to create grapheme
clusters, but to allow all canonically equivalent sequences to be handled
with binary comparisons, instead of requiring constant renormalization to
compare every occurrence of canonically equivalent strings. All Unicode
algorithms are defined to handle canonically equivalent strings the same
way, so that they return the same binary results from the same source.
Modifying the canonical equivalences by merging existing combining classes
would therefore affect all standard (or proposed standard) Unicode
algorithms, including the most complex ones such as collation and text
break scanners.
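
As an illustration of the binary-compare point, here is a small sketch
using Python's standard unicodedata module. The sample strings and the
helper name nfd are my own choices; the result depends on the Unicode data
shipped with the interpreter, which is exactly the dependency discussed
above.

```python
import unicodedata

# Two canonically equivalent spellings: the same base letter with
# DOT ABOVE (U+0307) and DOT BELOW (U+0323) typed in opposite orders.
a = "q\u0307\u0323"
b = "q\u0323\u0307"

def nfd(s):
    """Normalize to NFD with the interpreter's Unicode tables."""
    return unicodedata.normalize("NFD", s)

raw_equal = (a == b)                  # binary compare of the raw strings
normalized_equal = (nfd(a) == nfd(b)) # binary compare after normalization

# Combining classes decide the canonical order of the marks.
classes = [unicodedata.combining(c) for c in nfd(a)]

print(raw_equal, normalized_equal, classes)
```

The raw strings differ, but once both are normalized a plain binary
comparison is enough, with the marks ordered by class (220 before 230).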

We have only these choices:
- either modify the bogus combining classes and break the stability pact
for backward compatibility of normalized strings (at least those containing
only characters of the common assigned Unicode subset),
- or duplicate the existing characters at new code points with modified
properties, and deprecate (not forbid) the old ones,
- or include in the standard a way to override the combining class order
(with CGJ or a new, specific and documented CCO control) if it is impossible
to deprecate existing characters.
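
The third option can be sketched with CGJ as it exists today, again
assuming Python's standard unicodedata module. CGJ (U+034F) has combining
class 0, so canonical reordering does not move marks across it. The sample
strings are mine, and a dedicated "CCO" control is only a suggestion above,
so it is not modeled here.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, combining class 0

# DOT ABOVE (class 230) typed before DOT BELOW (class 220).
plain = "q\u0307\u0323"
with_cgj = "q\u0307" + CGJ + "\u0323"

nfd_plain = unicodedata.normalize("NFD", plain)
nfd_cgj = unicodedata.normalize("NFD", with_cgj)

# Without CGJ the marks are swapped into class order.
reordered = (nfd_plain == "q\u0323\u0307")
# With CGJ in between, the typed order survives normalization.
order_kept = (nfd_cgj == with_cgj)

print(unicodedata.combining(CGJ), reordered, order_kept)
```

The cost is that the sequence containing CGJ is no longer canonically
equivalent to the one without it, so this is an override, not a repair of
the classes themselves.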

I support the W3C requirement, which really does need strings normalized
in any version of Unicode to stay normalized in ALL its versions. This
should hold whatever the reason, even if this order is illogical and does
not work well with all linguistic usages. If it ever causes a problem in a
particular language, one has to propose, standardize and use some other
character to solve it, but not alter existing ones.

After all, that is what has long been done in Unicode: not all
characters are unified or given a canonical equivalence, even if those
characters always use the same glyph. Look, for example, at the Greek
characters borrowed into the Latin script or into the Mathematical block:
they were kept separate, not unified and not canonically equivalent, to
preserve the semantics of text using them. Why would this "incoherent"
status for individual characters not also apply to combining sequences
when there are legitimate reasons to deunify them?
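
Two such characters can be checked with Python's standard unicodedata
module. This is only a sketch with examples I picked (MICRO SIGN and
MATHEMATICAL ITALIC SMALL ALPHA); it does not cover every borrowed Greek
letter.

```python
import unicodedata

micro = "\u00b5"           # MICRO SIGN (Latin-1 Supplement)
mu = "\u03bc"              # GREEK SMALL LETTER MU
math_alpha = "\U0001d6fc"  # MATHEMATICAL ITALIC SMALL ALPHA
alpha = "\u03b1"           # GREEK SMALL LETTER ALPHA

def nfc(s):
    return unicodedata.normalize("NFC", s)

def nfkc(s):
    return unicodedata.normalize("NFKC", s)

# Canonical normalization keeps each pair distinct.
micro_kept = (nfc(micro) == micro and nfc(micro) != mu)
math_kept = (nfc(math_alpha) == math_alpha and nfc(math_alpha) != alpha)

# Only compatibility normalization folds them onto the Greek letters.
micro_folded = (nfkc(micro) == mu)
math_folded = (nfkc(math_alpha) == alpha)

print(micro_kept, math_kept, micro_folded, math_folded)
```

Both characters survive NFC unchanged and are mapped to the Greek letters
only by the compatibility form NFKC.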
