Re: Merging combining classes, was: New contribution N2676

From: Peter Kirk
Date: Sun Oct 26 2003 - 13:36:51 CST

On 25/10/2003 19:00, Philippe Verdy wrote:

>From: "Peter Kirk"
>>I can see that there might be some problems in the changeover phase. But
>>these are basically the same problems as are present anyway, and at
>>least putting them into a changeover phase means that they go away
>>gradually instead of being standardised for ever, or however long
>>Unicode is planned to survive for.
>I had already thought about it. But this may cause more troubles in the
>future for handling languages (like modern Hebrew) in which those combining
>classes are not a problem, ...
This needs some clarification. Most modern Hebrew is written without any
combining marks, or sometimes with just a few scattered ones for
specific disambiguation. In such cases the combining classes of Hebrew
marks are irrelevant because they never appear in combination. But
sometimes, especially in texts for children and language learners,
modern Hebrew is written with vowel points, dagesh and sin and shin
dots, although not usually accents. In this case, not just in biblical
Hebrew, the combining classes ARE a problem, because they imply a
canonical order which is illogical as well as hard to render.
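
To make the reordering concrete, here is a minimal sketch using Python's
standard unicodedata module. The variable names and the choice of shin with
shin dot, dagesh and patah are only an illustration; the combining class
values are the ones the module reports.

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ
PATAH = "\u05B7"     # HEBREW POINT PATAH

# Logical order: the consonant, the dot that identifies it, dagesh, then vowel.
logical = SHIN + SHIN_DOT + DAGESH + PATAH

# Canonical reordering sorts each run of marks by ascending combining class.
canonical = unicodedata.normalize("NFD", logical)

# Combining class of each character in the canonical sequence.
classes = [unicodedata.combining(ch) for ch in canonical]

for ch, ccc in zip(canonical, classes):
    print(f"U+{ord(ch):04X} ccc={ccc}")
```

The shin dot ends up last, separated from its consonant by the vowel and
the dagesh.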

>... and where the ordering of combining characters is
>a real bonus that would be lost if combining classes are merged, notably for
>full text searches where the number of order combinations to search could
>explode, as the effective order in occurrences could become unpredictable for
There is no bonus from the ordering of combining classes, but rather
a detrimental effect. Full-text searches are already seriously
complicated because what is logically one character is split in the
canonical order. The relative ordering of sin and shin dot with vowel
points leads to a situation equivalent to the French sequence
<c-cedilla, a> being represented canonically as <c, a, cedilla> - except
that also a dagesh and a meteg may be inserted between the equivalents
of c and cedilla. That is not exactly a bonus if you want to search for
the consonant c-cedilla.
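
A rough sketch of the search problem, again in Python with the standard
unicodedata module. The helper find_pointed and the sample word are
hypothetical, made up for illustration; this is one possible workaround,
not a recommended algorithm.

```python
import unicodedata

SHIN, BET = "\u05E9", "\u05D1"
SHIN_DOT, SIN_DOT = "\u05C1", "\u05C2"
DAGESH, PATAH = "\u05BC", "\u05B7"

# A pointed syllable plus a following letter, in canonical (NFD) order.
word = unicodedata.normalize("NFD", SHIN + SHIN_DOT + DAGESH + PATAH + BET)

# A plain substring search for "shin with shin dot" fails, because the
# vowel and the dagesh now sit between the consonant and its dot.
naive_hit = (SHIN + SHIN_DOT) in word

def find_pointed(text, base, mark):
    """Return indexes (into the NFD text) where `base` carries `mark`
    anywhere in the run of combining marks that follows it."""
    text = unicodedata.normalize("NFD", text)
    hits = []
    i = 0
    while i < len(text):
        if text[i] != base:
            i += 1
            continue
        j = i + 1
        # Collect the whole run of combining marks attached to this base.
        while j < len(text) and unicodedata.combining(text[j]) != 0:
            j += 1
        if mark in text[i + 1:j]:
            hits.append(i)
        i = j
    return hits

print(naive_hit, find_pointed(word, SHIN, SHIN_DOT))
```

The search has to look through the entire mark sequence on each consonant
just to decide which consonant it is.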

Yes, the effective order of occurrences could become unpredictable if
characters were not entered in the recommended order, i.e. words were
misspelled. But that is true in any language: simple searches will not
find misspelled words.

>Of course, if the combining class values were really bogus, a much simpler
>way would be to deprecate some existing characters, allowing new
>applications to use the new replacement characters, and slowly adapt the
>existing documents with the replacement characters whose combining classes
>would be more language-friendly.
This has already been suggested. The problem is the old one that this
effectively deprecates all existing pointed Hebrew text, and
implementations and fonts based on the current definitions.

>>As for requirements that lists
>>are normalised and sorted, I would consider that a process that makes
>>assumptions, without checking, about data received from another process
>>under separate control is a process badly implemented and asking for
>Here the problem is that we will not always have to manage the case of
>separate processes, but also the case of utility libraries: if this library
>is upgraded separately, the application using it may start experiencing
>problems. e.g. I am thinking about the implied sort order in SQL databases
>for table indices: what would happen if the SQL server is stopped just the
>time to upgrade a standard library implementing the normalization among many
>other services, because another security bug such as a buffer overrun is
>solved in another API? When restarting the SQL server with the new library
>implementing the new normalization, nothing would happen, apparently, but
>the sort order would no longer be guaranteed, and stored sorted indices would
>start being "corrupted", in a way that would invalidate binary searches
>(meaning that some unique keys could become duplicated, or not found,
>producing unpredictable results, critical if they are assumed for, say, user
>authentication, or file existence).
I see the point, but I would think there was something seriously wrong
with a database setup which could change its ordering algorithm without
somehow declaring all existing indexes invalid.
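
As a sketch of what "declaring indexes invalid" might look like: the index
records which normalization it was built under and refuses lookups when
that no longer matches. Everything here (SortedIndex, the version tag
format, the NFD sort key) is hypothetical and not taken from any real
database; only unicodedata and bisect are real library modules.

```python
import bisect
import unicodedata

def current_version():
    # Hypothetical tag: Unicode data version plus the normalization form used.
    return f"unicode-{unicodedata.unidata_version}/NFD"

def sort_key(s):
    return unicodedata.normalize("NFD", s)

class SortedIndex:
    def __init__(self, items):
        self.items = list(items)  # original strings, kept so we can rebuild
        self.rebuild()

    def rebuild(self):
        # Re-sort under the current rules and stamp the index with them.
        self.keys = sorted(sort_key(s) for s in self.items)
        self.version = current_version()

    def is_valid(self):
        return self.version == current_version()

    def contains(self, s):
        if not self.is_valid():
            # Binary search over keys sorted under other rules is unsafe.
            raise RuntimeError("index built under different normalization")
        k = sort_key(s)
        i = bisect.bisect_left(self.keys, k)
        return i < len(self.keys) and self.keys[i] == k

index = SortedIndex(["\u05E9\u05C1\u05B7", "abc", "\u00E7a"])
print(index.contains("c\u0327a"))
```

The point is only that the check is cheap compared with silently trusting
a stale sort order.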

Peter Kirk
