Re: Merging combining classes, was: New contribution N2676

From: Peter Kirk
Date: Tue Oct 28 2003 - 11:09:56 CST

On 28/10/2003 04:49, Kent Karlsson wrote:

>Philippe Verdy wrote:
>>There's a counter example with the position of the circumflex on the
>>lowercase t (I can't remember for which language it occurs,
>>sorry), which is
>>in some cases not the one that its combining class would
>>normally take.
>There are also the cases of comma below a small g (Lithuanian),
>which is rendered turned above the g, and of ring below g (IPA)
>that should be rendered above the g... Neither of these invalidate,
>or puts to question, the combining classes of comma below (and
>cedilla...) or ring below, as far as I can see.
Also, in the commonly used Hebrew *transliteration*, the same function
(fricative pronunciation) is indicated by a macron above g and p but
below b, d, k and t, for the same reason. It occurs only with these
letters (and is sometimes also written below h). There might be an
argument for encoding these as g and p plus combining line below, rather
than g and p plus combining macron, especially as the line would
probably be moved below if these were ever capitalised. But there would
then need to be a clear rule that such combining marks are moved from
below to above g and p.

>So far, it has been noticed that some Hebrew and Arabic marks,
>mostly the vowel marks, ...
For Hebrew also dagesh, rafe, sin and shin dots, and meteg; and for
Arabic, shadda. Basically anything with "unique" combining classes, a
concept which seems to have been removed from the text, but not removed
from the database as it should have been.

>... have inappropriate combining classes.
>The solution suggested by the UTC is to use CGJ. But it also has
>to be simple and practicable. Putting a CGJ after each occurrence
>of the characters with badly assigned combining class effectively
>gives them a combining class of 0. Perhaps not ideal, and indeed
>a kludge. But simple and practical. A keyboard layout, for instance,
>can just generate a CGJ after each troublesome Arabic and Hebrew
>mark. With current keyboard layout specification mechanisms,
>that's about the best that can be done on the keyboard side of it.
That depends on the mechanism. With a mechanism such as Keyman, it is
possible to define that, for example, key A
generates <CGJ, patah> if the previous key press generated a dagesh or
sin or shin dot, but just patah if the previous key press generated just
a base character. Such a mechanism can stop superfluous CGJs being
generated in continuously typed text, but it cannot cope properly with
editing as it does not have access to the environment of text already
entered, only to what the keyboard has previously generated. More
comprehensive mechanisms can be defined but they require the keyboard to
have access to the backing store.
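
As a rough illustration of the context rule just described, here is a minimal Python sketch. It is not Keyman code; the class name ContextKeyboard and the helper type_keys are hypothetical, and only the patah rule from the example above is modelled. The keyboard remembers only its own last output, which is exactly why it cannot cope with editing of text already entered.

```python
# Illustrative sketch only: a keyboard that sees nothing but its own
# previous output (no access to the backing store).
CGJ = "\u034F"       # COMBINING GRAPHEME JOINER
PATAH = "\u05B7"     # HEBREW POINT PATAH
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
SIN_DOT = "\u05C2"   # HEBREW POINT SIN DOT
BET = "\u05D1"       # HEBREW LETTER BET

# Marks after which a following patah gets a CGJ prefix.
NEEDS_CGJ_AFTER = {DAGESH, SHIN_DOT, SIN_DOT}


class ContextKeyboard:
    """Hypothetical keyboard whose only context is its last output."""

    def __init__(self):
        self.last_output = ""

    def press(self, char):
        # Emit <CGJ, patah> only when the previous key press ended in
        # dagesh, shin dot or sin dot; otherwise emit the character as is.
        if char == PATAH and self.last_output[-1:] in NEEDS_CGJ_AFTER:
            output = CGJ + char
        else:
            output = char
        self.last_output = output
        return output


def type_keys(chars):
    """Type a sequence of characters on a fresh keyboard."""
    kb = ContextKeyboard()
    return "".join(kb.press(c) for c in chars)
```

Typing bet, dagesh, patah yields <bet, dagesh, CGJ, patah>, while typing bet, patah yields no CGJ. If the user moves the cursor and inserts a patah after an existing dagesh, last_output no longer reflects the surrounding text, so the rule can misfire.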

>Removing superfluous CGJs should be done by a separate utility.
>Trying to build that into normalisation is probably not such a good
Understood. Could it perhaps be defined in Unicode as an additional
pre-normalisation step which is recommended but not required?
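
To make the idea of such a step concrete, here is a minimal Python sketch of what a CGJ-stripping utility might look like. The function name strip_superfluous_cgj is hypothetical, and "superfluous" is assumed to mean only that the CGJ does not block any canonical reordering; a CGJ inserted for some other purpose would need different treatment. A CGJ is kept only when some mark in the run before it has a higher combining class than some mark in the run after it.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0


def strip_superfluous_cgj(text):
    """Drop each CGJ whose removal cannot change canonical ordering.

    Assumption for this sketch: the CGJ is there only to block reordering.
    """
    out = []
    n = len(text)
    for i, ch in enumerate(text):
        if ch != CGJ:
            out.append(ch)
            continue
        # Combining classes of the marks immediately before the CGJ
        # (taken from the output, so earlier removals are accounted for).
        before = []
        for prev in reversed(out):
            ccc = unicodedata.combining(prev)
            if ccc == 0:
                break
            before.append(ccc)
        # Combining classes of the marks immediately after the CGJ.
        after = []
        j = i + 1
        while j < n and unicodedata.combining(text[j]) != 0:
            after.append(unicodedata.combining(text[j]))
            j += 1
        # Without the CGJ the two runs are sorted together; that changes
        # the order only if a mark before outranks a mark after.
        if before and after and max(before) > min(after):
            out.append(ch)
    return "".join(out)
```

With this sketch, <bet, patah, CGJ, dagesh> loses its CGJ, since patah (class 17) already sorts before dagesh (class 21), whereas <bet, dagesh, CGJ, patah> keeps it.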

It would of course be trivial to specify that CGJ or CCO is ignored in
collation. In fact I think CGJ already is. This implies that superfluous
CGJs do not affect searching, sorting and spell checking. As long as
fonts also ignore them (except in special "show all characters" modes),
the main detrimental effect will be to waste a lot of storage space.

>Defining new characters to replace the troublesome ones, a more
>elegant solution, has been rejected by the UTC. On compatibility
>grounds, IIRC.
> /kent k
Was this actually considered and rejected by the UTC? I understood that
the proposal for Hebrew had simply not been proceeded with, on the basis
of widespread opposition expressed on this list and the general
acceptance (including by the UTC, items 96-C20 and 96-A72) of the CGJ
alternative. I am not trying to resurrect
the proposal, which I oppose, but there are people who are still
concerned that it might reappear, and be pushed through the UTC by the
consortium members who support it, without adequate reference back to
the objectors who are not represented on the UTC. So it would be good
news if the UTC had actually rejected it.

Peter Kirk

This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:25 CST