About combining classes

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jun 27 2003 - 06:31:12 EDT


    When I look at the history of combining classes, they did not exist in the first Unicode standard, and they still don't exist in ISO 10646 either.
    This was a technology developed by IBM and offered for free to the community to simplify the management of encoded text, and it was long only informative (as were the proposed normalization forms) before its usefulness was recognized.

    However, if this added character property may break the encoding of some languages (including languages that may be encoded in the future), I think this creates an opportunity to standardize the use of a specific character that allows bypassing the constraints added by these now-standard combining classes when needed.

    The case of Biblical Hebrew is something that will occur again in the future, because combining classes are here to stay for a long time, as they solve many problems with modern languages. Of course the CGJ character works, but there will be more pressure in the future to use such bypassing features when they are really needed for newly encoded text.

    Without such a character (CGJ for example), scripts encoded in the future may simply abandon the idea of assigning non-zero combining classes, even though these would be useful in many cases to detect the *most common* obvious equivalences and to simplify the unification of texts that have the same semantics and graphical rendering.
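
    As a small illustration of the kind of "obvious equivalence" that non-zero combining classes let software detect, here is a sketch in Python. It assumes the standard `unicodedata` module reflects the current combining classes; the sample strings are my own and stand for the same base letter with the same two marks typed in different orders.

```python
import unicodedata

# Same base letter and same two marks, typed in opposite orders.
# U+0323 COMBINING DOT BELOW has combining class 220,
# U+0307 COMBINING DOT ABOVE has combining class 230.
s1 = "a\u0323\u0307"  # dot below, then dot above
s2 = "a\u0307\u0323"  # dot above, then dot below

for mark in ("\u0323", "\u0307"):
    print(f"U+{ord(mark):04X} ccc={unicodedata.combining(mark)}")

# Canonical ordering sorts adjacent marks with different non-zero
# classes, so both spellings normalize to the same sequence.
nfd1 = unicodedata.normalize("NFD", s1)
nfd2 = unicodedata.normalize("NFD", s2)

print(s1 == s2)      # the raw code point sequences differ
print(nfd1 == nfd2)  # the normalized sequences match
```

    If both marks had class 0, no reordering would take place and the two spellings would stay distinct after normalization.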

    We *must not* go back on the encoding of Hebrew. Traditional Hebrew is definitely a distinct language, in the same way as Old Greek, Old Hungarian, or the various regional forms of languages written historically with many variants of diacritics on Latin letters. This problem will become more important when Cuneiform or Phoenician are encoded, and I'm quite certain that many old Brahmic scripts will suffer from the same difficulties when we try to adapt the model adopted for modern Brahmic scripts (a model that works in its own domain).

    If we want to keep Unicode unified, we must not break this unification of characters by assigning new characters when this is not justified (there is *no* clear historical boundary between old and new versions of a language, and scripts have always evolved gradually, sometimes in parallel and with contradictory rules).

    So if we need to be able to encode old historic text, we cannot avoid using some special combining mark in places where the unification with the "modern" usage of the script causes problems. In addition, we can accept that old text will be more difficult to handle in software if, in return, the most common use of the script in modern languages keeps its useful simplifications (such as combining classes).

    Let's keep the combining classes as they are defined now. They are useful but do not solve all the problems tied to the unification of encoded text. Working on old historic text is a matter for specialists and scholars, and all we need to do is offer them a framework in which the modern simplifications will not cause them too many problems.

    That's why I think that using the CGJ combining character is not a "kludge" for Biblical Hebrew. It is an extension of the encoding of the modern script to allow encoding old texts, and the same need will probably appear later when studying manuscripts of Latin, Greek, or Glagolitic texts, where the combining marks have slightly evolved in their glyphic position, meaning that the modern combining class may not be appropriate for the old uses.

    So it is simpler to tell scholars who study old languages that Unicode offers them a way to unify their script with the modern script, provided we allow, and document more clearly, that some control characters or special marks can be used to bypass the constraints defined for the modern script. CGJ, if officially documented as a legal way to override the combining classes of the combining characters that follow it so that they won't be reordered during normalization, may prove useful in many old texts encoded in the future...
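
    The effect described above can be sketched in Python with the standard `unicodedata` module. The sample sequence (LAMED followed by PATAH and then HIRIQ) is my own choice of illustration, and the sketch assumes the module reflects the current combining classes (HIRIQ 14, PATAH 17, CGJ 0). Strictly speaking, CGJ does not change the class of any mark here; it blocks reordering simply because its own class is 0.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED, ccc 0
PATAH = "\u05B7"  # HEBREW POINT PATAH, ccc 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, ccc 14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, ccc 0

def code_points(text):
    """Render a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Intended order: PATAH first, then HIRIQ.
plain = LAMED + PATAH + HIRIQ
with_cgj = LAMED + PATAH + CGJ + HIRIQ

# Without CGJ, canonical ordering sorts the points by combining class,
# so HIRIQ (14) is moved in front of PATAH (17).
plain_nfd = unicodedata.normalize("NFD", plain)

# With CGJ between them, the two points are no longer adjacent
# non-starters, so the intended order survives normalization.
cgj_nfd = unicodedata.normalize("NFD", with_cgj)

print(code_points(plain_nfd))
print(code_points(cgj_nfd))
```

    The cost is that the CGJ stays in the normalized text, so software that compares or searches such text has to be prepared to ignore it.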

    -- Philippe.
