Re: Yerushala(y)im - or Biblical Hebrew

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jul 02 2003 - 08:43:29 EDT

  • Next message: Owen Taylor: "Re: Biblical Hebrew (U+034F Combining Grapheme Joiner works)"

    On Wednesday, July 02, 2003 12:55 PM, Jony Rosenne <rosennej@qsm.co.il> wrote:

    > I would like to summarize my understanding:
    >
    > 1. The sequence Lamed Patah Hiriq is invalid for Hebrew. It is
    > invalid in Hebrew to have two vowels for one letter. It may or may
    > not be a valid Unicode sequence, but there are many examples of valid
    > Unicode sequences that are invalid.

    Only invalid for Modern Hebrew. In addition, we are not discussing the *validity* of the Unicode/ISO 10646 encoding: any Unicode string is valid even if it is not normalized, provided that it uses valid code points, respects a few constraints such as approved variation sequences, and makes correct use of surrogate code units (surrogate code points themselves are forbidden in the text).

    The issue is created by the Unicode normalization of text, which is NOT required for Unicode encoding validity, but only for text processing (notably with legacy HTML and SGML, or the newer XML, XHTML and related standards based on XML).
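
    For readers who want to see this concretely, here is a minimal Python sketch (using only the standard unicodedata module and the combining-class values published in the Unicode Character Database); it shows that an unnormalized string is perfectly valid, while normalization still reorders its combining marks for text-processing purposes:

        import unicodedata

        # A valid but unnormalized string: 'a' followed by COMBINING CIRCUMFLEX
        # ACCENT (U+0302, CC=230) and COMBINING DOT BELOW (U+0323, CC=220).
        s = "a\u0302\u0323"

        # NFD leaves the letter alone but applies canonical reordering:
        # the dot below (CC=220) is moved before the circumflex (CC=230).
        print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)])
        # ['U+0061', 'U+0323', 'U+0302']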

    You have not understood the issue with *Traditional Hebrew*, where there actually are two or more vowels on one base letter, notably in Biblical texts but certainly also in many other manuscripts of the same epochs, and probably later and still today. These texts are important to human culture and have been (and will continue to be) studied by scholars, researchers and interested people, whether they were (are, or will be) historians, sociologists, economists, linguists, translators, theologians, religious adherents, or workers in the many other scientific domains studied for millennia (including mathematics, astronomy, medicine...).

    What has been demonstrated here is that the current combining classes defined for Hebrew characters were not needed for Modern Hebrew (which could have been written perfectly well with all vowels assigned CC=0); instead the vowels were encoded with "randomly assigned" combining classes (the generic 220 and 230 classes not being usable for them).
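
    To make the Hebrew case concrete, here is a small Python sketch (again only the standard unicodedata module and UCD data, nothing specific to this proposal): with Patah assigned CC=17 and Hiriq assigned CC=14, every normalization form swaps the two vowels that a transcriber deliberately entered in the order Patah then Hiriq:

        import unicodedata

        LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"

        # The fixed-position combining classes assigned to the Hebrew points.
        print(unicodedata.combining(PATAH), unicodedata.combining(HIRIQ))  # 17 14

        # The transcriber's order: Lamed, Patah, Hiriq.
        src = LAMED + PATAH + HIRIQ

        # Canonical reordering swaps the vowels because CC(Hiriq) < CC(Patah),
        # silently producing Lamed, Hiriq, Patah in every normalization form.
        for form in ("NFC", "NFD"):
            out = unicodedata.normalize(form, src)
            print(form, [f"U+{ord(c):04X}" for c in out])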

    The initial encoding may have been done by studying only some fragments of the traditional texts, which exposed some combinations of vowels, without really searching through such important traditional texts as the Hebrew Bible (and certainly also some old versions of the Torah, or old Hebrew translations of the Qur'an, or of famous Roman Latin, Greek, Phoenician, or Syriac manuscripts, in a Middle-East region that has seen many foreign invasions and stood at the crossroads of the most famous cultures and trade routes). For all vowels for which no preferred order could be demonstrated (in the studied fragments of text), the combining classes were mostly defined in an order matching the code point order of the legacy 8-bit encodings, on the assumption that occurrences of those vowels would be rare and would not cause problems.

    When new historic scripts are added to Unicode, I do think that Unicode should not make assumptions from a small set of text fragments: further research may demonstrate that a definition of non-zero combining classes introduces too many problems to allow encoding new texts, because an existing normalization would incorrectly swap combining characters and change the semantics of the encoded text. These old texts should be handled on the assumption that the typist who entered and encoded them transcribed them correctly, and a NF* normalization should not change this decision automatically, as it would frustrate all the effort made by the transcriber to produce an accurate transcript of the encoded text.

    I think that if there are reasons to define some combining classes for the normalization of some categories of text, then either we should accept sacrificing the unification of characters whenever it causes a problem, or Unicode and ISO 10646 should accept to define/assign a generic code point with class "Mn" and CC=0, whose only role would be to bypass the currently assigned non-zero CC value of combining characters, even if, temporarily, this causes some problems for text rendering engines (which can be corrected later to treat this character as ignorable for all rendering purposes, including searches for possible ligatures).

    I suggest that such a code point be allocated in the U+03XX block for generic combining characters, so that it can be used in any script, including the existing ones. This character would be named "Combining Variant Selector" (CVS); it would preserve the semantics of the diacritic to which it is prefixed, and it would not override the current semantics of the "Combining Grapheme Joiner" (CGJ), which may have a specific usage to create ligatures between diacritics and which should still be canonically ordered: if diacritic <A> has CC=a and diacritic <B> has CC=b, with a < b, the sequence <A, CGJ, B> would be valid, but not <B, CGJ, A>, unless the combining class of A is overridden with <B, CGJ, CVS, A>.
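
    The proposed CVS does not exist, but the mechanism can be illustrated with the existing CGJ (U+034F), which already happens to have CC=0 in the UCD: any CC=0 character placed between two non-zero-class marks splits them into separate reorderable runs, so canonical reordering no longer swaps them. A minimal Python sketch, with CGJ merely standing in for the proposed CVS:

        import unicodedata

        LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"
        CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, CC=0 (stand-in for the proposed CVS)

        plain   = LAMED + PATAH + HIRIQ          # swapped by normalization
        blocked = LAMED + PATAH + CGJ + HIRIQ    # order preserved by normalization

        for label, s in (("plain", plain), ("blocked", blocked)):
            out = unicodedata.normalize("NFC", s)
            print(label, [f"U+{ord(c):04X}" for c in out])
        # plain   ['U+05DC', 'U+05B4', 'U+05B7']
        # blocked ['U+05DC', 'U+05B7', 'U+034F', 'U+05B4']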

    This definition preserves the current semantics of the CGJ (without extending it too much, in a way that was not intended when it was defined), and it makes it possible to define combining classes for the most usual cases of an encoded script, without compromising the future if rarer texts are discovered for which the initial unification work causes normalization to violate the semantics of the old text.



    This archive was generated by hypermail 2.1.5 : Wed Jul 02 2003 - 09:40:09 EDT