RE: Yerushala(y)im - or Biblical Hebrew

From: Jony Rosenne (
Date: Wed Jul 02 2003 - 10:42:11 EDT

    I cannot agree with some of these statements. My comments are inserted.


    > -----Original Message-----
    > From: Philippe Verdy []
    > Sent: Wednesday, July 02, 2003 2:43 PM
    > To: Jony Rosenne
    > Cc:
    > Subject: Re: Yerushala(y)im - or Biblical Hebrew
    > On Wednesday, July 02, 2003 12:55 PM, Jony Rosenne
    > <> wrote:
    > > I would like to summarize my understanding:
    > >
    > > 1. The sequence Lamed Patah Hiriq is invalid for Hebrew. It is
    > > invalid in Hebrew to have two vowels for one letter. It may or may
    > > not be a valid Unicode sequence, but there are many examples of
    > > valid Unicode sequences that are invalid.
    > Only invalid for Modern Hebrew.

    No - it is true also for Biblical Hebrew and any other. The extra vowel
    belongs to another letter, which is known to exist but isn't printed.

    > In addition, we are not
    > discussing the *validity* of the Unicode/ISO10646
    > encoding (any Unicode string is valid even if it is not
    > normalized, provided that it uses normalized codepoints and
    > respects a few constraints such as approved variant sequences
    > and valid usage of surrogate code units, with no use of
    > surrogate codepoints).

    I tried to say that although it may be valid Unicode, it is not valid
    Hebrew.
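
As an illustration of the distinction drawn here (not part of the original exchange), the following minimal Python sketch separates two properties that are easy to conflate: whether a string can be encoded at all, and whether it is already in normalized form. The helper names `is_nfc` and `encodes_as_utf8` are made up for this example; only the standard `unicodedata` module is assumed.

```python
import unicodedata

def is_nfc(s):
    """True if s is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", s) == s

def encodes_as_utf8(s):
    """True if s can be serialized as UTF-8 (no lone surrogates)."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# LAMED, PATAH, HIRIQ in the order discussed in this thread.
lamed_patah_hiriq = "\u05DC\u05B7\u05B4"
# A surrogate code point used on its own, outside any UTF-16 pair.
lone_surrogate = "\uD800"

print(encodes_as_utf8(lamed_patah_hiriq), is_nfc(lamed_patah_hiriq))
print(encodes_as_utf8(lone_surrogate))
```

The Hebrew sequence is encodable although it is not in NFC, while the lone surrogate is rejected by the encoder. Neither check says anything about whether the sequence is valid Hebrew.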

    > The issue is created by the Unicode normalization of text, which
    > is NOT required for Unicode encoding validity, but only for
    > text processing (notably with the legacy HTML and SGML or the
    > newer XML, XHTML and related standards based on XML).
    > You have not understood the issue with *Traditional Hebrew*,
    > where there are actually two or more vowels for one base
    > letter, notably in Biblical texts but certainly in many other
    > manuscripts of the same epochs, and probably later and still
    > today, as long as these texts, important for human culture,
    > have been (and will be) studied by scholars, researchers, or
    > interested people, whoever they were (are or will be):
    > historians, sociologists, economists, linguists, translators,
    > theologians, religious adepts, or researchers in the many
    > other domains studied for millennia
    > (including mathematics, astronomy, medicine...).

    See above.
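
For readers following the normalization argument, here is a small sketch (an editorial illustration, not something either correspondent posted) of what canonical reordering does to the sequence under discussion. It assumes only Python's standard `unicodedata` module; the helper `names` is a made-up convenience.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED, combining class 0
PATAH = "\u05B7"  # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, combining class 14

def names(s):
    """Return the Unicode character names of s, in order."""
    return [unicodedata.name(ch) for ch in s]

# The order a transcriber might type: patah first, then hiriq.
typed = LAMED + PATAH + HIRIQ

# Normalization sorts adjacent marks with non-zero combining classes
# into ascending class order, so hiriq (14) moves before patah (17).
normalized = unicodedata.normalize("NFC", typed)

print(names(typed))
print(names(normalized))
```

The normalized string carries the same characters in a different order, which is the reordering both sides of this exchange are arguing about.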

    > What has been demonstrated here is that the current combining
    > classes defined on Hebrew characters were not needed for
    > Modern Hebrew (which could have been written perfectly with
    > all vowels defined with CC=0), yet the vowels were encoded
    > with "randomly assigned" combining classes (for which the
    > 220 and 230 classes were not usable).

    Unicode Hebrew points and cantillation marks were defined with Biblical
    Hebrew in mind.
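
The combining classes in question can be inspected directly. The sketch below is an editorial illustration that assumes only Python's standard `unicodedata` module; the selection of four points and the dictionary names are arbitrary choices for the example. Two generic marks are included to show the 230 (above) and 220 (below) classes mentioned in the quoted text.

```python
import unicodedata

# A few Hebrew points; the selection is illustrative, not exhaustive.
HEBREW_POINTS = {
    "SHEVA": "\u05B0",
    "HIRIQ": "\u05B4",
    "PATAH": "\u05B7",
    "QAMATS": "\u05B8",
}

# Generic marks carrying the positional classes 230 and 220.
GENERIC_MARKS = {
    "COMBINING ACUTE ACCENT": "\u0301",
    "COMBINING DOT BELOW": "\u0323",
}

hebrew_classes = {n: unicodedata.combining(c) for n, c in HEBREW_POINTS.items()}
generic_classes = {n: unicodedata.combining(c) for n, c in GENERIC_MARKS.items()}

for name, ccc in {**hebrew_classes, **generic_classes}.items():
    print(f"{name}: {ccc}")
```

Each Hebrew point has its own fixed-position class, so any two different points on the same letter are subject to canonical reordering.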

    > The initial encoding may have been done by studying only some
    > fragments of the traditional texts, which exposed some
    > combinations of vowels, without really searching in such
    > important traditional texts as the Hebrew Bible (and
    > also certainly in some old versions of the Torah, or some old
    > translations into Hebrew of the Quran, or of famous Roman
    > Latin, Greek, Phoenician, or Syriac manuscripts, in a
    > Middle-East region that has seen a lot of foreign invasions
    > and been at the crossroads of the most famous cultures and
    > commercial roads). For all vowels for which there did not
    > seem to exist a demonstrated preference order (in the studied
    > fragments of text), the combining classes have been mostly
    > defined in an order matching the codepoint order in the legacy
    > 8-bit encodings, thinking that occurrences of those vowels
    > would be rare and would not cause problems.

    There are no such cases, barring misunderstandings.

    > When new historic scripts are added to Unicode, I do
    > think that Unicode should not make assumptions from a small
    > set of text fragments: further research may demonstrate
    > that a definition of non-zero combining classes would
    > introduce too many problems to allow encoding new texts, for
    > which an existing normalization would incorrectly swap
    > combining letters and change the semantics of the encoded
    > text. These old texts should be handled assuming that the
    > typist who entered and encoded them was correct in the
    > transcription, and an NF* normalization should not change this
    > decision automatically, as it would frustrate all the efforts
    > made by the transcriber to produce an accurate
    > transcript of the encoded text.
    > I think that if there are some reasons to define some
    > combining classes for the normalization of some categories of
    > text, we should accept sacrificing the unification of
    > characters each time it causes a problem, or Unicode and
    > ISO10646 should agree to define/assign a generic codepoint
    > with class "Mn", CC=0, whose only role will be to bypass the
    > currently assigned non-zero CC value of combining characters,
    > even if, temporarily, this causes some problems for text
    > rendering engines (which can be corrected later to consider
    > this character as ignorable for all rendering purposes,
    > including searches for possible ligatures).
    > I suggest that such a codepoint be allocated in the U+03XX
    > block for generic combining characters, so that it can be
    > used in any script, including the existing ones. This
    > character would be named "Combining Variant Selector" (CVS),
    > it would preserve the semantics of the diacritic to which it
    > is prefixed, and it would not override the current semantics
    > of the "Combining Grapheme Joiner" (CGJ), which may have
    > specific usage to create ligatures between diacritics, and
    > which should still continue to be canonically ordered, so that
    > if the diacritic <A> has a CC=a and diacritic <B> has a CC=b,
    > and if (a < b), the sequence <A, CGJ, B> would be valid, but
    > not <B, CGJ, A> unless the combining class of A is overridden
    > with <B, CGJ, CVS, A>.
    > This definition preserves the current semantics of the CGJ
    > (without extending it too much in a way that was not intended
    > when it was defined), and it makes it possible to define
    > combining classes for the most usual cases of an encoded
    > script without compromising the future, if rarer texts
    > are discovered where the initial unification work violates
    > the semantics of the old text under normalization.
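
The mechanism the proposal relies on, a character with combining class 0 placed between two marks to stop canonical reordering, can be sketched with a character that already exists. The proposed CVS is hypothetical and has no code point, so this editorial sketch substitutes U+034F COMBINING GRAPHEME JOINER, whose combining class is 0, purely to show the blocking effect. It does not implement the CGJ/CVS ordering rules proposed in the quoted text, and it assumes only Python's standard `unicodedata` module.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED
PATAH = "\u05B7"  # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0

# Stand-in for the hypothetical CVS: any class-0 character between the
# two points splits them into separate reordering runs.
plain = LAMED + PATAH + HIRIQ
guarded = LAMED + PATAH + CGJ + HIRIQ

plain_nfc = unicodedata.normalize("NFC", plain)
guarded_nfc = unicodedata.normalize("NFC", guarded)

print("plain order kept:  ", plain_nfc == plain)
print("guarded order kept:", guarded_nfc == guarded)
```

Without the intervening character the points are swapped; with it, the typed order survives normalization, at the cost of an extra invisible character that rendering and searching have to ignore.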
