L2/03-227

Date: Fri, 18 Jul 2003 12:26:18 -0700 (PDT)
Subject: A threat to the integrity of the Hebrew language?

Rick, I think this contribution by Peter Kirk should be given an L2 document number and be added to the agenda for the discussion of SIL's proposal for Hebrew. There is a very significant analysis here, with details about actual occurrences of the problem, and strong support for the alternative position that I had advocated for this problem. --Ken

------------- Begin Forwarded Message -------------

Date: Fri, 18 Jul 2003 08:49:06 -0700
To: Biblical Hebrew, Hebrew Computing list, Biblical Languages

No, I am not making allegations of another anti-Semitic or anti-Israel plot. There are no Hamans to be unmasked and hanged. The people involved have the best of intentions: to provide proper computer support for biblical Hebrew. But what they propose has the effect of requiring biblical Hebrew to be encoded on computers in a different and incompatible way from modern Hebrew. That is, they are proposing to erect an artificial barrier between biblical and modern Hebrew. The proposal will also leave existing Hebrew software unable to handle the biblical text. But I hope the subject line caught your attention for this important issue.

As this proposal has powerful backing, including from Microsoft, there is a real chance that it might be accepted, although it has also run into some criticism. It is likely to be considered formally at a meeting in late August. Some of those who will be involved in the discussions have specifically asked for more input, especially from Israeli and Jewish scholars. So if any of you have any comments to make, either for or against this proposal, please speak up now. I will pass on any comments I receive (unless requested otherwise) to those who are considering the proposal, via the Unicode list, or more personally if requested or if that seems more appropriate to me. I can withhold names if requested, but attributable comments will be given greater weight.

(I am sending this to several mailing lists and by blind copy to a number of individual Hebrew experts with whom I have contact, plus a few others who are likely to be interested. Apologies if anyone receives unwelcome multiple copies.)

THE PROPOSAL

The proposal I am referring to has been made to the technical committee of the Unicode Consortium (www.unicode.org), which defines character encoding standards for every language of the world; these standards also become official ISO international standards and are used in most recent software products and on many Internet pages. The proposal is likely to be considered formally at the next meeting of this committee in late August. In the last month there has been some critical discussion of it, and some alternative suggestions, on the Unicode list; see http://www.unicode.org/consortium/distlist.html.

The proposal itself, "Proposal to Encode Alternative Characters for Biblical Hebrew", can be found at http://scripts.sil.org/cms/sites/nrsi/media/BibHebAltCharsProposal.pdf, and is of course rather technical. So I have written a less technical summary of the proposal and the justification for it, and included it below.

The essence of the proposal is to define a partially separate set of Unicode characters for biblical Hebrew, in addition to the existing codes for Hebrew consonants, vowel points (nikud), other points and cantillation marks (teamim). The proposed new characters are BH versions of the eleven vowel points and of meteg/silluq, shin dot and sin dot.
The intention is that biblical texts should be encoded with these new vowels and points, and that the existing vowel and point encodings should be used only for pointed texts in modern (or at least non-biblical) Hebrew, Yiddish etc. The encoding of consonants, other points and accents is not affected.

The proposers claim that there are no alternative solutions, but this is not true. There are at least two other ways, each with some variations, in which these problems could be solved. The first is to correct the inadequacies in the existing Unicode specification; this has unfortunately been ruled out very definitely, although the grounds for doing so are quite spurious (it is claimed that a change would invalidate existing text, but the only text which would be invalidated is existing biblical text, which will be far more seriously invalidated by the newly proposed changes). But there is a second alternative way, which would require much less adjustment to existing text. I have outlined this below, after the less technical summary.

PROBABLE CONSEQUENCES

Here are some probable consequences if the proposal is accepted and the encoding of biblical Hebrew texts is adjusted accordingly:

On the positive side: It will be possible for biblical Hebrew text to be input, stored, processed and displayed completely, accurately and efficiently on computer-based systems and on the Internet. Some existing minor problems with this will be ironed out.

On the negative side: Most existing software and operating systems which currently input, process and display both modern and biblical Hebrew (though with some minor problems) will become unable to input, process or properly display biblical Hebrew at all. Some such software may be fixable by providing new fonts, keyboard drivers etc., but there is no guarantee of this. Much of it (e.g. Israeli Windows 98 and anything else which depends on CP1255) will almost certainly not be fixed. Similarly, existing fonts will not support biblical Hebrew. Software and font providers may be unwilling, or very slow, to upgrade their products to meet the new standard.

Newly encoded biblical texts will be stored in a form incompatible with non-biblical Hebrew texts, and with biblical texts already encoded according to the current standard. This will confuse information search and retrieval systems, and confuse users, who will not understand why the same searches do not find the same words in biblical and in modern texts.

It is unclear which encoding system should be used for biblical passages quoted in more recent Hebrew texts. In such cases, at least in the absence of very clear rules, documents are likely to be produced in which the two different systems are mixed in an unpredictable way, and which therefore cannot be searched reliably. It is also unclear which system should be used for ancient but non-biblical Hebrew and Aramaic texts.

LESS TECHNICAL SUMMARY OF THE PROPOSAL

This is a less technical summary of "Proposal to Encode Alternative Characters for Biblical Hebrew", http://scripts.sil.org/cms/sites/nrsi/media/BibHebAltCharsProposal.pdf.

The essence of the proposal is to define a partially separate set of Unicode characters for biblical Hebrew, in addition to the existing codes for Hebrew consonants, vowel points (nikud), other points and cantillation marks (teamim). The proposed new characters are BH versions of the eleven vowel points and of meteg/silluq, shin dot and sin dot.
The intention is that biblical texts should be encoded with these new vowels and points, and that the existing vowel and point encodings should be used only for pointed texts in modern (or at least non-biblical) Hebrew, Yiddish etc. The encoding of consonants, other points and accents is not affected.

The justification for this is that the current Unicode standard is in some minor ways inadequate for the representation of biblical Hebrew. The problem is that Unicode texts are commonly "normalised", or converted into a standard form, for example when a text is converted to XML, which is rapidly becoming the standard format for document storage, or when it is prepared for Internet publishing. During this process certain distinctions which are considered to have no significance are lost. In particular, diacritical marks attached to the same base character are generally reordered into a standard "canonical" order, and only certain ordering distinctions are considered significant and preserved. According to Unicode as currently defined, all ordering distinctions between Hebrew vowels and other points (though not cantillation marks) are considered insignificant and are not preserved. But some of these orderings actually do have significance in biblical Hebrew, though only in a few rather rare cases.

This problem arises in the following circumstances:

1) Where there are two vowel points on a single consonant - something which occurs in the BHS Hebrew Bible text, as encoded by Michigan-Claremont and WTS, only in the following places (see also a posting I made to several lists on 24 October 2001):

(a) 636 times in the shortened form of Yerushalayim (Jerusalem), where the yod before the final mem is lost and the word ends lamed - patah or qamets - hireq - mem (including four cases of this with a directional he suffix, in which the hireq is replaced by sheva). The problem here is that normalisation will reorder the hireq or sheva before the patah or qamets.

(b) In Exodus 20:4, one of just two other places, the word MITTAXAT has a qamets and a patah under the second consonant, which will be normalised in the opposite order. This indicates alternative pronunciations of the word, related to the double accentuation of the Ten Commandments, and it seems to be unique in the whole Hebrew Bible, at least in BHS as encoded by WTS.

(c) In 2 Chronicles 13:14 there is the form *MAX:ACOC:IRYM, with sheva followed by hireq (the correct canonical order, as it happens, so this is a lesser problem). This is actually a reconstructed Ketiv form, and looks to me to be a mistake for *MAX:ACOC:RIYM, although I have heard that the WTS editors say that it is intentional, presumably to reflect irregular placement of the vowel points in the MS (but since they have apparently added the holem and moved the sheva to regularise their reconstruction, it seems odd that they did not regularise the position of the hireq as well).

There are some other cases of multiple vowels on one consonant found in printed Bibles, including BHS, but these are not included in the WTS text because they are cases of Ketiv and Qere, and WTS has not attempted to reproduce the text as written (Ketiv consonants with Qere vowels) but has provided separately the Qere with the vowels in the text and the Ketiv with a reconstructed vocalisation. (Mechon Mamre has taken a similar approach.) But it has not done this in cases of "perpetual Qere", which apparently include the short form of Yerushalayim.
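To make the reordering concrete, here is a minimal sketch (an illustration only, not part of either proposal), using Python's standard unicodedata module and the existing Unicode Hebrew code points, of what any canonical normalisation form does to the ending of the short form of Yerushalayim:

    import unicodedata

    LAMED     = "\u05DC"  # HEBREW LETTER LAMED
    FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM
    PATAH     = "\u05B7"  # HEBREW POINT PATAH, canonical combining class 17
    HIRIQ     = "\u05B4"  # HEBREW POINT HIRIQ, canonical combining class 14

    # The ending of the short form of Yerushalayim as it should be stored:
    # lamed - patah - hireq - final mem
    intended = LAMED + PATAH + HIRIQ + FINAL_MEM

    # Normalisation (NFC here, but NFD behaves the same way) sorts the marks
    # on one base character by combining class, so hireq (14) is moved in
    # front of patah (17) and the significant order of the two vowels is lost.
    normalised = unicodedata.normalize("NFC", intended)

    print([hex(ord(c)) for c in intended])    # ['0x5dc', '0x5b7', '0x5b4', '0x5dd']
    print([hex(ord(c)) for c in normalised])  # ['0x5dc', '0x5b4', '0x5b7', '0x5dd']
    print(intended == normalised)             # False - the encoded text has changed

Exactly the same mechanical reordering lies behind the other cases listed in this summary.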
Anyone who wants to represent the biblical text as actually written in Unicode etc. will come across a number of other cases of more than one vowel on a consonant, e.g. the following list (WTS encoded), which is far from exhaustive, based on an article by Haralambous at http://genepi.louis-jean.com/omega/biblical-hebrew94.pdf:

2 Samuel 22:8   WAIT:G.F(A$     (patah - hiriq)
1 Kings 9:18    W:)ET-T.A:MOR   (patah - sheva)
2 Kings 5:25    M")AIN          (patah - hiriq)
2 Kings 9:15    L:AG.YD         (sheva - patah - not hatef patah here)
Jeremiah 18:23  W:IH:YW.        (sheva - hiriq)
Ezekiel 25:9    W:QIR:YFTF:MFH  (qamets - sheva)
Ezekiel 46:19   B.AY.AR:KFTAIM  (patah - hiriq)
Daniel 2:9      HIZ::MIN:T.W.N  (sheva - sheva, Aramaic)

Not included here are some even more difficult cases, such as missing consonants at the start of a word, and words whose consonants are completely missing (Qere without Ketiv).

(There are also 583 cases of furtive patah immediately following another vowel point; patah will be normalised correctly after hireq but incorrectly before holam and qibbuts. I note that this problem is not unique to biblical Hebrew, as furtive patah is used, though not always written, in Hebrew of all periods. But in Unicode, and more generally, furtive patah is encoded after the word-final consonant although it is pronounced before it, and so the diacritic ordering problem does not arise.)

Note that in all but one of the 1228 cases listed above, including furtive patah, there is in the pronunciation an unwritten syllable break between the two vowels. Except for furtive patah, this syllable break corresponds to a consonant in the Qere but not in the Ketiv.

2) Where meteg precedes a vowel or splits a hataf vowel. The commonest position for meteg is to the left of any low vowel point, and this corresponds to the canonical ordering, which puts meteg after any vowel. But in BHS/WTS meteg occurs to the right of a low vowel about 700 times (M-C code 95), and meteg splits a hataf vowel 78 times (code 35); these numbers are likely to vary considerably in different editions of the text.

There is an additional issue, that the canonical ordering of multiple diacritics following a base character is said to be inefficient for rendering engines. This is the only justification given for the proposed separate biblical encoding of the shin and sin dots.

AN ALTERNATIVE PROPOSAL

Define a break character (either an existing Unicode character or one to be defined) as a non-spacing break character, which acts as a base character and so inhibits canonical reordering. One possibly suitable existing character is U+200D ZERO WIDTH JOINER; a better candidate in Unicode terms, as proposed by Ken Whistler on the Unicode list on 27 June (see http://www.unicode.org/mail-arch/unicode-ml/y2003-m06/0407.html), is U+034F COMBINING GRAPHEME JOINER (CGJ).

Encode the end of the short form of Jerusalem as lamed - patah/qamets - BREAK - hireq - final mem. In this case the alternative solution actually seems to work already, at least on my current Microsoft implementation, if U+200D is used for the break character.

Encode meteg to the right of a vowel as meteg - BREAK - vowel, and meteg in the middle of a hataf vowel as hataf vowel - BREAK - meteg. (The commonest case, meteg to the left of a vowel, will continue to be encoded simply as vowel - meteg.)

Note that this change of encoding can be made transparent to the user, who will not need to type the break character and will not see it displayed.
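As a minimal sketch of this alternative (again an illustration only, assuming CGJ rather than U+200D as the break character, and using bet as an arbitrary sample consonant for the meteg cases), the following shows that normalisation then leaves the significant order untouched, because canonical reordering never moves a mark across a character whose combining class is zero:

    import unicodedata

    LAMED     = "\u05DC"  # HEBREW LETTER LAMED
    FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM
    BET       = "\u05D1"  # HEBREW LETTER BET (sample consonant for the meteg cases)
    PATAH     = "\u05B7"  # combining class 17
    HIRIQ     = "\u05B4"  # combining class 14
    QAMATS    = "\u05B8"  # combining class 18
    METEG     = "\u05BD"  # HEBREW POINT METEG, combining class 22
    BREAK     = "\u034F"  # COMBINING GRAPHEME JOINER (CGJ), combining class 0

    # Short form of Jerusalem: lamed - patah - BREAK - hireq - final mem.
    # Because BREAK has combining class 0, normalisation never moves the
    # hireq across it, so the significant vowel order survives intact.
    jerusalem = LAMED + PATAH + BREAK + HIRIQ + FINAL_MEM
    assert unicodedata.normalize("NFC", jerusalem) == jerusalem
    assert unicodedata.normalize("NFD", jerusalem) == jerusalem

    # Meteg to the right of a low vowel: meteg - BREAK - vowel.
    meteg_right_of_vowel = BET + METEG + BREAK + QAMATS
    assert unicodedata.normalize("NFC", meteg_right_of_vowel) == meteg_right_of_vowel

    # The common case, meteg to the left of the vowel, stays vowel - meteg,
    # which is already the canonical order (18 before 22) and is unchanged.
    meteg_left_of_vowel = BET + QAMATS + METEG
    assert unicodedata.normalize("NFC", meteg_left_of_vowel) == meteg_left_of_vowel

If U+200D is preferred as the break character, it should behave the same way under normalisation, since it too has combining class zero, although its other effects on text processing and rendering differ.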
The effect of the break character will be only on internal processing, specifically in inhibiting the reordering which is currently specified by Unicode. Although in some sense the BREAK character is a replacement for a missing syllable-dividing consonant, it should probably not be thought of as such by users. This proposal can be varied in several ways without affecting its validity and effectiveness.

As for the alleged inefficiency of the canonical ordering, the problem here is more one of efficient implementation of the rendering engine: the overhead of sorting into the desired order a string of very rarely more than four diacritics should be trivial, and the Unicode standard implies that a rendering engine should do this sorting. In any case, such inefficiencies tend to become irrelevant very quickly as computer technology advances, more quickly than international standards can be changed.

Please let me know how you react to these two proposals and which general approach you prefer. I will pass on any comments I receive (unless requested otherwise) to those who are considering the proposal, via the Unicode list, or more personally if requested or if that seems more appropriate to me.

-- 
Peter Kirk
peter.r.kirk@ntlworld.com
http://web.onetel.net.uk/~peterkirk/