L2/03-227

Date: Fri, 18 Jul 2003 12:26:18 -0700 (PDT)
Subject: A threat to the integrity of the Hebrew language?

Rick, I think this contribution by Peter Kirk should be given an L2 document number and be added to the agenda for the discussion of SIL's proposal for Hebrew. There is a very significant analysis here, with details about actual occurrences of the problem, and strong support for the alternative position that I had advocated for this problem. --Ken

------------- Begin Forwarded Message -------------

Date: Fri, 18 Jul 2003 08:49:06 -0700
To: Biblical Hebrew, Hebrew Computing list, Biblical Languages

No, I am not making allegations of another anti-Semitic or anti-Israel plot. There are no Hamans to be unmasked and hanged. The people involved have the best of intentions: to provide proper computer support for biblical Hebrew. But what they propose has the effect of requiring biblical Hebrew to be encoded on computers in a different and incompatible way from modern Hebrew. That is, they are proposing to erect an artificial barrier between biblical and modern Hebrew. The proposal will also leave existing Hebrew software unable to handle the biblical text. But I hope the subject line caught your attention for this important issue.

As this proposal has powerful backing, including from Microsoft, there is a real chance that it might be accepted, although it has also run into some criticism. It is likely to be considered formally at a meeting in late August. Some of those who will be involved in the discussions have specifically asked for more input, especially from Israeli and Jewish scholars. So if any of you have any comments to make, either for or against this proposal, please speak up now. I will pass on any comments I receive (unless requested otherwise) to those who are considering the proposal, via the Unicode list, or more personally if requested or if that seems more appropriate to me. I can withhold names if requested, but attributable comments will be given greater weight.

(I am sending this to several mailing lists and by blind copy to a number of individual Hebrew experts with whom I have contact, plus a few others who are likely to be interested. Apologies if anyone receives unwelcome multiple copies.)

THE PROPOSAL

The proposal I am referring to has been made to the technical committee of the Unicode Consortium (www.unicode.org), which defines character encoding standards for every language of the world; these standards also become official ISO international standards and are used in most recent software products and on many Internet pages. The proposal is likely to be considered formally at the next meeting of this committee in late August. In the last month there has been some critical discussion of it, and some alternative suggestions, on the Unicode list; see http://www.unicode.org/consortium/distlist.html.

The proposal itself, "Proposal to Encode Alternative Characters for Biblical Hebrew", can be found at http://scripts.sil.org/cms/sites/nrsi/media/BibHebAltCharsProposal.pdf, and is of course rather technical. So I have written a less technical summary of the proposal and the justification for it, and included it below.

The essence of the proposal is to define a partially separate set of Unicode characters for biblical Hebrew, in addition to the existing codes for Hebrew consonants, vowel points (nikud), other points and cantillation marks (teamim). The proposed new characters are BH versions of the eleven vowel points and of meteg/silluq, shin dot and sin dot.
The intention is that biblical texts should be encoded with these new vowels and points, and that the existing vowel and point encodings should be used only for pointed texts in modern (or at least non-biblical) Hebrew, Yiddish etc. The encoding of consonants, other points and accents is not affected.

The proposers claim that there are no alternative solutions, but this is not true. There are at least two other ways, each with some variations, in which these problems could be solved. The first is to correct the inadequacies in the existing Unicode specification; this has unfortunately been ruled out very definitely, although the grounds for doing so are quite spurious (it is claimed that a change would invalidate existing text, but the only text which would be invalidated is existing biblical text, which will be far more seriously invalidated by the newly proposed changes). But there is a second alternative way, which would require much less adjustment to existing text. I have outlined this below, after the less technical summary.

PROBABLE CONSEQUENCES

Here are some probable consequences if the proposal is accepted and the encoding of biblical Hebrew texts is adjusted accordingly:

On the positive side: It will be possible for biblical Hebrew text to be input, stored, processed and displayed completely, accurately and efficiently on computer-based systems and on the Internet. Some existing minor problems with this will be ironed out.

On the negative side: Most existing software and operating systems which currently input, process and display both modern and biblical Hebrew (though with some minor problems) will become unable to input, process or properly display biblical Hebrew at all. Some such software may be fixable by providing new fonts, keyboard drivers etc., but there is no guarantee of this. Much of it (e.g. Israeli Windows 98 and anything else which depends on CP1255) will almost certainly not be fixed. Similarly, existing fonts will not support biblical Hebrew. Software and font providers may be unwilling, or very slow, to upgrade their products to meet the new standard.

Newly encoded biblical texts will be stored in a form incompatible with non-biblical Hebrew texts, and with biblical texts already encoded according to the current standard. This will confuse information search and retrieval systems, and confuse users, who will not understand why the same searches do not find the same words in biblical and in modern texts.

It is unclear which encoding system should be used for biblical passages quoted in more recent Hebrew texts. In such cases, at least in the absence of very clear rules, documents are likely to be produced in which the two different systems are mixed in an unpredictable way, and which therefore cannot be searched reliably. It is also unclear which system should be used for ancient but non-biblical Hebrew and Aramaic texts.

LESS TECHNICAL SUMMARY OF THE PROPOSAL

This is a less technical summary of "Proposal to Encode Alternative Characters for Biblical Hebrew", http://scripts.sil.org/cms/sites/nrsi/media/BibHebAltCharsProposal.pdf.

The essence of the proposal is to define a partially separate set of Unicode characters for biblical Hebrew, in addition to the existing codes for Hebrew consonants, vowel points (nikud), other points and cantillation marks (teamim). The proposed new characters are BH versions of the eleven vowel points and of meteg/silluq, shin dot and sin dot.
The intention is that biblical texts should be encoded with these new vowels and points, and that the existing vowel and point encodings should be used only for pointed texts in modern (or at least non-biblical) Hebrew, Yiddish etc. The encoding of consonants, other points and accents is not affected.

The justification for this is that the current Unicode standard is in some minor ways inadequate for the representation of biblical Hebrew. The problem is that Unicode texts are commonly "normalised", or converted into a standard form, for example when a text is converted to XML, which is rapidly becoming the standard format for document storage, or when it is prepared for Internet publishing. During this process certain distinctions which are considered to have no significance are lost. In particular, diacritical marks attached to the same base character are generally reordered into a standard "canonical" order, and only certain ordering distinctions are considered significant and preserved. According to Unicode as currently defined, all ordering distinctions between Hebrew vowels and other points (though not cantillation marks) are considered insignificant and are not preserved. But some of these orderings actually do have significance in biblical Hebrew, though only in a few rather rare cases.

This problem arises in the following circumstances:

1) Where there are two vowel points on a single consonant - something which occurs in the BHS Hebrew Bible text, as encoded by Michigan-Claremont and WTS, only in the following places (see also a posting I made to several lists on 24 October 2001):

(a) 636 times in the shortened form of Yerushalayim (Jerusalem), where the yod before the final mem is lost and the word ends lamed - patah or qamets - hireq - mem (including four cases of this with a directional he suffix, in which the hireq is replaced by sheva). The problem here is that normalisation will reorder the hireq or sheva before the patah or qamets.

(b) In Exodus 20:4, one of just two other places, the word MITTAXAT has a qamets and a patah under the second consonant, which will be normalised in the opposite order. This indicates alternative pronunciations of the word, related to the double accentuation of the Ten Commandments, and it seems to be unique in the whole Hebrew Bible, at least in BHS as encoded by WTS.

(c) In 2 Chronicles 13:14 there is the form *MAX:ACOC:IRYM, with sheva followed by hireq (the correct canonical order, as it happens, so this is a lesser problem). This is actually a reconstructed Ketiv form, and looks to me to be a mistake for *MAX:ACOC:RIYM, although I have heard that the WTS editors say that it is intentional, presumably to reflect irregular placement of the vowel points in the MS (but since they have apparently added the holem and moved the sheva to regularise their reconstruction, it seems odd that they did not regularise the position of the hireq as well).

There are some other cases of multiple vowels on one consonant found in printed Bibles, including BHS, but these are not included in the WTS text because they are cases of Ketiv and Qere, and WTS has not attempted to reproduce the text as written (Ketiv consonants with Qere vowels) but has provided separately the Qere with the vowels in the text and the Ketiv with a reconstructed vocalisation. (Mechon Mamre has taken a similar approach.) But it has not done this in cases of "perpetual Qere", which apparently include the short form of Yerushalayim.
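To make the reordering concrete, here is a minimal sketch (an illustration only, not part of either proposal), using Python's standard unicodedata module and the existing Unicode Hebrew code points, of what any canonical normalisation form does to the ending of the short form of Yerushalayim:

    import unicodedata

    LAMED     = "\u05DC"  # HEBREW LETTER LAMED
    FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM
    PATAH     = "\u05B7"  # HEBREW POINT PATAH, canonical combining class 17
    HIRIQ     = "\u05B4"  # HEBREW POINT HIRIQ, canonical combining class 14

    # The ending of the short form of Yerushalayim as it should be stored:
    # lamed - patah - hireq - final mem
    intended = LAMED + PATAH + HIRIQ + FINAL_MEM

    # Normalisation (NFC here, but NFD behaves the same way) sorts the marks
    # on one base character by combining class, so hireq (14) is moved in
    # front of patah (17) and the significant order of the two vowels is lost.
    normalised = unicodedata.normalize("NFC", intended)

    print([hex(ord(c)) for c in intended])    # ['0x5dc', '0x5b7', '0x5b4', '0x5dd']
    print([hex(ord(c)) for c in normalised])  # ['0x5dc', '0x5b4', '0x5b7', '0x5dd']
    print(intended == normalised)             # False - the encoded text has changed

Exactly the same mechanical reordering lies behind the other cases listed in this summary.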
Anyone who wants to represent the biblical text as actually written in Unicode etc. will come across a number of other cases of more than one vowel on a consonant, e.g. the following list (WTS encoded), which is far from exhaustive, based on an article by Haralambous at http://genepi.louis-jean.com/omega/biblical-hebrew94.pdf:

2 Samuel 22:8   WAIT:G.F(A$     (patah - hiriq)
1 Kings 9:18    W:)ET-T.A:MOR   (patah - sheva)
2 Kings 5:25    M")AIN          (patah - hiriq)
2 Kings 9:15    L:AG.YD         (sheva - patah - not hatef patah here)
Jeremiah 18:23  W:IH:YW.        (sheva - hiriq)
Ezekiel 25:9    W:QIR:YFTF:MFH  (qamets - sheva)
Ezekiel 46:19   B.AY.AR:KFTAIM  (patah - hiriq)
Daniel 2:9      HIZ::MIN:T.W.N  (sheva - sheva, Aramaic)

Not included here are some even more difficult cases, such as missing consonants at the start of a word, and words whose consonants are completely missing (Qere without Ketiv).

(There are also 583 cases of furtive patah immediately following another vowel point; patah will be normalised correctly after hireq but incorrectly before holam and qibbuts. I note that this problem is not unique to biblical Hebrew, as furtive patah is used, though not always written, in Hebrew of all periods. But in Unicode, and more generally, furtive patah is encoded after the word-final consonant although it is pronounced before it, and so the diacritic ordering problem does not arise.)

Note that in all but one of the 1228 cases listed above, including furtive patah, there is in the pronunciation an unwritten syllable break between the two vowels. Except for furtive patah, this syllable break corresponds to a consonant in the Qere but not in the Ketiv.

2) Where meteg precedes a vowel or splits a hataf vowel. The commonest position for meteg is to the left of any low vowel point, and this corresponds to the canonical ordering, which puts meteg after any vowel. But in BHS/WTS meteg occurs to the right of a low vowel about 700 times (M-C code 95), and meteg splits a hataf vowel 78 times (code 35); these numbers are likely to vary considerably in different editions of the text.

There is an additional issue, that the canonical ordering of multiple diacritics following a base character is said to be inefficient for rendering engines. This is the only justification given for the proposed separate biblical encoding of the shin and sin dots.

AN ALTERNATIVE PROPOSAL

Define a break character (either an existing Unicode character or one to be defined) as a non-spacing break character, which acts as a base character and so inhibits canonical reordering. One possibly suitable existing character is U+200D ZERO WIDTH JOINER; a better candidate in Unicode terms, as proposed by Ken Whistler on the Unicode list on 27 June (see http://www.unicode.org/mail-arch/unicode-ml/y2003-m06/0407.html), is U+034F COMBINING GRAPHEME JOINER (CGJ).

Encode the end of the short form of Jerusalem as lamed - patah/qamets - BREAK - hireq - final mem. In this case the alternative solution actually seems to work already, at least on my current Microsoft implementation, if U+200D is used for the break character.

Encode meteg to the right of a vowel as meteg - BREAK - vowel, and meteg in the middle of a hataf vowel as hataf vowel - BREAK - meteg. (The commonest case, meteg to the left of a vowel, will continue to be encoded simply as vowel - meteg.)

Note that this change of encoding can be made transparent to the user, who will not need to type the break character and will not see it displayed.
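As a minimal sketch of this alternative (again an illustration only, assuming CGJ rather than U+200D as the break character, and using bet as an arbitrary sample consonant for the meteg cases), the following shows that normalisation then leaves the significant order untouched, because canonical reordering never moves a mark across a character whose combining class is zero:

    import unicodedata

    LAMED     = "\u05DC"  # HEBREW LETTER LAMED
    FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM
    BET       = "\u05D1"  # HEBREW LETTER BET (sample consonant for the meteg cases)
    PATAH     = "\u05B7"  # combining class 17
    HIRIQ     = "\u05B4"  # combining class 14
    QAMATS    = "\u05B8"  # combining class 18
    METEG     = "\u05BD"  # HEBREW POINT METEG, combining class 22
    BREAK     = "\u034F"  # COMBINING GRAPHEME JOINER (CGJ), combining class 0

    # Short form of Jerusalem: lamed - patah - BREAK - hireq - final mem.
    # Because BREAK has combining class 0, normalisation never moves the
    # hireq across it, so the significant vowel order survives intact.
    jerusalem = LAMED + PATAH + BREAK + HIRIQ + FINAL_MEM
    assert unicodedata.normalize("NFC", jerusalem) == jerusalem
    assert unicodedata.normalize("NFD", jerusalem) == jerusalem

    # Meteg to the right of a low vowel: meteg - BREAK - vowel.
    meteg_right_of_vowel = BET + METEG + BREAK + QAMATS
    assert unicodedata.normalize("NFC", meteg_right_of_vowel) == meteg_right_of_vowel

    # The common case, meteg to the left of the vowel, stays vowel - meteg,
    # which is already the canonical order (18 before 22) and is unchanged.
    meteg_left_of_vowel = BET + QAMATS + METEG
    assert unicodedata.normalize("NFC", meteg_left_of_vowel) == meteg_left_of_vowel

If U+200D is preferred as the break character, it should behave the same way under normalisation, since it too has combining class zero, although its other effects on text processing and rendering differ.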
The effect of the break character will be only on internal processing, specifically in inhibiting the reordering which is currently specified by Unicode. Although in some sense the BREAK character is a replacement for a missing syllable-dividing consonant, it should probably not be thought of as such by users. This proposal can be varied in several ways without affecting its validity and effectiveness.

As for the alleged inefficiency of the canonical ordering, the problem here is more one of efficient implementation of the rendering engine: the overhead of sorting into the desired order a string of very rarely more than four diacritics should be trivial, and the Unicode standard implies that a rendering engine should do this sorting. In any case, such inefficiencies tend to become irrelevant very quickly as computer technology advances, more quickly than international standards can be changed.

Please let me know how you react to these two proposals and which general approach you prefer. I will pass on any comments I receive (unless requested otherwise) to those who are considering the proposal, via the Unicode list, or more personally if requested or if that seems more appropriate to me.

-- 
Peter Kirk
peter.r.kirk@ntlworld.com
http://web.onetel.net.uk/~peterkirk/