L2/03-297

Hebrew Issues - DRAFT 2003-08-24

Jony Rosenne, rosennej@qsm.co.il, August 24, 2003.

1. Background

Recently, the Unicode list has been active with discussions of problems relating to Hebrew in general and to Biblical Hebrew in particular.

I suggest that before any solutions are devised or any changes to Unicode proposed, a comprehensive list of all the issues should be prepared. This is my draft.

In the following text, the word Bible refers to the Hebrew book that is also known as the Old Testament. The term marks includes cantillation marks and other Hebrew marks.

1.1 Biblical Hebrew

The Bible says: "thou shalt not add thereto, nor diminish from it" (Dt 13.1). This has been taken to mean that the text of the Bible must be preserved exactly as it was received, with much care and great precision, including any irregularity. As a result, we have to deal with situations such as Qere and Ketiv, where the points and marks, which were added to the text centuries later, do not match the letters of the text, or where various marks are located in unusual positions.

Students of the Bible, be they Jewish or Christian, religious or secular, take great care to preserve and analyze the precise location of the various points and marks, and any other irregularities.

The text of the Bible may be considered to consist of two layers: the first layer is the unpointed letters of the text, and the second layer is the various points and marks added to the letters. The scrolls, from which the Bible is read in every synagogue, contain the letters only. Quotations in the Talmud and in other religious writings are generally unpointed. Printed Bibles or manuscript Bibles contain the letters and the points and marks.

The commandment "thou shalt not add thereto, nor diminish from it" is considered to address both the unpointed and the pointed Bible. When a scribe writes pointed text, he first writes the letters, then adds the points and marks. Sometimes they are hard to fit, so they tend to overflow towards the neighboring letters. Some marks are considered less important than others. The location of most marks is defined in relation to the base letter, although the location of some cantillation marks is defined in relation to the word as a whole.

1.2 Manuscripts

A manuscript, by its very nature, is different from a printed book. The scribe draws by hand each letter and mark. Being human, he sometimes makes mistakes, he sometimes has his own preferences, his pen sometimes slips, and generally the outcome is significantly less uniform than a printed book.

Biblical scholars are endeavoring to encode such ancient manuscripts, with all their variances, with great precision. This occasionally demands precise control over the location of the marks, beyond that which may be achieved by Unicode and beyond the scope of plain text encoding. While I appreciate and support the efforts of Biblical scholars to achieve an electronic replication of manuscripts, I believe that some of the issues that have been raised should be resolved by higher level protocols such as mark-up.

1.3 Positioning of Hebrew Points and Marks

From TUS4.0 8.1:

Positioning. Marks may combine with vowels and other points, and there are complex typographic rules for positioning these combinations.

The vertical placement (meaning above, below, or inside) of points and marks is very well defined. The horizontal placement (meaning left, right, or center) of points is also very well defined. The horizontal placement of marks, on the other hand, is not well defined, and convention allows for the different placement of marks relative to their base character.

When points and marks are located below the same base letter, the point always comes first (on the right) and the mark after it (on the left), except for the marks yetiv, U+059A HEBREW ACCENT YETIV, and dehi, U+05AD HEBREW ACCENT DEHI, which come first (on the right) and are followed (on the left) by the point.

These rules are followed when points and marks are located above the same base letter:

The first two paragraphs are correct, and it is a pity they were not left alone. I don't think it is the business of Unicode to specify these complex typographic rules. But since we started with it, we have to address a number of exceptions.

2. The Issues

2.1 Vav Holam

There are two different cases of the letter Vav with the point Holam. It could be a Holam Male, where the Vav is mute and the letter together with the point represent the vowel, or it could be the letter Vav with a Holam Haser, where the Vav is the consonant and the point is the vowel.

In fine printing these two cases are distinguished. In the first case, Holam Male, the point is top center or top right, in the second case top left. Actually, a Holam Haser on the Vav is identical in this respect to a Holam on any other letter, which is always a Holam Haser.

As an example, in Genesis 4:13 one can see them both side by side, in the words gadol avoni. The Holam on the Vav in avoni (Ayin Vav Nun Yod) is a regular Holam Haser, top left. The Holam on the Vav in gadol is part of a Holam Male, and is positioned top center or top right.

Both cases are supposed to be encoded in Unicode as Vav Holam. In order to make the visual distinction, several people have adopted various stratagems, such that together with specific font designs the desired visual effect is achieved. Some put the Holam of the Holam Male before the Vav, some suggest the use of ZWNJ, CGJ or ZWJ in various combinations.

The result is an interchange incompatibility problem. This is a plain text issue, and should be addressed by the UTC.
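
As a minimal sketch of why the two cases cannot be told apart in plain text, the following Python fragment spells the two words from Genesis 4:13 and checks that both contain the same Vav Holam sequence. The cantillation marks are left out as a simplification for this illustration, and the variable names are illustrative only.

```python
VAV, HOLAM = "\u05D5", "\u05B9"

# gadol: Gimel, Qamats, Dagesh, Dalet, Vav, Holam, Lamed (Holam Male)
gadol = "\u05D2\u05B8\u05BC\u05D3" + VAV + HOLAM + "\u05DC"
# avoni: Ayin, Hataf Patah, Vav, Holam, Nun, Hiriq, Yod (Holam Haser)
avoni = "\u05E2\u05B2" + VAV + HOLAM + "\u05E0\u05B4\u05D9"

# Both words carry the identical code point pair, so the encoding alone
# does not say whether the point belongs top right or top left.
same_sequence = (VAV + HOLAM) in gadol and (VAV + HOLAM) in avoni
print(same_sequence)
```

Any of the stratagems mentioned above changes one of these two strings, which is why text prepared under one convention does not display as intended under another.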

2.2 Holam Alef

A related problem has been raised concerning the Holam Haser followed by the letter Alef. Often, the Holam point is printed above the right hand side of the Alef. It is shifted from the top left of the preceding (to the right) letter to the top right of the Alef as a typographical convention. This is normally done when the Alef is not pronounced.

The rules concerning this case are fairly straightforward, and it is feasible for the rendering engine to figure them out.

A possible solution is to use ZWNJ to inhibit the shifting of the Holam forward in the rare cases where shifting is not wanted although it is indicated by the rules. For example, Bet Holam Dagesh ZWNJ Alef.
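
The example sequence can be written out as follows. This is only a sketch of the code point order named above; whether a rendering engine would honor the ZWNJ in this position is an assumption of the proposal, not an established behavior.

```python
import unicodedata

# Bet, Holam, Dagesh, ZWNJ, Alef, in the order given in the text.
sequence = "\u05D1\u05B9\u05BC\u200C\u05D0"

names = [unicodedata.name(ch) for ch in sequence]
for ch, name in zip(sequence, names):
    print(f"U+{ord(ch):04X} {name}")
```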

2.3 Grammar Books

In grammar books and other texts discussing the Hebrew script there may arise a need to render various marks in isolation, without a visible base character.

I understand that Unicode does provide a solution, as this problem is not unique to Hebrew. However, since the suggested invisible base character is not an RTL character, it has neutral directionality, and an RLM may be needed.
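
The directionality concern can be checked against the character database. The sketch below assumes that the invisible base is U+00A0 NO-BREAK SPACE; that choice, and the variable names, are assumptions made for this illustration.

```python
import unicodedata

NBSP = "\u00A0"   # assumed invisible base for an isolated mark
PATAH = "\u05B7"  # HEBREW POINT PATAH
RLM = "\u200F"    # RIGHT-TO-LEFT MARK
ALEF = "\u05D0"   # an ordinary Hebrew letter, for comparison

samples = [("NBSP", NBSP), ("PATAH", PATAH), ("RLM", RLM), ("ALEF", ALEF)]
bidi = {label: unicodedata.bidirectional(ch) for label, ch in samples}
print(bidi)

# An isolated Patah with an explicit RLM in front of the base.
isolated_patah = RLM + NBSP + PATAH
```

Only the Alef and the RLM report a right-to-left class, so a mark on the assumed base takes its direction from the surrounding text unless an RLM is supplied.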

2.4 Private Use Area

The private use area characters, which are not defined by Unicode in any other way, are defined to have left-to-right directionality. This prevents their use in Hebrew and Arabic.

I suggest that a small area, either in the PUA block or somewhere else, be defined as an RTL PUA.
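
The left-to-right default mentioned above can be seen by querying the first and last code points of the Basic Multilingual Plane private use area. This is a small sketch using Python's standard unicodedata module.

```python
import unicodedata

# First and last code points of the BMP private use area.
pua_samples = ["\uE000", "\uF8FF"]

classes = [(unicodedata.category(ch), unicodedata.bidirectional(ch))
           for ch in pua_samples]
print(classes)
```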

2.5 Qere and Ketiv, Yerushala(y)im

In many cases the written and the spoken text are different. It is customary to use the letters of the written text (Ketiv) with the vowels of the spoken text (Qere). When the letters and the marks are based on different texts, various problems arise. In some cases there are fewer letters in the spoken text, so it appears that some of the letters of the written text have no points, in other cases the number of letters agrees but the points are ungrammatical. In more complicated cases there are more vowels and marks than there are letters, in a few cases there are no letters at all.

A common case is the name of the city Jerusalem, Yerushala(y)im, and its derivative Yerushala(y)ma, mostly spelled in the Bible without the second Yod, the Hiriq or Shva vowel of the missing Yod squeezed between the Lamed and the Mem, sometimes giving the illusion that the Lamed has two vowels.

In general, mark-up should be used to provide two alternative texts. I don't believe it is possible or reasonable to computerize all the possibilities that are afforded the scribe when he manually places the points and marks of the Qere on a shorter Ketiv.

For simpler cases, such as Yerushala(y)im, a zero width invisible base character could be used. Various possibilities have been discussed. CGJ is not appropriate because it is not a base character. ZWNBSP would have been suitable, except that it has been taken over by the BOM.
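
The two objections can be illustrated from the character database. The sketch below only reads standard properties; it does not propose a replacement character.

```python
import codecs
import unicodedata

CGJ = "\u034F"     # COMBINING GRAPHEME JOINER
ZWNBSP = "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE

# CGJ is itself a nonspacing mark (category Mn), not a base character.
cgj_category = unicodedata.category(CGJ)

# ZWNBSP shares its code point with the byte order mark.
zwnbsp_is_bom = ZWNBSP.encode("utf-8") == codecs.BOM_UTF8

print(cgj_category, zwnbsp_is_bom)
```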

2.6 Furtive Patah

In many cases, a Patah vowel under a final Het, He or Ayin is pronounced before them, and this is indicated in fine printing by a slight shift of the Patah to the right.

Since the rules to distinguish the Furtive are simple and straightforward, i.e. this is a straightforward case of rendering, it was decided at the SII that a special character is not needed.
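
A rendering engine could apply a test along the following lines. This is a simplified sketch, assuming that the rule is exactly as stated above (a Patah on a word-final Het, He or Ayin); the function name is hypothetical and the finer grammatical conditions are not modeled.

```python
import unicodedata

PATAH = "\u05B7"
FINAL_CANDIDATES = {"\u05D4", "\u05D7", "\u05E2"}  # He, Het, Ayin

def has_furtive_patah(word: str) -> bool:
    """Return True if the last letter of word is He, Het or Ayin with a Patah."""
    # Locate the last base letter (general category Lo) in the word.
    last = max((i for i, ch in enumerate(word)
                if unicodedata.category(ch) == "Lo"), default=-1)
    if last < 0 or word[last] not in FINAL_CANDIDATES:
        return False
    # Everything after the last letter is a point or mark attached to it.
    return PATAH in word[last + 1:]

# Resh, Vav, Dagesh, Het, Patah
print(has_furtive_patah("\u05E8\u05D5\u05BC\u05D7\u05B7"))
```

When the test succeeds, the shift of the Patah to the right is a matter of glyph positioning, which is why no separate character is required.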

2.7 Meteg and Siluq

Unicode, following the SII, has unified the Meteg and the Siluq because they look the same and are easy to distinguish, as Siluq always appears before a Sof Pasuq.
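
The distinction can be sketched in code. The function below is a hypothetical illustration that assumes words are separated by spaces and that a Siluq is the last U+05BD in the word immediately followed by U+05C3 HEBREW PUNCTUATION SOF PASUQ; other word separators are not handled.

```python
METEG = "\u05BD"      # HEBREW POINT METEG, also used for Siluq
SOF_PASUQ = "\u05C3"  # HEBREW PUNCTUATION SOF PASUQ

def classify_meteg(text: str) -> list[tuple[int, str]]:
    """Label each U+05BD in text as 'siluq' or 'meteg' by its position."""
    labels = []
    for i, ch in enumerate(text):
        if ch != METEG:
            continue
        rest = text[i + 1:]
        end = rest.find(SOF_PASUQ)
        # Siluq: a Sof Pasuq follows with no space and no further
        # U+05BD in between; anything else is treated as a Meteg.
        is_siluq = (end >= 0
                    and " " not in rest[:end]
                    and METEG not in rest[:end])
        labels.append((i, "siluq" if is_siluq else "meteg"))
    return labels
```

For a two-word string in which each word carries one U+05BD and the second word ends with a Sof Pasuq, the first occurrence is labeled a Meteg and the second a Siluq.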

The standard position of both the Meteg and the Siluq is to the left of the vowel. In some cases the Meteg is written on the right hand side of the vowel. With Hataf vowels, some printers place the Meteg in the middle of the Hataf.

In some editions, the Meteg on the right indicates it was added by the editor and does not appear in the manuscript.

The medial Meteg in the Hataf vowels could be a rendering issue, a combining marks ligature. However, in this case we would need a CGNJ when a left Meteg is needed together with a Hataf.

For the right Meteg, a new character is needed. Whether it should be in the PUA or a general use Unicode character is an open question. A private convention by the editor of a single book, however important, indicates the PUA. If other uses are common, then it could be a Unicode character.

2.8 Combining Classes

When a Hebrew text is normalized according to Unicode normalization rules, the combining marks are not ordered according to the convenience of some rendering engines.

It has been stated, however, that this is not the purpose of the combining classes, and that the rendering engine should, in this case, reorder the combining marks according to its preferences as part of the rendering process.
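
The reordering can be observed with Python's standard unicodedata module. In this small sketch a Bet is entered with the Dagesh before the Patah, and normalization moves the Patah first because its combining class is lower.

```python
import unicodedata

BET, DAGESH, PATAH = "\u05D1", "\u05BC", "\u05B7"

typed = BET + DAGESH + PATAH  # Dagesh entered before the vowel
normalized = unicodedata.normalize("NFD", typed)

print([f"U+{ord(c):04X}" for c in normalized])
print(unicodedata.combining(PATAH), unicodedata.combining(DAGESH))
```

A rendering engine that prefers the Dagesh first would therefore have to reorder the marks itself when it receives normalized text.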

2.9 Inverted Nun

In the Bible there are a few cases of a special mark known as "Inverted Nun". It is probably not an inverted letter Nun, and requires its own character, HEBREW MARK INVERTED NUN.

2.10 Extraordinary Points

The SII encoded only the upper extraordinary point, as 05C4 HEBREW MARK UPPER DOT. A character for the lower dot could be added, although it appears only a few times.

Since there was a misunderstanding concerning 05C4 HEBREW MARK UPPER DOT, a note should be added to the code chart "= Upper Punctum Extraordinarium".

2.11 Broken Letters

There are in the text of the Bible a few instances of the mutilated or broken letters Vav and Qof. I suggest this could be handled by mark-up.

2.12 Number Dots

An old practice was to use dots and double dots above to distinguish "non words", such as numbers and acronyms. For several centuries this usage has been replaced by the use of Geresh and Gershayim.

The dots always appear on unpointed texts. There is nothing special about them, so the existing Unicode characters U+0307 COMBINING DOT ABOVE and U+0308 COMBINING DIAERESIS could be used.

2.13 Shva Na vs. Shva Nah

The Hebrew vowel Shva has two meanings, known as Shva Na and Shva Nah. Some printers desire to make the difference visible.

This is analogous to similar issues in other languages, for example the dual meaning of s in the English word summers, and should be handled by mark-up.

2.14 Qamats Gadol vs. Qamats Qatan

The Hebrew vowel Qamats has two meanings, known as Qamats Gadol and Qamats Qatan. Some printers desire to make the difference visible.

This is analogous to similar issues in other languages, for example the dual meaning of s in the English word summers, and should be handled by mark-up.

2.15 Vav with Dagesh vs. Shuruq

The Hebrew vowel Shuruq looks exactly like a Vav with Dagesh. Unicode, following the SII, unified them.

Some people want to see a code for the Vav Shuruq, considering it a separate vowel. Since there is no known typographical difference I see no reason to do so.

2.16 Hiriq Male

A vowel Hiriq followed by a silent Yod is called Hiriq Male.

Some people want to see a code for Hiriq Male, considering it a separate vowel. Since there is no known typographical difference I see no reason to do so.

3. References

Issues in the Representation of Pointed Hebrew in Unicode, Second draft, Peter Kirk, August 2003, http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.doc http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.html.

Meteg and combining grapheme joiner [Whistler, L2/03-235, L2/03-236; Nelson, L2/03-234].

A threat to the integrity of Hebrew [Kirk, L2/03-227].

Proposal to Encode Alternative Characters for Biblical Hebrew [Peter Constable, L2/03-195].