Re: [hebrew] Hebrew Issues

From: Peter Kirk
Date: Sat Aug 23 2003 - 18:13:43 EDT

    On 23/08/2003 13:05, Jony Rosenne wrote:

    >Dear colleagues,
    >I have posted at a draft of
    >my summary of the several issues concerning Hebrew that had been discussed
    Thank you, Jony. This seems to me a fair overview of the issues. I
    comment below on specific points, interleaved in a snipped plain text
    copy of your draft.

    > Hebrew Issues - DRAFT
    > Jony Rosenne, August 23, 2003.
    > 1. Background
    > Recently, the Unicode list has been active with discussions of
    > problems relating to Hebrew in general and to Biblical Hebrew in
    > particular.
    > I suggest that before any solutions are devised or any changes to
    > Unicode proposed, a comprehensive list of all the issues should be
    > prepared. This is my draft.
    > In the following text, the word /Bible/ refers to the Hebrew book that
    > is also known as the Old Testament. The term /marks/ includes
    > cantillation marks and other Hebrew marks.
    Perhaps it should be clarified here that Unicode distinguishes by their
    names three types of combining character in the Hebrew block:

    1. POINTS - vowel points, dagesh, meteg, rafe, shin and sin dots.
    2. ACCENTS - this is the Unicode name for cantillation marks.
    3. MARKS - masora circle, upper dot.

    I assume by "marks" you refer to groups 2 and 3 as you refer separately
    to "points".

    > ...
    > 1.2 Manuscripts
    > A manuscript, by its very nature, is different from a printed book.
    > The scribe draws by hand each letter and mark. Being human, he
    > sometimes makes mistakes, he sometimes has his own preferences, his
    > pen sometimes slips, and generally the outcome is significantly less
    > uniform than a printed book.
    > Biblical scholars are endeavoring to encode such ancient manuscripts,
    > with all their variances, with great precision. This occasionally
    > demands precise control over the location of the marks, beyond that
    > which may be achieved by Unicode and beyond the scope of plain text
    > encoding. While I appreciate and support the efforts of Biblical
    > scholars to achieve an electronic replication of manuscripts, I
    > believe that some of the issues that have been raised should be
    > resolved by higher level protocols such as mark-up.
    I think here we need to distinguish between systematic and accidental
    variances. I agree that accidental variances should not be encoded in
    Unicode, and that systematic ones where there is contextual consistency
    should be handled by fonts etc rather than Unicode. I have been looking
    primarily at differences found not in manuscripts but in scholarly
    printed editions, in which the accidental variability of manuscripts has
    been eliminated and the systematic distinctions have been represented
    according to a consensus of respected textual scholars. I have been
    concerned to identify the systematic distinctions which these scholars
    have considered significant enough to distinguish in the printed text,
    and to suggest means of encoding and rendering these systematic
    distinctions in Unicode. See my document referred to below.

    > 1.3 Positioning of Hebrew Points and Marks
    > From TUS4.0 8.1:
    > ...
    > The first two paragraphs are correct, and it is a pity they were not
    > left alone. I don't think it is the business of Unicode to specify
    > these complex typographic rules. But since we started with it, we have
    > to address a number of exceptions.
    Agreed. Or we could omit this section from the next version of the
    standard and move these rules, expanded, into a technical report or
    note, where the exceptions can be addressed in detail.

    > 2. The Issues
    > 2.1 Vav Holam
    > ...
    > The result is an interchange incompatibility problem. This is a plain
    > text issue, and should be addressed by the UTC.

    > 2.2 Holam Alef
    > A related problem has been raised concerning the Holam Haser followed
    > by the letter Alef. Often, the Holam point is printed above the right
    > hand side of the Alef. It is shifted from the top left of the
    > preceding (to the right) letter to the top right of the Alef as a
    > typographical convention. This is normally done when the Alef is not
    > pronounced.
    > Although the rules concerning this case are fairly straightforward,
    > the rendering engine should not need to know so much grammar.
    I'm a little surprised, Jony, that you came to this conclusion. It seems
    to me that this one is a rendering issue. You have argued before that in
    most typesetting this shift is not made. It has been demonstrated (in
    Ezra SIL and SBL Hebrew with Uniscribe) that it is feasible for a
    rendering engine to implement these rules, in the cases where this shift
    is required for high quality e.g. biblical publications. The biblical
    text already contains sufficient information to guide the rendering
    engine, except possibly for a few special cases, and in the spirit of
    "thou shalt not add thereto" I prefer not to do so when, as here, it is
    not absolutely necessary.

    > A possible solution is to use ZWJ to indicate the shifting of the
    > Holam forward. For example, Bet Dagesh Holam ZWJ Alef.
    Agreed, if a mechanism is required. My preference is to use this
    encoding only for special cases where the shift takes place as an
    exception to the regular rules, and to use ZWNJ instead of ZWJ to
    inhibit such shifting in cases where it is not required.

    By the way, your example is not in canonical order (although it is in
    logical order, see my comments on 2.8 below), and will be reordered to
    <bet, holam, dagesh, ZWJ, alef>.
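
    The reordering can be checked with any normalizer. The following is a
    minimal sketch, assuming Python's standard unicodedata module; the
    variable names are only illustrative.

```python
import unicodedata

BET, DAGESH, HOLAM, ALEF = "\u05D1", "\u05BC", "\u05B9", "\u05D0"
ZWJ = "\u200D"

# Order as written in the example in 2.2: Bet Dagesh Holam ZWJ Alef.
typed = BET + DAGESH + HOLAM + ZWJ + ALEF

# Canonical ordering sorts a run of combining marks by combining class.
# Holam has class 19 and dagesh class 21, so holam moves ahead of dagesh.
normalized = unicodedata.normalize("NFD", typed)

for ch in normalized:
    print(f"U+{ord(ch):04X}", unicodedata.combining(ch), unicodedata.name(ch))
```

    ZWJ has combining class 0, so it stays in place between the marks and
    the alef.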

    > 2.3 Grammar Books
    > In grammar books and other texts discussing the Hebrew script there
    > may arise a need to render various marks in isolation, without a
    > visible base character.
    > I understand that Unicode does provide a solution, as this problem is
    > not unique to Hebrew. However, since the suggested invisible base
    > character is not an RTL character, it has neutral directionality, and
    > an RLM may be needed.
    Agreed. Use of RLM has the extra advantage of inhibiting unwanted
    contraction of multiple spaces by higher level protocols.

    > 2.4 Private Use Area
    > The private use area characters, which are not defined by Unicode in
    > any other way, are defined to have left-to-right directionality. This
    > prevents their use in Hebrew and Arabic.
    > I suggest that a small area, either in the PUA block or somewhere
    > else, be defined as an RTL PUA.
    Good idea! Or would it be adequate to suggest that RLM be inserted
    before each PUA character? Would that make them right-to-left?
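
    As a starting point for that question, the bidirectional classes
    involved can be inspected directly. This is a minimal sketch, assuming
    Python's standard unicodedata module; it only reads character
    properties and does not run the bidirectional algorithm, so it does not
    by itself show how a PUA character preceded by RLM would be displayed.

```python
import unicodedata

# U+E000 stands in for any character in the BMP private use area.
samples = {
    "PUA U+E000": "\uE000",
    "RLM": "\u200F",
    "ALEF": "\u05D0",
}

# Map each label to its Bidi_Class value ("L" = strong left-to-right,
# "R" = strong right-to-left).
bidi = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, cls in bidi.items():
    print(label, cls)
```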

    > 2.5 Qere and Ketiv, Yerushala(y)im
    > ...
    > In general, mark-up should be used to provide two alternative texts. I
    > don't believe it is possible or reasonable to computerize all the
    > possibilities that are afforded the scribe when he manually places the
    > points and marks of the Qere on a shorter Ketiv.
    I think this is reasonable. At least Unicode fonts should not be
    expected to render such things correctly. But I can see that some will
    want to try to encode the mixed text form as it appears on the page. One
    way to do so would be to use the sequence <RLM, NBSP> as a base
    character around which the Qere points and marks can be arranged. (RLM
    is necessary here to ensure correct directionality.) But Unicode should
    not expect to guarantee correct rendering. And there is no need to
    specify this in the standard.

    > For simpler cases, such as Yerushala(y)im, a zero width invisible base
    > character could be used. Various possibilities had been discussed. CGJ
    > is not appropriate because it is not a base character. ZWNBSP would
    > have been suitable, except that it has been taken over by the BOM.
    I fail to see a good reason not to use CGJ in such a case. The Unicode
    distinction between a base character and a combining character is a
    technical one which does not need to align perfectly with every user's

    The exceptional case in Exodus 20:4 of two points under one base
    character where there is no omitted letter can also be dealt with well
    using CGJ.

    > 2.6 Furtive Patah
    > In many cases, a Patah vowel under a final Het, Alef or He is
    > pronounced before them, and this is indicated in fine printing by a
    > slight shift of the Patah to the right.
    > Since the rules to distinguish the Furtive are simple and
    > straightforward, i.e. this is a straightforward case of rendering, it
    > was decided at the SII that a special character is not needed.
    Agreed. This is a rendering issue.

    > 2.7 Meteg and Siluq
    > Unicode, following the SII, has unified the Meteg and the Siluq
    > because they look the same and are easy to distinguish, as Siluq
    > always appears before a Sof Pasuq.
    > The standard position of both the Meteg and the Siluq is to the left
    > of the vowel. In some cases the Meteg is written on the right hand side
    > of the vowel. With Hataf vowels, some printers place the Meteg in the
    > middle of the Hataf.
    Not just printers, this appears in MSS as well.

    > In some editions, the Meteg on the right indicates it was added by the
    > editor and does not appear in the manuscript.
    But in other cases it does appear in the manuscript. BHS, the standard
    scholarly edition in western countries, follows the Leningrad codex in
    meteg positioning. See for example the attached from this codex, Genesis
    8:6 (taken from
    There are also several right metegs visible in the extract from a Lisbon
    codex of 1492 at,
    in the repeated "and there was evening and there was morning" in Genesis
    1 - interestingly, more of them than there are in BHS, but not all
    metegs are to the right.

    > The medial Meteg in the Hataf vowels could be a rendering issue, a
    > combining marks ligature. However, in this case we would need a CGNJ
    > when a left Meteg is needed together with a Hataf.
    In the absence of a CGNJ, and since CGJ does not have defined joining
    properties despite its misleading name, I have suggested using CGJ for this.

    > For the right Meteg, a new character is needed. Whether it should be
    > in the PUA or a general use Unicode is open. A private convention by
    > the editor of a single book, however important, indicates the PUA. If
    > other uses are common, then it could be a Unicode character.
    This is not a matter of a single book. I have identified three Bible
    editions (BHS, BHK, and Baer as reported by GKC (Gesenius, Kautzsch,
    Cowley) 16g) and two manuscripts which use right meteg as a distinctive
    positioning. Anyway, I would have concerns about the principle "A
    private convention by the editor of a single book, however important,
    indicates the PUA" in a case where electronic editions of this book are
    expected to be used and quoted by a worldwide community of thousands and
    extensively on the Internet, in domains where interchange of PUA
    characters has not been agreed.

    But I disagree that a new character is needed. This is essentially an
    alternative positioning of the same combining character relative to
    other combining characters with which it interferes typographically.
    This should have been dealt with by appropriate allocation of combining
    classes. As it was not, the appropriate mechanism seems to be to use CGJ
    to inhibit canonical reordering. Thus my suggestion (= indicates
    canonical equivalence):

    left meteg (non-hataf vowel): <vowel, meteg> = <meteg, vowel>
    right meteg: <meteg, CGJ, vowel>
    medial meteg (hataf vowel): <vowel, meteg> = <meteg, vowel>
    left meteg (hataf vowel): <vowel, CGJ, meteg>
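
    The effect of CGJ on these four sequences can be verified with a
    normalizer. This is a minimal sketch, assuming Python's standard
    unicodedata module, with patah and hataf patah chosen arbitrarily as
    the example vowels.

```python
import unicodedata

METEG, CGJ = "\u05BD", "\u034F"
PATAH, HATAF_PATAH = "\u05B7", "\u05B2"

def nfd(s):
    """Return the canonical decomposition (with canonical ordering) of s."""
    return unicodedata.normalize("NFD", s)

# The four proposed encodings.
left_meteg = PATAH + METEG                     # same as METEG + PATAH
right_meteg = METEG + CGJ + PATAH
medial_meteg = HATAF_PATAH + METEG             # same as METEG + HATAF_PATAH
left_meteg_hataf = HATAF_PATAH + CGJ + METEG

# Without CGJ, either typing order normalizes to <vowel, meteg>.
print(nfd(METEG + PATAH) == left_meteg)
print(nfd(METEG + HATAF_PATAH) == medial_meteg)

# CGJ has combining class 0, so it blocks reordering across it and the
# marked sequences survive normalization unchanged.
print(nfd(right_meteg) == right_meteg)
print(nfd(left_meteg_hataf) == left_meteg_hataf)
```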

    > 2.8 Combining Classes
    > When a Hebrew text is normalized according to Unicode normalization
    > rules, the combining marks are not ordered according to the
    > convenience of some rendering engines.
    > It has been stated, however, that this is not the purpose of the
    > combining classes, and that the rendering engine should, in this case,
    > reorder the combining marks according to its preferences as part of
    > the rendering process.
    Agreed, but this is only part of the story. Different combining classes
    have been assigned to points which do interfere typographically, and
    this is causing several problems. Also the canonical ordering is
    illogical e.g. consonant modifiers (sin/shin dot, dagesh, rafe) are in
    canonical order separated from the consonants they modify by the vowels
    which logically follow; it is not the order used instinctively in typing
    or by Jony in writing the example in 2.2 above. This causes problems
    with collation which can only be fixed by defining hundreds of contractions.

    But I understand that it is not possible to fix the errors which were
    originally made in defining these classes.
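
    To illustrate the separation of consonant modifiers from their
    consonant, here is a minimal sketch, assuming Python's standard
    unicodedata module. The example syllable (shin with shin dot, dagesh
    and qamats) is chosen only for illustration.

```python
import unicodedata

SHIN, SHIN_DOT, DAGESH, RAFE, QAMATS = (
    "\u05E9", "\u05C1", "\u05BC", "\u05BF", "\u05B8"
)

# Combining classes of the marks involved: the vowel sorts before all
# three consonant modifiers.
classes = {
    unicodedata.name(ch): unicodedata.combining(ch)
    for ch in (QAMATS, DAGESH, RAFE, SHIN_DOT)
}
print(classes)

# Logical (typing) order: consonant, its modifiers, then the vowel.
logical = SHIN + SHIN_DOT + DAGESH + QAMATS

# Canonical order puts the vowel between the consonant and its modifiers.
canonical = unicodedata.normalize("NFD", logical)
print([unicodedata.name(ch) for ch in canonical])
```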

    > 2.9 Inverted Nun
    > In the Bible there are a few cases of a special mark known as
    > "Inverted Nun". It is probably not an inverted letter Nun, and
    > requires its own character, HEBREW MARK INVERTED NUN.

    > 2.10 Extraordinary Points
    > The SII encoded only the upper extraordinary point, as 05C4 HEBREW
    > MARK UPPER DOT. A character for the lower dot could be added, although
    > it appears only a few times.
    Agreed. Although this latter character is rare, it is in regular and
    undisputed use in a widely used text, and so probably does need to be

    > 2.11 Broken Letters
    > There are in the text of the Bible a few instances of the mutilated or
    > broken letters Vav and Qof. I suggest this could be handled by mark-up.
    Perhaps. The problem is that known mark-up languages have as far as I
    know no mechanisms for handling requests for variant glyphs. But Unicode
    does have such a mechanism, variation selectors. This could be a case
    where it would be suitable to use them.

    > 2.12 Number Dots
    > An old practice was to use dots and double dots above to distinguish
    > "non words", such as numbers and acronyms. For several centuries this
    > usage has been replaced by the use of Geresh and Gershayim.
    > The dots always appear on unpointed texts. There is nothing special
    > about them, so the existing Unicodes 0307 and 0308 could be used.

    > 2.13 Shva Na vs. Shva Nah
    > The Hebrew vowel Shva has two meanings, known as Shva Na and Shva Nah.
    > Some printers desire to make the difference visible.
    > This is analogous to similar issues in other languages, for example
    > the dual meaning of s in the English word summers, and should be
    > handled by mark-up.
    It seems to me that this is more analogous to the diacritics added to
    English words in some dictionaries etc to indicate and disambiguate
    their pronunciation, which can be encoded in Unicode. And again this is
    not something which any known mark-up can handle. So, at least if this
    is at all a regular practice and the glyphs used are at all
    standardised, a good case can be made for encoding a second separate
    combining character here, or possibly using a variation selector. If it
    is not at all standardised, "A private convention by the editor of a
    single book ... indicates the PUA."

    > 2.14 Qamats Gadol vs. Qamats Qatan
    > The Hebrew vowel Qamats has two meanings, known as Qamats Gadol and
    > Qamats Qatan. Some printers desire to make the difference visible.
    > This is analogous to similar issues in other languages, for example
    > the dual meaning of s in the English word summers, and should be
    > handled by mark-up.
    Same comment as on 2.13.

    > 2.15 Vav with Dagesh vs. Shuruq
    > The Hebrew vowel Shuruq looks exactly like a Vav with Dagesh. Unicode,
    > following the SII, unified them.
    > Some people want to see a code for the Vav Shuruq, considering it a
    > separate vowel. Since there is no known typographical difference I see
    > no reason to do so.
    I agree. But according to GKC p.55 note 2 there should actually be a
    typographical difference: "/Wāw/ with /Dageš/ (וּ) cannot in our printed
    texts be distinguished from /wāw/ pointed as /Šûrĕq/ (וּ); in the latter
    case the point should stand higher up."

    > 2.16 Hiriq Male
    > A vowel Hiriq followed by a silent Yod is called Hiriq Male.
    > Some people want to see a code for Hiriq Male, considering it a
    > separate vowel. Since there is no known typographical difference I see
    > no reason to do so.

    > 3. References
    > Issues in the Representation of Pointed Hebrew in Unicode, Second
    > draft, Peter Kirk, August 2003,
    I intend to make some minor updates to this, which I will post at the
    same location and perhaps also as a PDF.

    > ...
    You have not mentioned the following issue which I identified - the
    following is an extract from my document:

    > 2.6. Punctuation issues
    > Certain Hebrew punctuation marks are not correctly described in
    > Unicode 4.0.
    > /Sof pasuq/ is used to indicate the end of a verse in the Hebrew Bible
    > (although it is missing from the end of a few verses in some texts,
    > and completely absent from some others) and as the equivalent of a
    > full stop in other Hebrew writings such as prayer books. It should be
    > classed and processed as Terminal_Punctuation and also as a character
    > which typically terminates a sentence.
    > /Paseq/ is also used only at the ends of words, and so should also be
    > classed as Terminal_Punctuation, but not as terminating a sentence.
    > /Paseq/ has two uses, one as part of the Hebrew accent system and the
    > other as a special textual mark in the Hebrew Bible; it is normally
    > found only in the Hebrew Bible and in quotations from it.
    > /Maqaf/ is also generally considered to be a word divider and so
    > should also be classed as Terminal_Punctuation. As its usage is
    > analogous to that of /hyphen/ and line breaks commonly occur after it
    > in pointed Hebrew texts, it should also be listed in Unicode Standard
    > Annex #14, along with /hyphen/, as a “break opportunity after”.
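
    The behaviour proposed in this extract can be sketched in a few lines.
    This is only an illustration, assuming Python's standard unicodedata
    module; the helper names split_verses and break_after_positions are
    hypothetical, the sample string is a made-up run of unpointed letters,
    and the code implements neither UAX #14 nor any Unicode property table.

```python
import unicodedata

SOF_PASUQ, PASEQ, MAQAF = "\u05C3", "\u05C0", "\u05BE"

# General categories of the three marks, as a quick property check.
categories = {unicodedata.name(ch): unicodedata.category(ch)
              for ch in (SOF_PASUQ, PASEQ, MAQAF)}
print(categories)

def split_verses(text):
    """Treat sof pasuq as a sentence terminator and split on it."""
    return [part.strip() for part in text.split(SOF_PASUQ) if part.strip()]

def break_after_positions(text):
    """Indices just after each maqaf or space: candidate line breaks."""
    return [i + 1 for i, ch in enumerate(text) if ch in (MAQAF, " ")]

# Illustrative sample: two "verses", the first containing a maqaf.
sample = "\u05D0\u05D1" + MAQAF + "\u05D2\u05D3" + SOF_PASUQ + " \u05D4\u05D5" + SOF_PASUQ
print(split_verses(sample))
print(break_after_positions(sample))
```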

    I hope the above will help you in revising and completing your draft

    Peter Kirk
