Re: Yerushala(y)im - or Biblical Hebrew

From: Karljürgen Feuerherm (cuneiform@rogers.com)
Date: Sat Jul 26 2003 - 09:13:58 EDT

    I believe this to be the wrong outlook.

    The real situation is that the real text has no Yod--deliberately so from a
    Masoretic standpoint. No invisible Yod should be inserted to 'emend' the
    text.

    (Note that I am not making a pietistic argument; I'm not the least bit
    pietistic, though I suspect there are Biblical scholars who would take that
    view. I'm simply making a text faithfulness argument.)

    K
    ----- Original Message -----
    From: "Jony Rosenne" <rosennej@qsm.co.il>
    To: <unicode@unicode.org>
    Sent: Saturday, July 26, 2003 2:24 AM
    Subject: RE: Yerushala(y)im - or Biblical Hebrew

    > This explanation makes me unhappy with CGJ.
    >
    > Ken says: "The important things are that it is a) invisible, b) a
    > combining mark, and c) has combining class zero".
    >
    > And: "There is no need for an invisible base character here".
    >
    > On the contrary, to represent the text we do need an invisible base
    > character for the Hiriq, representing the unwritten Yod.
    >
    > Another possibility is to encode the Yod with a complex text (in the
    > meaning non plain text) control saying the Yod is invisible.
    >
    > I think it is important, whatever solution is chosen, to represent the
    > real situation, rather than just a sequence of codes that happens to be
    > able to produce the desired visual output.
    >
    > Jony
    >
    > > -----Original Message-----
    > > From: unicode-bounce@unicode.org
    > > [mailto:unicode-bounce@unicode.org] On Behalf Of Kenneth Whistler
    > > Sent: Saturday, July 26, 2003 2:40 AM
    > > To: ted@newslate.com
    > > Cc: unicode@unicode.org; kenw@sybase.com
    > > Subject: Re: Yerushala(y)im - or Biblical Hebrew
    > >
    > >
    > > Ted continued:
    > >
    > > > If I recall correctly, the suggestion for using CGJ for
    > > > yerushala(y)im was to encode it as: <...lamed, patah, cgj, hiriq,
    > > > final mem>. Also, I seem to recall that this gave some people
    > > > heartburn because CGJ was not intended to join two combining
    > > > characters. What if this case were encoded as: <...lamed, patah,
    > > > cgj, zwnbs, hiriq, final mem>? (Please forgive me if this is what
    > > > had been proposed all along.)
    > > >
    > > > As I understand it from reading the description of CGJ (and
    > > > ignoring for the moment that zwnbs has no visible glyph and is
    > > > general category Cf), this is exactly what CGJ was designed for:
    > > > treat the two base characters on either side of the CGJ as a
    > > > single grapheme for the purpose of placing combining characters.
    > > > This approach uses zero width no-break space to represent the
    > > > "missing letter" interpretation of the two vowels pointed out by
    > > > Jony Rosenne. Normalization wouldn't destroy the ordering of the
    > > > vowels, and Hebrew-aware software could be written to do all this
    > > > more-or-less transparently and automatically.
    > >
    > > Hmm. Some further clarifications are in order, since the
    > > documentation for both of these characters has not quite
    > > caught up to the UTC decisions regarding them. A lot of work
    > > went into the Unicode 4.0 documentation on these, and the
    > > Unicode 4.0 chapters will be posted online very soon -- at
    > > which point it would be helpful if everyone concerned about
    > > this issue takes the time to read the latest on these
    > > characters in particular.
    > >
    > > First, about ZWNBS (U+FEFF). Because of the confusing overlap
    > > of functionality of U+FEFF as the BOM (byte order mark) in
    > > the Unicode encoding schemes and as what its name, ZERO WIDTH
    > > NO-BREAK SPACE implies, the UTC (as of Unicode 3.2)
    > > standardized a separate character, U+2060 WORD JOINER. That
    > > character is described in UAX #14, Line Breaking Properties:
    > > http://www.unicode.org/reports/tr14/
    > > U+2060 is "the preferred choice for an invisible character to keep
    > > other characters together that would otherwise be split
    > > across the line at a direct break." U+FEFF retains that
    > > semantic, for backwards compatibility, but its preferred use
    > > is as the byte order mark only.
    > >
    > > So whether or not a line break format control character is
    > > relevant to the Biblical Hebrew vowel problem (and I don't
    > > think it is, actually), one should be talking about use of
    > > U+2060 WORD JOINER (WJ), rather than U+FEFF ZWNBS in any such
    > > new context.
    > >
    > > Second, there is U+034F COMBINING GRAPHEME JOINER (CGJ)
    > > itself. The impetus for encoding the CGJ at all was to have a
    > > plain text means of distinguishing, for example, an "ie"
    > > sequence that weights as two units for collation and an "ie"
    > > sequence that weights as a single unit for collation.
    > >
    > > During the debate about such an addition, the entity was
    > > called various things, but the moniker "GRAPHEME JOINER"
    > > caught on in the committee and stuck. There was also debate
    > > about an equal and opposite "GRAPHEME NON-JOINER", on the
    > > principle that inserting a GNJ into, e.g., a "ch" weighted
    > > as a unit, so as to force it to be treated as two units, would
    > > be the more normal requirement in collation. However, the
    > > committee did not develop consensus that that was a required
    > > *character*, in part because insertion of *any* delimiting
    > > character in that context could be taken as having that
    > > effect or be tailored in collation to weight as desired to
    > > distinguish it from the digraphic unit, for example.
    > >
    > > The "COMBINING" became part of the CGJ's name when it
    > > became clear that the character should be given the
    > > General Category Mn, making it a combining mark, rather
    > > than General Category Cf to make it a format control.
    > >
    > > During this debate, high hopes were also placed on the
    > > COMBINING GRAPHEME JOINER as being the magic bullet for all
    > > kinds of things: it could "glue together" a pair of accents
    > > so that they would render side-by-side instead of using the
    > > default accent placement rules. It could also "glue together"
    > > sequences of characters into a "grapheme cluster", so that
    > > the grapheme cluster would become the target of an enclosing
    > > combining mark -- that would resolve the problem of how to
    > > get an enclosing circle to circle an arbitrary number, rather
    > > than just a single digit, for example.
    > >
    > > In the end, however, the inconsistent and troubling
    > > implications of this attempt at getting the Unicode
    > > Standard further involved in the monkey business of trying
    > > to be a glyph description language, rather than a character
    > > encoding, caused many second thoughts. And the UTC formally
    > > backed away from all those silver bullet aspects of CGJ. In
    > > Unicode 4.0, CGJ has been stripped of all interpretation
    > > except as an invisible mark which can be used to tailor
    > > collation (and searching), so as to distinguish digraphic
    > > units from sequences of the same characters.
    > >
    > > If you look at UAX #29, Text Boundaries, now, and in
    > > particular, Section 3, Grapheme Cluster Boundaries, you will
    > > see that CGJ has nothing to do with the definition of such
    > > boundaries. While it has the Grapheme_Link property (as do
    > > all the Indic viramas), Grapheme_Link is no longer even
    > > mentioned in UAX #29, and Grapheme_Link is nowhere else used,
    > > not even in a derived property.
    > >
    > > So the shorthand interpretation of CGJ currently is
    > > "invisible target for collation tailoring of neighboring
    > > characters into a digraphic unit." Even calling it by its
    > > formal name, COMBINING GRAPHEME JOINER, immediately conjures
    > > up the wrong connotations, so it is better to just use the
    > > CGJ acronym and not spell it out. Or think of CGJ as standing
    > > for "Collation kluGJe", if you wish. ;-)
    > >
    > > Now when you say:
    > >
    > > > If I recall correctly, the suggestion for using CGJ for
    > > > yerushala(y)im was to encode it as: <...lamed, patah, cgj, hiriq,
    > > > final mem>. Also, I seem to recall that this gave some people
    > > > heartburn because CGJ was not intended to join two combining
    > > > characters.
    > >
    > > If people are getting "heartburn" because CGJ is not intended
    > > to join two combining characters, the problem they are having
    > > is the result of a misunderstanding of the intent here.
    > >
    > > It is *true* that the CGJ is no longer intended to "join two
    > > combining characters", although people tried for a while to
    > > see if it would work to "glue together two combining
    > > characters" for different rendering.
    > >
    > > But the point of the CGJ proposal with respect to Biblical
    > > Hebrew is *not* to somehow sneak back around to interpreting
    > > the CGJ as gluing two combining characters together. Instead,
    > > it turns out that the CGJ, whose interpretation has been
    > > whittled down to being almost nothing, has the appropriate
    > > set of character *properties* to serve to block canonical
    > > reordering of a combining character sequence. The important
    > > things are that it is a) invisible, b) a combining mark, and
    > > c) has combining class zero. To serve the purpose of blocking
    > > the canonical ordering, it doesn't have to *do* anything but
    > > just sit there with its properties as defined. It doesn't
    > > "join" anything, and it doesn't have anything to do with the
    > > "grapheme" status of the resulting sequence.
    > >
    > > The only other Unicode characters with those properties are
    > > the variation selectors, but those characters *do* have
    > > cooccurrence constraints that prevent them from following a
    > > combining mark (at least in a legally interpretable way).
    > > That leaves the CGJ as the *only* Unicode character which has
    > > the desired properties and which has no constraints against
    > > occurrence in the middle of a combining character sequence.
    > >
    > > Another way of thinking of this is that in addition to CGJ
    > > being the "Collation kluGJe", it can be interpreted as the
    > > "Canonical Gradient Jigger", if we simply acknowledge the
    > > fact that, given its current properties, if it occurs in the
    > > relevant sequences of combining marks, it already has the
    > > effect of jiggering the canonical gradients to produce just
    > > the distinctions desired. ;-)
    > >
    > > > Of course, zwnbs is not a base character. If using zwnbs is a
    > > > problem (because it has no visible glyph and/or because it has
    > > > category Cf), then perhaps what is needed is another character
    > > > (perhaps a new one) that has no width or visible glyph but can be
    > > > treated as a base character (category Lo). That may be needed
    > > > anyway, since some of the boundary definitions have special rules
    > > > for zwnbs.
    > >
    > > There is no need for an invisible base character here. That
    > > *would* be going further than is necessary to solve the
    > > problem, and would create arguments about the actual content
    > > of the text -- are we encoding an inherent consonant here or
    > > not? Why go there, when the problem is simply to represent
    > > the text as shown and then let commentators and phonologists
    > > argue about whether the yod is "really" there or not?
    > >
    > > > Ted
    > > >
    > > > P.S. It's two p's but only one d. :)
    > >
    > > Sorry. Anticipatory doubling, I guess...
    > >
    > > --Ken
    > >
    > >
    > >
    > >
    > >
    >
    >
    >
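
Editorial illustration (not part of the original thread): Ken's point that CGJ blocks canonical reordering simply by having combining class zero can be checked with a minimal Python sketch. It assumes only the standard-library unicodedata module. The variable and function names are hypothetical, and the code points are the standard ones for the characters named in the thread (lamed, patah, hiriq, final mem, CGJ). It is a sketch of the mechanism, not a recommendation on how Biblical Hebrew text should be encoded.

```python
import unicodedata

# Code points for the characters named in the thread.
LAMED = "\u05DC"      # HEBREW LETTER LAMED (base character)
PATAH = "\u05B7"      # HEBREW POINT PATAH
HIRIQ = "\u05B4"      # HEBREW POINT HIRIQ
FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM
CGJ = "\u034F"        # COMBINING GRAPHEME JOINER


def properties(ch):
    """Return (General Category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


def code_points(text):
    """Render a string as a list of U+XXXX labels for easy comparison."""
    return ["U+%04X" % ord(ch) for ch in text]


# The two vowel points in their textual order (patah first), without and with CGJ.
plain = LAMED + PATAH + HIRIQ + FINAL_MEM
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM

# Canonical reordering sorts adjacent marks with non-zero combining classes,
# but never moves a mark across a character whose combining class is zero.
nfd_plain = unicodedata.normalize("NFD", plain)
nfd_with_cgj = unicodedata.normalize("NFD", with_cgj)

if __name__ == "__main__":
    for label, ch in (("patah", PATAH), ("hiriq", HIRIQ), ("cgj", CGJ)):
        print(label, properties(ch))
    print("plain    ->", code_points(nfd_plain))
    print("with CGJ ->", code_points(nfd_with_cgj))
```

Without CGJ, normalization swaps the two points, because hiriq has a lower combining class than patah. With CGJ between them the textual order survives, which is the only behaviour the proposal relies on.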



    This archive was generated by hypermail 2.1.5 : Sat Jul 26 2003 - 10:15:08 EDT