RE: Yerushala(y)im - or Biblical Hebrew

From: Jony Rosenne (
Date: Sat Jul 26 2003 - 02:24:55 EDT

    This explanation makes me unhappy with CGJ.

    Ken says: "The important things are that it is a) invisible, b) a combining
    mark, and c) has combining class zero".

    And: "There is no need for an invisible base character here".

    On the contrary, to represent the text we do need an invisible base
    character for the Hiriq, representing the unwritten Yod.

    Another possibility is to encode the Yod together with a complex-text
    (that is, non-plain-text) control indicating that the Yod is invisible.

    I think it is important, whatever solution is chosen, to represent the real
    situation, rather than just a sequence of codes that happens to be able to
    produce the desired visual output.


    > -----Original Message-----
    > From:
    > [] On Behalf Of Kenneth Whistler
    > Sent: Saturday, July 26, 2003 2:40 AM
    > To:
    > Cc:;
    > Subject: Re: Yerushala(y)im - or Biblical Hebrew
    >
    > Ted continued:
    > > If I recall correctly, the suggestion for using CGJ for yerushala(y)im
    > > was to encode it as: <...lamed, patah, cgj, hiriq, final mem>. Also, I
    > > seem to recall that this gave some people heartburn because CGJ was
    > > not intended to join two combining characters. What if this case were
    > > encoded as: <...lamed, patah, cgj, zwnbs, hiriq, final mem>? (Please
    > > forgive me if this is what had been proposed all along.)
    > >
    > > As I understand it from reading the description of CGJ (and ignoring
    > > for the moment that zwnbs has no visible glyph and is general category
    > > Cf), this is exactly what CGJ was designed for: treat the two base
    > > characters on either side of the CGJ as a single grapheme for the
    > > purpose of placing combining characters. This approach uses zero width
    > > no-break space to represent the "missing letter" interpretation of the
    > > two vowels pointed out by Jony Rosenne. Normalization wouldn't destroy
    > > the ordering of the vowels, and Hebrew-aware software could be written
    > > to do all this more-or-less transparently and automatically.
    >
    > Hmm. Some further clarifications are in order, since the
    > documentation for both of these characters has not quite
    > caught up to the UTC decisions regarding them. A lot of work
    > went into the Unicode 4.0 documentation on these, and the
    > Unicode 4.0 chapters will be posted online very soon -- at
    > which point it would be helpful if everyone concerned about
    > this issue takes the time to read the latest on these
    > characters in particular.
    >
    > First, about ZWNBS (U+FEFF). Because of the confusing overlap
    > of functionality of U+FEFF as the BOM (byte order mark) in
    > the Unicode encoding schemes and as what its name, ZERO WIDTH
    > NO-BREAK SPACE implies, the UTC (as of Unicode 3.2)
    > standardized a separate character, U+2060 WORD JOINER. That
    > character is described in UAX #14, Line Breaking Properties:
    > U+2060 is "the preferred choice for an invisible character to keep
    > other characters together that would otherwise be split
    > across the line at a direct break." U+FEFF retains that
    > semantic, for backwards compatibility, but its preferred use
    > is as the byte order mark only.
    >
    > So whether or not a line break format control character is
    > relevant to the Biblical Hebrew vowel problem (and I don't
    > think it is, actually), one should be talking about use of
    > U+2060 WORD JOINER (WJ), rather than U+FEFF ZWNBS in any such
    > new context.
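
The dual role of U+FEFF described above can be seen from Python's standard library. This is a minimal illustrative sketch, not part of the original message; the variable names are made up for the example, and the reported properties depend on the Unicode data bundled with the Python build in use.

```python
import codecs
import unicodedata

ZWNBSP = "\uFEFF"       # ZERO WIDTH NO-BREAK SPACE, also the byte order mark
WORD_JOINER = "\u2060"  # WORD JOINER, the preferred "keep together" character

# Both are format controls (General Category Cf), not base characters.
categories = {unicodedata.name(c): unicodedata.category(c)
              for c in (ZWNBSP, WORD_JOINER)}

# Encoded as UTF-8, U+FEFF is exactly the UTF-8 byte order mark.
bom_bytes = ZWNBSP.encode("utf-8")

# A BOM-aware decoder strips a leading U+FEFF; a plain decoder keeps it
# as an ordinary (invisible) character at the start of the text.
data = codecs.BOM_UTF8 + "abc".encode("utf-8")
stripped = data.decode("utf-8-sig")
kept = data.decode("utf-8")

print(categories)
print(bom_bytes == codecs.BOM_UTF8, len(stripped), len(kept))
```

Because a leading U+FEFF may be silently consumed as a signature, using it inside text as a "real" character is fragile, which is consistent with the preference for U+2060 stated above.
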
    > Second, there is U+034F COMBINING GRAPHEME JOINER (CGJ)
    > itself. The impetus for encoding the CGJ at all was to have a
    > plain text means of distinguishing, for example, an "ie"
    > sequence that weights as two units for collation and an "ie"
    > sequence that weights as a single unit for collation.
    >
    > During the debate about such an addition, the entity was
    > called various things, but the moniker "GRAPHEME JOINER"
    > caught on in the committee and stuck. There was also debate
    > about an equal and opposite "GRAPHEME NON-JOINER" (GNJ), on the
    > principle that inserting a GNJ into, e.g., a "ch" weighted as a
    > unit, so as to force it to be treated as two units, would be the
    > more normal requirement in collation. However, the
    > committee did not develop consensus that that was a required
    > *character*, in part because insertion of *any* delimiting
    > character in that context could be taken as having that
    > effect or be tailored in collation to weight as desired to
    > distinguish it from the digraphic unit, for example.
    >
    > The "COMBINING" became part of the CGJ's name when it
    > became clear that the character should be given the
    > General Category Mn, making it a combining mark, rather
    > than General Category Cf to make it a format control.
    >
    > During this debate, high hopes were also placed on the
    > COMBINING GRAPHEME JOINER as being the magic bullet for all
    > kinds of things: it could "glue together" a pair of accents
    > so that they would render side-by-side instead of using the
    > default accent placement rules. It could also "glue together"
    > sequences of characters into a "grapheme cluster", so that
    > the grapheme cluster would become the target of an enclosing
    > combining mark -- that would resolve the problem of how to
    > get an enclosing circle to circle an arbitrary number, rather
    > than just a single digit, for example.
    >
    > In the end, however, the inconsistent and troubling
    > implications of this attempt at getting the Unicode
    > Standard further involved in the monkey business of trying
    > to be a glyph description language, rather than a character
    > encoding, caused many second thoughts. And the UTC formally
    > backed away from all those silver bullet aspects of CGJ. In
    > Unicode 4.0, CGJ has been stripped of all interpretation
    > except as an invisible mark which can be used to tailor
    > collation (and searching), so as to distinguish digraphic
    > units from sequences of the same characters.
    >
    > If you look at UAX #29, Text Boundaries, now, and in
    > particular, Section 3, Grapheme Cluster Boundaries, you will
    > see that CGJ has nothing to do with the definition of such
    > boundaries. While it has the Grapheme_Link property (as do
    > all the Indic viramas), Grapheme_Link is no longer even
    > mentioned in UAX #29, and Grapheme_Link is nowhere else used,
    > not even in a derived property.
    >
    > So the shorthand interpretation of CGJ currently is
    > "invisible target for collation tailoring of neighboring
    > characters into a digraphic unit." Even calling it by its
    > formal name, COMBINING GRAPHEME JOINER, immediately conjures
    > up the wrong connotations, so it is better to just use the
    > CGJ acronym and not spell it out. Or think of CGJ as standing
    > for "Collation kluGJe", if you wish. ;-)
    >
    > Now when you say:
    > > If I recall correctly, the suggestion for using CGJ for yerushala(y)im
    > > was to encode it as: <...lamed, patah, cgj, hiriq, final mem>. Also, I
    > > seem to recall that this gave some people heartburn because CGJ was
    > > not intended to join two combining characters.
    >
    > If people are getting "heartburn" because CGJ is not intended
    > to join two combining characters, the problem they are having
    > is the result of a misunderstanding of the intent here.
    >
    > It is *true* that the CGJ is no longer intended to "join two
    > combining characters", although people tried for a while to
    > see if it would work to "glue together two combining
    > characters" for different rendering.
    >
    > But the point of the CGJ proposal with respect to Biblical
    > Hebrew is *not* to somehow sneak back around to interpreting
    > the CGJ as gluing two combining characters together. Instead,
    > it turns out that the CGJ, whose interpretation has been
    > whittled down to being almost nothing, has the appropriate
    > set of character *properties* to serve to block canonical
    > reordering of a combining character sequence. The important
    > things are that it is a) invisible, b) a combining mark, and
    > c) has combining class zero. To serve the purpose of blocking
    > the canonical ordering, it doesn't have to *do* anything but
    > just sit there with its properties as defined. It doesn't
    > "join" anything, and it doesn't have anything to do with the
    > "grapheme" status of the resulting sequence.
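
The blocking effect described above can be checked with Python's `unicodedata` module. This is a minimal sketch added for illustration, not part of the original message; the constant and variable names are hypothetical, and the sequence is only the tail of the word as quoted in the thread (<...lamed, patah, cgj, hiriq, final mem>).

```python
import unicodedata

LAMED = "\u05DC"      # HEBREW LETTER LAMED (base character)
PATAH = "\u05B7"      # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"      # HEBREW POINT HIRIQ, combining class 14
FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM (base character)
CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, combining class 0

def codepoints(text):
    """Format a string as a list of U+XXXX labels for easy comparison."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Intended order of the two vowels: patah first, then hiriq.
plain = LAMED + PATAH + HIRIQ + FINAL_MEM
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM

# Canonical ordering sorts adjacent marks with non-zero combining classes
# into ascending order, so hiriq (14) is moved in front of patah (17).
plain_nfc = unicodedata.normalize("NFC", plain)

# A character with combining class 0 between the two marks stops the
# reordering, so the patah-hiriq order survives normalization.
cgj_nfc = unicodedata.normalize("NFC", with_cgj)

print(codepoints(plain_nfc))
print(codepoints(cgj_nfc))
```

The CGJ does nothing active in this sketch; the normalization algorithm simply never swaps marks across a character whose combining class is zero.
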
    > The only other Unicode characters with those properties are
    > the variation selectors, but those characters *do* have
    > cooccurrence constraints that prevent them from following a
    > combining mark (at least in a legally interpretable way).
    > That leaves the CGJ as the *only* Unicode character which has
    > the desired properties and which has no constraints against
    > occurrence in the middle of a combining character sequence.
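
Two of the three properties named above (combining mark, combining class zero) can be looked up directly. This is an illustrative sketch rather than part of the original message: the labels in `CANDIDATES` are made up for the example, and "invisible" and the co-occurrence constraints on variation selectors are not exposed by `unicodedata`, so they are not checked here.

```python
import unicodedata

# Hypothetical short labels mapped to the characters discussed in the thread.
CANDIDATES = {
    "CGJ": "\u034F",    # COMBINING GRAPHEME JOINER
    "VS1": "\uFE00",    # VARIATION SELECTOR-1
    "WJ": "\u2060",     # WORD JOINER
    "HIRIQ": "\u05B4",  # HEBREW POINT HIRIQ
    "PATAH": "\u05B7",  # HEBREW POINT PATAH
}

def properties(ch):
    """Return (General Category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)

table = {label: properties(ch) for label, ch in CANDIDATES.items()}
for label, (gc, ccc) in table.items():
    print(f"{label:6} gc={gc} ccc={ccc}")
```

In this lookup the CGJ and the variation selector share category Mn with class 0, the word joiner is a format control (Cf), and the two Hebrew points carry the non-zero classes that make them subject to canonical reordering.
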
    > Another way of thinking of this is that in addition to CGJ
    > being the "Collation kluGJe", it can be interpreted as the
    > "Canonical Gradient Jigger", if we simply acknowledge the
    > fact that, given its current properties, if it occurs in the
    > relevant sequences of combining marks, it already has the
    > effect of jiggering the canonical gradients to produce just
    > the distinctions desired. ;-)
    >
    > > Of course, zwnbs is not a base character. If using zwnbs is a problem
    > > (because it has no visible glyph and/or because it has category Cf),
    > > then perhaps what is needed is another character (perhaps a new one)
    > > that has no width or visible glyph but can be treated as a base
    > > character (category Lo). That may be needed anyway, since some of the
    > > boundary definitions have special rules for zwnbs.
    >
    > There is no need for an invisible base character here. That
    > *would* be going further than is necessary to solve the
    > problem, and would create arguments about the actual content
    > of the text -- are we encoding an inherent consonant here or
    > not? Why go there, when the problem is simply to represent
    > the text as shown and then let commentators and phonologists
    > argue about whether the yod is "really" there or not?
    >
    > > Ted
    > >
    > > P.S. It's two p's but only one d. :)
    > Sorry. Anticipatory doubling, I guess...
    > --Ken

    This archive was generated by hypermail 2.1.5 : Sat Jul 26 2003 - 02:17:55 EDT