Re: Yerushala(y)im - or Biblical Hebrew

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jul 25 2003 - 20:39:59 EDT


    Ted continued:

    > If I recall correctly, the suggestion for using CGJ for yerushala(y)im was
    > to encode it as: <...lamed, patah, cgj, hiriq, final mem>. Also, I seem to
    > recall that this gave some people heartburn because CGJ was not intended to
    > join two combining characters. What if this case were encoded as: <...lamed,
    > patah, cgj, zwnbs, hiriq, final mem>? (Please forgive me if this is what had
    > been proposed all along.)
    >
    > As I understand it from reading the description of CGJ (and ignoring for the
    > moment that zwnbs has no visible glyph and is general category Cf), this is
    > exactly what CGJ was designed for: treat the two base characters on either
    > side of the CGJ as a single grapheme for the purpose of placing combining
    > characters. This approach uses zero width no-break space to represent the
    > "missing letter" interpretation of the two vowels pointed out by Jony
    > Rosenne. Normalization wouldn't destroy the ordering of the vowels, and
    > Hebrew-aware software could be written to do all this more-or-less
    > transparently and automatically.

    Hmm. Some further clarifications are in order, since the documentation
    for both of these characters has not quite caught up to the UTC
    decisions regarding them. A lot of work went into the Unicode 4.0
    documentation on these, and the Unicode 4.0 chapters will be posted
    online very soon -- at which point it would be helpful if everyone
    concerned about this issue took the time to read the latest on
    these characters in particular.

    First, about ZWNBS (U+FEFF). Because of the confusing overlap of
    functionality of U+FEFF as the BOM (byte order mark) in the
    Unicode encoding schemes and as what its name, ZERO WIDTH NO-BREAK
    SPACE implies, the UTC (as of Unicode 3.2) standardized a separate
    character, U+2060 WORD JOINER. That character is described
    in UAX #14, Line Breaking Properties:
    http://www.unicode.org/reports/tr14/
    U+2060 is "the preferred choice for an invisible character to keep
    other characters together that would otherwise be split across
    the line at a direct break." U+FEFF retains that semantic, for
    backwards compatibility, but its preferred use is as the byte
    order mark only.

    So whether or not a line break format control character is
    relevant to the Biblical Hebrew vowel problem (and I don't think
    it is, actually), one should be talking about use of U+2060 WORD
    JOINER (WJ), rather than U+FEFF ZWNBS in any such new context.
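
    As a quick check of the two characters discussed above, here is a
    minimal sketch that assumes Python and its standard unicodedata
    module (neither is part of the original discussion). The helper
    name describe() is made up for illustration. It reports the name
    and General Category of each character, and shows a UTF-8
    "signature" decoder treating a leading U+FEFF as a byte order mark
    rather than as text content.

```python
import unicodedata

# U+FEFF doubles as the byte order mark; U+2060 is the dedicated WORD JOINER.
ZWNBSP = "\uFEFF"
WORD_JOINER = "\u2060"


def describe(ch):
    """Return (character name, General Category) from the interpreter's Unicode data."""
    return unicodedata.name(ch), unicodedata.category(ch)


zwnbsp_info = describe(ZWNBSP)
wj_info = describe(WORD_JOINER)
print(zwnbsp_info)
print(wj_info)

# The "utf-8-sig" codec interprets a leading U+FEFF as a BOM and drops it on
# decode, which is one reason U+FEFF is a poor choice for in-text joining.
decoded = (ZWNBSP + "abc").encode("utf-8").decode("utf-8-sig")
print(repr(decoded))
```

    Both characters are format controls (Cf), so the difference lies in
    the preferred usage described above, not in their General Category.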

    Second, there is U+034F COMBINING GRAPHEME JOINER (CGJ) itself.
    The impetus for encoding the CGJ at all was to have a
    plain text means of distinguishing, for example, an "ie"
    sequence that weights as two units for collation and an "ie"
    sequence that weights as a single unit for collation.

    During the debate about such an addition, the entity was called
    various things, but the moniker "GRAPHEME JOINER" caught on
    in the committee and stuck. There was also debate about
    an equal and opposite "GRAPHEME NON-JOINER", on the principle
    that inserting a GNJ into, e.g., a "ch" weighted as a unit,
    so as to force it to be treated as two units, would be the more
    normal requirement in collation. However, the committee did
    not develop consensus that that was a required *character*,
    in part because insertion of *any* delimiting character in that
    context could be taken as having that effect or be tailored
    in collation to weight as desired to distinguish it from
    the digraphic unit, for example.

    The "COMBINING" became part of the CGJ's name when it
    became clear that the character should be given the
    General Category Mn, making it a combining mark, rather
    than General Category Cf to make it a format control.

    During this debate, high hopes were also placed on the
    COMBINING GRAPHEME JOINER as being the magic bullet for all kinds
    of things: it could "glue together" a pair of accents so
    that they would render side-by-side instead of using the
    default accent placement rules. It could also "glue together"
    sequences of characters into a "grapheme cluster", so that
    the grapheme cluster would become the target of an
    enclosing combining mark -- that would resolve the problem
    of how to get an enclosing circle to circle an arbitrary
    number, rather than just a single digit, for example.

    In the end, however, the inconsistent and troubling
    implications of this attempt at getting the Unicode
    Standard further involved in the monkey business of trying
    to be a glyph description language, rather than a character
    encoding, caused many second thoughts. And the UTC formally
    backed away from all those silver bullet aspects of CGJ.
    In Unicode 4.0, CGJ has been stripped of all interpretation
    except as an invisible mark which can be used to tailor
    collation (and searching), so as to distinguish digraphic units
    from sequences of the same characters.

    If you look at UAX #29, Text Boundaries, now, and in particular,
    Section 3, Grapheme Cluster Boundaries, you will see that
    CGJ has nothing to do with the definition of such boundaries.
    While it has the Grapheme_Link property (as do all the
    Indic viramas), Grapheme_Link is no longer even mentioned
    in UAX #29, and Grapheme_Link is nowhere else used, not even
    in a derived property.

    So the shorthand interpretation of CGJ currently is "invisible
    target for collation tailoring of neighboring characters into
    a digraphic unit." Even calling it by its formal name,
    COMBINING GRAPHEME JOINER, immediately conjures up the wrong
    connotations, so it is better to just use the CGJ acronym and
    not spell it out. Or think of CGJ as standing for "Collation kluGJe",
    if you wish. ;-)

    Now when you say:

    > If I recall correctly, the suggestion for using CGJ for yerushala(y)im was
    > to encode it as: <...lamed, patah, cgj, hiriq, final mem>. Also, I seem to
    > recall that this gave some people heartburn because CGJ was not intended to
    > join two combining characters.

    If people are getting "heartburn" because CGJ is not intended
    to join two combining characters, the problem they are having
    is the result of a misunderstanding of the intent here.

    It is *true* that the CGJ is no longer intended to "join two
    combining characters", although people tried for a while to
    see if it would work to "glue together two combining characters"
    for different rendering.

    But the point of the CGJ proposal with respect to Biblical Hebrew
    is *not* to somehow sneak back around to interpreting the CGJ
    as gluing two combining characters together. Instead, it
    turns out that the CGJ, whose interpretation has been whittled
    down to being almost nothing, has the appropriate set of
    character *properties* to serve to block canonical reordering
    of a combining character sequence. The important things are
    that it is a) invisible, b) a combining mark, and c) has
    combining class zero. To serve the purpose of blocking
    the canonical ordering, it doesn't have to *do* anything but
    just sit there with its properties as defined. It doesn't
    "join" anything, and it doesn't have anything to do with
    the "grapheme" status of the resulting sequence.
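
    The blocking effect can be sketched as follows, assuming Python's
    standard unicodedata module (an illustration only, not part of the
    proposal). The code points are the ones implied by the quoted
    sequence <lamed, patah, cgj, hiriq, final mem>, and the variable
    names are made up. Without the CGJ, canonical ordering sorts hiriq
    (the lower combining class) ahead of patah. With the CGJ between
    them, its combining class of zero splits the run of marks and the
    original order survives normalization.

```python
import unicodedata

LAMED = "\u05DC"      # HEBREW LETTER LAMED
PATAH = "\u05B7"      # HEBREW POINT PATAH
HIRIQ = "\u05B4"      # HEBREW POINT HIRIQ
FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM
CGJ = "\u034F"        # COMBINING GRAPHEME JOINER


def codepoints(text):
    """Format a string as a list of U+XXXX labels for easy comparison."""
    return ["U+%04X" % ord(ch) for ch in text]


# Canonical combining classes drive the reordering step of normalization.
classes = {name: unicodedata.combining(ch)
           for name, ch in [("patah", PATAH), ("hiriq", HIRIQ), ("cgj", CGJ)]}
print(classes)

plain = LAMED + PATAH + HIRIQ + FINAL_MEM
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM

plain_nfd = unicodedata.normalize("NFD", plain)
cgj_nfd = unicodedata.normalize("NFD", with_cgj)

print(codepoints(plain), "->", codepoints(plain_nfd))    # vowels swapped
print(codepoints(with_cgj), "->", codepoints(cgj_nfd))   # order preserved
```

    The same holds for NFC, since none of these characters take part
    in canonical composition.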

    The only other Unicode characters with those properties are
    the variation selectors, but those characters *do* have
    cooccurrence constraints that prevent them from following
    a combining mark (at least in a legally interpretable
    way). That leaves the CGJ as the *only* Unicode character
    which has the desired properties and which has no constraints
    against occurrence in the middle of a combining character
    sequence.
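
    The property overlap with the variation selectors can be checked
    with a small sketch, again assuming Python's unicodedata module.
    It only compares General Category and combining class for CGJ and
    VS1..VS16 (U+FE00..U+FE0F). Invisibility and the cooccurrence
    constraints are not exposed by that module, so they are not tested
    here. The helper name reorder_props() is made up for illustration.

```python
import unicodedata

CGJ = "\u034F"
# VARIATION SELECTOR-1 through VARIATION SELECTOR-16
VARIATION_SELECTORS = [chr(cp) for cp in range(0xFE00, 0xFE10)]


def reorder_props(ch):
    """Return the two properties that matter for canonical reordering."""
    return unicodedata.category(ch), unicodedata.combining(ch)


cgj_props = reorder_props(CGJ)
# Collect the distinct property pairs seen across all sixteen selectors.
vs_props = {reorder_props(vs) for vs in VARIATION_SELECTORS}

print("CGJ:", cgj_props)
print("variation selectors:", vs_props)
```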

    Another way of thinking of this is that in addition to CGJ
    being the "Collation kluGJe", it can be interpreted as
    the "Canonical Gradient Jigger", if we simply acknowledge
    the fact that, given its current properties, if it occurs
    in the relevant sequences of combining marks, it already
    has the effect of jiggering the canonical gradients to
    produce just the distinctions desired. ;-)

    > Of course, zwnbs is not a base character. If using zwnbs is a problem
    > (because it has no visible glyph and/or because it has category Cf), then
    > perhaps what is needed is another character (perhaps a new one) that has no
    > width or visible glyph but can be treated as a base character (category Lo).
    > That may be needed anyway, since some of the boundary definitions have
    > special rules for zwnbs.

    There is no need for an invisible base character here. That
    *would* be going further than is necessary to solve the
    problem, and would create arguments about the actual content
    of the text -- are we encoding an inherent consonant here or
    not? Why go there, when the problem is simply to represent
    the text as shown and then let commentators and phonologists
    argue about whether the yod is "really" there or not?

    > Ted
    >
    > P.S. It's two p's but only one d. :)

    Sorry. Anticipatory doubling, I guess...

    --Ken


