Re: Biblical Hebrew (U+034F Combining Grapheme Joiner works)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jun 27 2003 - 05:46:56 EDT

    On Friday, June 27, 2003 3:54 AM, Kenneth Whistler <kenw@sybase.com> wrote:

    > John,
    >
    > > At 03:36 PM 6/26/2003, Kenneth Whistler wrote:
    > >
    > > > Why is making use of the existing behavior of existing characters
    > > > a "groanable kludge", if it has the desired effect and makes
    > > > the required distinctions in text? If there is not some
    > > > rendering system or font lookup showstopper here, I'm inclined
    > > > to think it's a rather elegant way out of the problem.
    > >
    > > I think assumptions about not breaking combining mark sequences
    > > may, in fact, be a showstopper. If <base+mark+mark> becomes
    > > <base+mark+CtrlChar+mark>, it is reasonable to think that this will
    > > not only inhibit mark re-ordering but also mark combining and mark
    > > interaction. Unfortunately, this seems to be the case with every
    > > control character I have been able to test, using two different
    > > rendering engines (Uniscribe and InDesign ME -- although the latter
    > > already has some problems with double marks in Biblical Hebrew).
    > > Perhaps we should have a specific COMBINING MARK SEQUENCE CONTROL
    > > character?
    >
    > Actually, in casting around for the solution to the problem of
    > introduction of format controls creating defective combining
    > character sequences, it finally occurred to me that:
    > U+034F COMBINING GRAPHEME JOINER
    > has the requisite properties.
    >
    > It is non-visible, does not affect the display of neighboring
    > characters (except incidentally, if processes choose to recognize
    > sequences containing it and process them distinctly), *AND*
    > it is a *combining mark*, not a format control.
    >
    > Hence, the sequence:
    > <lamed, patah, CGJ, hiriq>
    > 0 17 0 14
    > is *not* a defective combining character sequence, by the
    > definitions in the standard. The entire sequence of three
    > combining marks would have to "apply" to the lamed, but
    > the fact that CGJ has (cc=0) prevents the patah from reordering
    > around the hiriq under normalization.
    > Could this finally be the missing "killer app" for the CGJ?
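
    The reordering described in the quoted text can be checked directly.
    What follows is only a minimal sketch, assuming Python's standard
    unicodedata module (chosen here for illustration, not something used
    in this thread); the names plain, joined and survives are made up
    for the example.

```python
import unicodedata

LAMED, PATAH, HIRIQ, CGJ = "\u05DC", "\u05B7", "\u05B4", "\u034F"

# Canonical combining classes as reported by the Unicode Character Database.
ccc = {name: unicodedata.combining(ch)
       for name, ch in [("patah", PATAH), ("hiriq", HIRIQ), ("CGJ", CGJ)]}

plain = LAMED + PATAH + HIRIQ          # no CGJ between the two points
joined = LAMED + PATAH + CGJ + HIRIQ   # CGJ (cc=0) between the two points


def survives(text):
    """True if the text is unchanged by all four normalization forms."""
    return all(unicodedata.normalize(form, text) == text
               for form in ("NFC", "NFD", "NFKC", "NFKD"))


# Canonical ordering sorts adjacent marks with non-zero classes by class,
# so the point with the lower class moves ahead of the other one.
reordered = unicodedata.normalize("NFC", plain)

print(ccc)
print([hex(ord(ch)) for ch in reordered])
print(survives(plain), survives(joined))
```

    If the character data matches the standard, plain should come back
    as <lamed, hiriq, patah>, while joined should pass through every
    normalization form unchanged.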

    That would be perfect for letting an application such as XML encode
    Hebrew text under Unicode 4.0 rules (and earlier).

    It is not difficult to correct existing (already encoded) Biblical Hebrew
    text automatically by inserting this character, so that the text can be
    processed with existing technologies that expect an NF* form or that
    create one as an intermediate step during processing.

    Combining Grapheme Joiner is still a hint that can be inserted within
    any sequence to create a ligature between two characters normally
    considered (semantically) separate. This means that any font or
    renderer that cannot find a glyph for the ligated form will ignore the
    CGJ.

    The only new thing is that, until now, CGJ occurred only before a base
    character; it was not intended to occur before other combining
    characters, even though it is a combining character itself.

    I do think that CGJ has combining class 0 only so that it is not
    moved within the combining sequence when an NF* normalization is
    applied, precisely because moving it would void its effect on the
    selection of a ligature if it no longer sat immediately before the
    base character that follows it.

    So CGJ acts in many places as a variant selector for the next base
    character, which is modified to ligate with the previous grapheme
    cluster. Other characters in other scripts behave similarly and
    modify the glyph that follows them: Anusvaras, Visargas, Nuktas,
    except that they are not considered combining characters because
    they can occur at the beginning of the encoding of the combining
    sequence...

    Is CGJ correctly interpreted with UCA collation rules and keys? This
    would be important, as Biblical texts are very often used with plain
    text searches.

    > If CGJ is the one, then the only *real* implementation requirement
    > would be that CGJ be consistently inserted (for Biblical Hebrew)
    > between any pair of points applied to the same consonant. Depending
    > on the particular application, this could either be hidden behind the
    > input method/keyboard and be actively managed by the software, or
    > it could be applied as a filter to an export format, when exporting
    > to contexts that might neutralize intended contrasts or result in
    > the wrong display by the application of normalization.
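
    The export filter described in the quoted text could look roughly like
    the following. This is a sketch under assumptions, not a reference
    implementation: it uses Python's standard unicodedata module, the
    function names are hypothetical, and "point" is approximated as any
    character in U+0591..U+05C7 with a non-zero combining class.

```python
import unicodedata

CGJ = "\u034F"


def is_hebrew_point(ch):
    """Approximation: a Hebrew-block character with a non-zero class."""
    return 0x0591 <= ord(ch) <= 0x05C7 and unicodedata.combining(ch) != 0


def insert_cgj(text):
    """Insert CGJ between every two adjacent Hebrew points."""
    out = []
    prev_was_point = False
    for ch in text:
        point = is_hebrew_point(ch)
        if point and prev_was_point:
            out.append(CGJ)
        out.append(ch)
        # CGJ itself has class 0, so text that already carries it
        # is left alone on a second pass.
        prev_was_point = point
    return "".join(out)


word = "\u05DC\u05B7\u05B4"            # lamed, patah, hiriq
protected = insert_cgj(word)
print([hex(ord(ch)) for ch in protected])
print(unicodedata.normalize("NFC", protected) == protected)
```

    Such a filter only preserves the order it is given; it cannot recover
    an order that an earlier normalization step has already changed.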

    For now, it's a good "trick" for transporting these texts safely. It will
    require some work on Hebrew fonts so that they ignore the CGJ character
    when looking up combining sequences, but normally a font does not have to
    be constrained by a normalization process, which should be performed
    earlier, in input methods or during the transport of text, and not at the
    final rendering step.

    So the "superfluous" CGJ should simply be removed, when appropriate,
    by the font rendering engine (Uniscribe on Windows) when no specific
    glyph is available for the N-to-1 sequence.
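
    As an illustration of that fallback only: the sketch below models a
    font as a plain set of strings it has glyphs for. The font_glyphs
    set and the glyph_keys function are hypothetical and do not
    correspond to Uniscribe or to any real font API.

```python
CGJ = "\u034F"


def glyph_keys(cluster, font_glyphs):
    """Return the lookup keys a renderer might use for one cluster.

    If the font has a single glyph for the whole CGJ-containing
    sequence, use it; otherwise drop the CGJ and look up each
    remaining character on its own.
    """
    if cluster in font_glyphs:
        return [cluster]
    return [ch for ch in cluster if ch != CGJ]


cluster = "o" + CGJ + "e"
with_ligature = glyph_keys(cluster, {"o", "e", cluster})
without_ligature = glyph_keys(cluster, {"o", "e"})
print(with_ligature, without_ligature)
```

    The first call keeps the sequence whole because the toy font lists a
    glyph for it; the second falls back to the separate letters.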

    The current use of CGJ is for sequences like:
    <b>+<o, CGJ>+<e>+<u>+<f> and <e>+<f, CGJ>+<f>+<e>+<t>
    which still encode the French words "boeuf" and "effet", where the author
    gives a hint to display the sequence "oe" (or "ff") as a single ligated
    form instead of two separate grapheme clusters, even though two separate
    clusters is the "classic" semantic of such a sequence, including for
    collation/sorting. Its use is thus typographic, and it enhances
    readability.

    In fact, if you look at the "oe" or "ae" ligatures in French, they are never
    handled as single letters as they are in other languages (that's why even
    the French keyboard does not include them, as this ligature is only an
    implicit "correct" typography). This means that the "ae" ligature, in
    French, can always be safely decomposed as <a, CGJ>+<e> in a font renderer
    without losing any semantic, and if the ligature is not present in the
    selected font, the CGJ will be ignored and the individual glyphs for <a>
    and <e> will be selectable.

    The same would be true for the German eszett ligature, decomposable
    (not canonically, but effectively for German use) as <long s, CGJ>+<s>,
    and then displayable as <long s>+<s>, or as <s>+<s> if the glyph for
    <long s> is also missing in a font, considering that <long s> is a glyph
    variant of the same letter <s> which is preencoded in Unicode instead
    of requiring some <VS1, s> combining sequence.

    For Biblical Hebrew, the extended use of multiple cantillation marks
    on the same base consonant works effectively as an ordered ligature
    rather than as unordered combining marks, which were normally
    introduced in Unicode/ISO 10646 only to avoid an explosion of the
    repertoire with every possible accented letter, whose interpretation
    varies a lot across languages.


