From: Philippe Verdy (firstname.lastname@example.org)
Date: Fri Jun 27 2003 - 05:46:56 EDT
On Friday, June 27, 2003 3:54 AM, Kenneth Whistler <email@example.com> wrote:
> > At 03:36 PM 6/26/2003, Kenneth Whistler wrote:
> > > Why is making use of the existing behavior of existing characters
> > > a "groanable kludge", if it has the desired effect and makes
> > > the required distinctions in text? If there is not some
> > > rendering system or font lookup showstopper here, I'm inclined
> > > to think it's a rather elegant way out of the problem.
> > I think assumptions about not breaking combining mark sequences
> > may, in fact, be a showstopper. If <base+mark+mark> becomes
> > <base+mark+CtrlChar+mark>, it is reasonable to think that this will
> > not only inhibit mark re-ordering but also mark combining and mark
> > interaction. Unfortunately, this seems to be the case with every
> > control character I have been able to test, using two different
> > rendering engines (Uniscribe and InDesign ME -- although the latter
> > already has some problems with double marks in Biblical Hebrew).
> > Perhaps we should have a specific COMBINING MARK SEQUENCE CONTROL
> > character?
> Actually, in casting around for the solution to the problem of
> introduction of format controls creating defective combining
> character sequences, it finally occurred to me that:
> U+034F COMBINING GRAPHEME JOINER
> has the requisite properties.
> It is non-visible, does not affect the display of neighboring
> characters (except incidentally, if processes choose to recognize
> sequences containing it and process them distinctly), *AND*
> it is a *combining mark*, not a format control.
> Hence, the sequence:
> <lamed, patah, CGJ, hiriq>
> (combining classes: 0, 17, 0, 14)
> is *not* a defective combining character sequence, by the
> definitions in the standard. The entire sequence of three
> combining marks would have to "apply" to the lamed, but
> the fact that CGJ has (cc=0) prevents the patah from reordering
> around the hiriq under normalization.
> Could this finally be the missing "killer app" for the CGJ?
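The blocking behavior Ken describes can be checked directly; here is a minimal Python sketch using the standard unicodedata module (the code points are U+05DC lamed, U+05B7 patah with ccc=17, U+05B4 hiriq with ccc=14, and U+034F CGJ, as in the example above):

```python
import unicodedata

LAMED, PATAH, HIRIQ, CGJ = '\u05DC', '\u05B7', '\u05B4', '\u034F'

without_cgj = LAMED + PATAH + HIRIQ        # ccc: 0, 17, 14
with_cgj = LAMED + PATAH + CGJ + HIRIQ     # ccc: 0, 17, 0, 14

# Without CGJ, canonical ordering sorts the points by combining class,
# swapping patah and hiriq...
print(unicodedata.normalize('NFC', without_cgj) == LAMED + HIRIQ + PATAH)  # True
# ...but the ccc=0 CGJ blocks that reordering, so the sequence survives
# normalization intact.
print(unicodedata.normalize('NFC', with_cgj) == with_cgj)  # True
```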
It would be perfect for allowing an application like XML to encode Hebrew
text using Unicode 4.0 rules (and earlier).
Existing (already encoded) Biblical Hebrew text can easily be corrected
automatically by inserting this character, so that it can be processed with
existing technologies that expect an NF* form, or that create such an
intermediate form during processing.
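Such an automatic correction filter could be sketched as follows; this is only an illustration (the function name is hypothetical, and a real filter would restrict itself to the Hebrew point pairs whose order is actually significant, rather than all marks):

```python
import unicodedata

CGJ = '\u034F'

def protect_mark_order(text: str) -> str:
    """Insert CGJ between consecutive combining marks (ccc > 0) so that
    a later NF* normalization can no longer reorder them."""
    out = []
    prev_was_mark = False
    for ch in text:
        is_mark = unicodedata.combining(ch) > 0
        if is_mark and prev_was_mark:
            out.append(CGJ)
        out.append(ch)
        prev_was_mark = is_mark
    return ''.join(out)

# <lamed, patah, hiriq> becomes <lamed, patah, CGJ, hiriq>,
# which is then stable under NFC:
fixed = protect_mark_order('\u05DC\u05B7\u05B4')
print(unicodedata.normalize('NFC', fixed) == fixed)  # True
```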
The Combining Grapheme Joiner is still a hint that can be inserted within
any sequence to create a ligature between two characters normally
considered (semantically) as separate. This means that any font or
renderer that cannot find a glyph for the ligated form will simply ignore
the CGJ and render the characters separately.
The only new thing is that the CGJ previously occurred only before a base
character; it was not intended to occur before other combining characters,
even though it is a combining character itself.
I do think that CGJ has combining class 0 only to avoid having it moved
within the combining sequence when applying an NF* normalization,
precisely because such a move would nullify its effect on the selection of
a ligature if it were not placed immediately before the base character
that follows it.
So this CGJ acts in many places as a variant selector for the next base
character, which is modified to ligate with the previous grapheme cluster.
Characters in other scripts have similar behavior, modifying the glyph
that follows them: anusvaras, visargas, nuktas; except that those are not
considered combining characters, because they can occur at the beginning
of the encoding of the combining sequence...
Is CGJ correctly interpreted by UCA collation rules and keys? This would
be important, as Biblical texts are very often used as plain text.
> If CGJ is the one, then the only *real* implementation requirement
> would be that CGJ be consistently inserted (for Biblical Hebrew)
> between any pair of points applied to the same consonant. Depending
> on the particular application, this could either be hidden behind the
> input method/keyboard and be actively managed by the software, or
> it could be applied as a filter to an export format, when exporting
> to contexts that might neutralize intended contrasts or result in
> the wrong display by the application of normalization.
For now, it's a good "trick" to transport these texts safely. It will require
some work in Hebrew fonts so that they ignore the CGJ character when
looking up combining sequences, but normally a font should not be
constrained by a normalization process; normalization should be performed
earlier, in input methods or during the transport of text, not at the final
rendering step.
So the "superfluous" CGJ should simply be removed, when appropriate,
by the font rendering engine (Uniscribe on Windows) when there is no
specific glyph available for the N-to-1 sequence.
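That fallback logic can be sketched in a few lines; this is a toy model, not Uniscribe's actual behavior, and ligature_table stands in for a hypothetical font lookup of N-to-1 glyph substitutions:

```python
CGJ = '\u034F'

def glyphs_for_cluster(cluster, ligature_table):
    # ligature_table maps full character sequences (CGJ included)
    # to a single ligature glyph name.
    if cluster in ligature_table:
        return [ligature_table[cluster]]
    # No N-to-1 glyph available: drop the superfluous CGJ and map
    # the remaining characters to their individual glyphs.
    return [c for c in cluster if c != CGJ]

oe = 'o' + CGJ + 'e'
print(glyphs_for_cluster(oe, {oe: 'oe_ligature'}))  # ['oe_ligature']
print(glyphs_for_cluster(oe, {}))                   # ['o', 'e']
```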
The current use of CGJ is for sequences like:
<b>+<o, CGJ>+<e>+<u>+<f> and <e>+<f, CGJ>+<f>+<e>+<t>
which still encode the French words "boeuf" and "effet", where the author
gives a hint to display the sequence "oe" as a single ligated form instead
of two separate grapheme clusters, even though it still carries the
"classic" semantics of such a sequence, including for collation/sorting.
Its use is thus typographic, and enhances readability.
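Since the CGJ-joined sequence keeps its classic semantics, a process that compares or sorts such text can simply discard the CGJ before building its key; a trivial sketch (the function name is hypothetical):

```python
CGJ = '\u034F'

def semantic_key(s: str) -> str:
    # CGJ is only a typographic hint; drop it before comparing or sorting.
    return s.replace(CGJ, '')

print(semantic_key('bo' + CGJ + 'euf') == 'boeuf')  # True
print(semantic_key('ef' + CGJ + 'fet') == 'effet')  # True
```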
In fact, if you look at the "oe" or "ae" ligatures in French, they are never
handled as single letters as in other languages (that's why even the
French keyboard does not include them, since the ligature is only implicit
"correct" typography). This means that the "ae" ligature, in French, can
always be safely decomposed as <a, CGJ>+<e> in a font renderer, without
losing any semantics; and if the ligature is not present in the selected
font, the CGJ will be ignored and the individual glyphs for <a> and <e>
will be displayed.
The same would be true for the German eszett ligature, decomposable
(not canonically, but effectively for German use) as <long s, CGJ>+<s>,
and then displayable as <long s>+<s>, or as <s>+<s> if the glyph for
<long s> is also missing in a font, considering that <long s> is a glyph
variant of the same letter <s> which is preencoded in Unicode instead
of requiring some <VS1, s> combining sequence.
For the case of Biblical Hebrew, the extended use of multiple cantillation
marks on the same consonant base letter effectively works as an
ordered ligature, rather than as unordered combining marks, which were
introduced in Unicode/ISO 10646 mainly to avoid an explosion of the
repertoire with all possible accented letters, whose interpretation
varies a lot across languages.
This archive was generated by hypermail 2.1.5 : Fri Jun 27 2003 - 06:37:38 EDT