Re: Latin ligatures and Unicode

From: Mark E. Davis
Date: Wed Dec 22 1999 - 12:39:53 EST

The line-break controls (ZWSP, ZWNBSP) are defined in the standard to be
orthogonal to the cursive joining controls (ZWJ, ZWNJ). They are to be ignored
when processing cursive joining. For example, if I had
<space><HEH><ZWNBSP><plus-sign>, the <HEH> would have an independent form; the
ZWNBSP does not cause it to adopt the initial form. Instead, it forbids a
linebreak at that point. (By the way, ZWSP and ZWNBSP would better have been
named ZWLB and ZWNLB: allow/forbid linebreak).
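
A minimal sketch of that orthogonality, in Python. The joining-type table is a
toy covering only a few characters, the function names are made up for
illustration, and treating ZWSP and ZWNBSP as transparent ("T") is simply the
"ignored when processing cursive joining" rule above written as an assumption,
not a quotation from the standard's data files:

```python
# Toy joining-type table; anything not listed is treated as non-joining ("U").
JOINING_TYPE = {
    "\u0647": "D",  # ARABIC LETTER HEH, dual-joining
    "\u0628": "D",  # ARABIC LETTER BEH, dual-joining
    "\u0627": "R",  # ARABIC LETTER ALEF, right-joining
    "\u200D": "C",  # ZERO WIDTH JOINER, join-causing
    "\u200C": "U",  # ZERO WIDTH NON-JOINER, non-joining
    "\u200B": "T",  # ZERO WIDTH SPACE, skipped for joining (assumption)
    "\uFEFF": "T",  # ZERO WIDTH NO-BREAK SPACE, skipped for joining (assumption)
}


def jtype(ch):
    return JOINING_TYPE.get(ch, "U")


def neighbour(text, i, step):
    """Joining type of the nearest non-transparent character, 'U' at the edge."""
    j = i + step
    while 0 <= j < len(text):
        t = jtype(text[j])
        if t != "T":
            return t
        j += step
    return "U"


def joining_forms(text):
    """Return (letter, form) for each joining letter, in logical order."""
    forms = []
    for i, ch in enumerate(text):
        t = jtype(ch)
        if t not in ("D", "R"):
            continue
        # Connects to the preceding character if that one can join forward.
        joins_prev = neighbour(text, i, -1) in ("D", "C")
        # Only dual-joining letters can connect to the following character.
        joins_next = t == "D" and neighbour(text, i, +1) in ("D", "R", "C")
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((ch, form))
    return forms


# <space><HEH><ZWNBSP><plus-sign>: HEH stays in its independent form.
print([form for _, form in joining_forms(" \u0647\uFEFF+")])  # ['isolated']
# Swapping in ZWJ is what would produce the initial form.
print([form for _, form in joining_forms(" \u0647\u200D+")])  # ['initial']
```

In this sketch the line-break controls never reach the joining logic at all,
so BEH + ZWNBSP + HEH still comes out as initial + final.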

If we do add ligature controls (ZWL, ZWNL), they would also be orthogonal
to cursive joining, since letters may join cursively without ligating, and
vice versa.

The other controls under discussion in the UTC are grapheme controls (ZWG,
ZWNG), which would be used to join/break graphemes for the purpose of analysis
(e.g. collation). An example would be 'ch' for Slovak. It is unclear as yet
whether these could be subsumed under the ligature controls -- e.g. whether
ligatures are forbidden essentially when they blur grapheme boundaries.
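
To make the collation case concrete, here is a rough Python sketch. ZWG and
ZWNG are only proposals at this point and have no code points, so the
private-use character U+E000 below is a purely hypothetical stand-in for
ZWNG. The alphabet is a simplified, unaccented one in which 'ch' sorts as a
single unit after 'h', and all function names are illustrative:

```python
# Hypothetical stand-in for the proposed ZWNG; no real code point is implied.
ZWNG = "\ue000"

# Simplified alphabet: unaccented lowercase letters, with "ch" after "h".
ORDER = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {unit: i for i, unit in enumerate(ORDER)}


def collation_units(word):
    """Split a lowercase a-z word into units, keeping 'ch' together
    unless the hypothetical ZWNG stands between the two letters."""
    units = []
    i = 0
    while i < len(word):
        if word[i] == ZWNG:
            i += 1  # the control itself carries no collation weight
        elif word.startswith("ch", i):
            units.append("ch")
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def sort_key(word):
    return [RANK[unit] for unit in collation_units(word)]


words = ["idea", "chata", "hrad", "cesta"]
print(sorted(words, key=sort_key))  # ['cesta', 'hrad', 'chata', 'idea']
print(collation_units("c" + ZWNG + "hata"))  # ['c', 'h', 'a', 't', 'a']
```

Note that the grapheme decision here is made without looking at rendering at
all, which is the sense in which such a control would be an analysis control
rather than a ligature control.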

"Reynolds, Gregg" wrote:

> > -----Original Message-----
> > From: John Cowan
> > Sent: Wednesday, December 22, 1999 9:22 AM
> >
> > "Reynolds, Gregg" wrote:
> >
> > > But is that an entirely accurate description of the semantics of ZWJ in
> > > Arabic? All ZWJ means with respect to joining is "thou shalt join";
> >
> > Actually, not. See below.
>
> Right; it doesn't force a join where none would occur in its absence. How
> about "Thou shalt join, unless you object to that sort of thing"?
>
> > > As an example of how ZWJ could be put to good use with no notion of
> > > "deceiving the renderer", consider that lexigraphic words in Arabic
> > > frequently contain multiple lexemes. For example, kitAbuhA = kitAbu,
> > > book, + suffix hA, of her = her book. In the absence of a proper
> > > codepoint with "LEXEME DELIMITER" semantics, I can use ZWJ to provide
> > > such semantics without affecting the rendering and search/sort behavior
> > > of standard Unicode software:
> >
> > You *can* do so, but that is not the standard use of ZWJ. ZWNBSP would
> > probably serve you better.
>
> But with ZWNBSP, we have no semantics with respect to joining behavior, or
> if we do it's well-hidden. Page 6-131 says it "behaves" like U+00A0
> NO-BREAK SPACE "in that it indicates the absence of word boundaries;
> however, the former has no width." But unless I'm mistaken the relation of
> word boundaries and joining behavior is not addressed; nor could it be,
> without a more refined notion of "word". "No word boundary" in English will
> probably mean no extra space is added, and line-breaking algorithms will not
> try to break there. Joining doesn't enter the picture. One could do the
> same thing in an Arabic string, but interpret ZWNBSP as forcing terminal
> forms on its neighbors. It's not clear to me at least that such an
> interpretation would violate the definitions in Unicode; indeed that could
> be a legitimate use of it, e.g. for purposes of illustrating joining
> behavior. If that's the case, then I can't be sure that conformant software
> will agree on the joining behavior.
>
> But more to the point, I would argue that "use" and "interpretation" are and
> should be distinct. An encoding should provide a semantics, not usage
> guidelines. So long as one's usage of a codepoint harmonizes with its
> formal semantics, or at least does not violate it (them?), everything should
> be hunky-dory (that's 'ok' in Standard English).
>
> And to complicate things even further, in some cases Arabic ligated forms
> can be interrupted by line breaks or quote marks (guillemets usually).
> Haven't worked out how that would be encoded, but I think ZWJ would be
> required on either side of the interrupting char. What is the proper
> interpretation of ZWNBSP in such a case? Its purpose appears to be the
> inhibition of line breaks, but this also involves ligature breaks (using
> "ligature" in the strict sense of "tie").
>
> Example: li-al-HayAt, "to Al-HayAt", as in "Write to Al-HayAt for more
> info" (it's a newspaper). To indicate that al-HayAt is a proper name, you
> enclose it in guillemets or some other typographic quoting figures; since
> "li-al" is ligated, this means you have to break the join. "li-" then takes
> initial form, as does the following alif of "-al". "li" and "al" also
> happen to be distinct lexemes, so we want them both demarcated as such. How
> would you encode that, both with and without the guillemets?
>
> I suspect I could come up with examples where ZWNBSP could divide a single
> lexigraphic word into two parts, both of which could be interpreted as
> distinct lexigraphic words, in which case an implementation could either
> join or not join and still get readable Arabic.
>
> > > In this example, ZWJ falls between two characters of the joining class;
> > > it has no effect on their form, and the ligation is formed.
> >
> > Then there is no point in it, at least not according to the standard
> > definitions.
>
> See above; semantics doesn't (shouldn't) address issues of utility.
>
> > > Or, define it as a non-printing character of the
> > > dual-joining class.
> >
> > This is probably the best definition. ZWNJ, then, is an invisible character
> > of the non-joining class.
> >
> > > I guess I'd have to differ with you on this interpretation. Seems
> > > reasonable to me to talk of joining forms of just about anything _within
> > > a local context_. Where 'fi' ligatures exist, the 'f' of the ligated form
> > > is not the same form as isolated 'f'.
> >
> > Agreed. I think the trouble comes from the word JOINER in the names of ZWJ
> > and ZWNJ. These characters do not "join" anything; rather, they provoke
> > the shaping of surrounding characters by creating a pseudo-context.
>
> I think I agree with this, but my inner hair-splitter is telling me that
> they do in fact join (where appropriate); but since they're typographically
> null, we just can't see it. ;) And actually, I think a formal semantics
> might well express it something like that.
>
> > > While we're at it, we also need a way to stretch the space between two
> > > adjacent Arabic letterforms that don't join, but without introducing word
> > > separation. Tatweel would work just fine if marking semantics were made
> > > dependent on syntactic context - i.e. it should not be considered
> > > "join-causing"; its semantics should simply be "stretch whatever's
> > > there, be it whitespace or a ligating stroke."
> >
> > That is the function of NBSP.
>
> Same problem with the relation of joining, spacing, and word boundaries.
> Might be just an issue of making what's implicit explicit: place NBSP (and
> ZWNBSP) in the dual-join category. Then it has the same semantics as ZWJ,
> with NB added.
>
> > > Also needed: a means of placing diacritics over null space - e.g. over
> > > space or a ligating stroke. ZWJ would be good for this, except for the
> > > part about zero width. Anyway, that's a subject for a different thread
> > > and I gotta get back to the grindstone.
> >
> > The convention is to put the diacritic following SP; NBSP would work
> > equally well, sometimes better.
>
> Same observation as above. What does it mean to put a space of any kind
> between two ligated letterforms? And how does that relate to tatweel?
>
> In the end I suspect we need a few more codepoints specifically designed to
> handle such issues for Arabiform text.
>
> Thanks,
> -gregg
