Re: Latin ligatures and Unicode

From: John Cowan
Date: Wed Dec 22 1999 - 12:44:11 EST

"Reynolds, Gregg" wrote:

> But with ZWNBSP, we have no semantics with respect to joining behavior, or
> if we do it's well-hidden.

ZWNBSP has no effect on joining behavior, correct. You were saying
that kitAbuhA should be expressed as kitAbu+hA where + is some sort
of word boundary that doesn't involve whitespace. ZWNBSP is precisely
a word boundary that doesn't generate any whitespace, neither horizontal
nor vertical. The b and h would remain joined, but word-boundary
analysis would show two words here.
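
As a rough sketch of that encoding (assuming the unvocalized spelling
kaf, teh, alef, beh, heh, alef for kitAbuhA, and a deliberately naive
word-boundary analysis that splits only on ZWNBSP; both are
illustrative assumptions, not anything the standard prescribes):

```python
import unicodedata

ZWNBSP = "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE

# Assumed unvocalized spelling: kaf, teh, alef, beh + heh, alef
kitabu = "\u0643\u062A\u0627\u0628"
ha = "\u0647\u0627"
encoded = kitabu + ZWNBSP + ha

# List the code points so the invisible boundary character can be seen.
for ch in encoded:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Hypothetical word-boundary analysis: ZWNBSP is the only separator.
words = encoded.split(ZWNBSP)

# The boundary adds no whitespace character to the text.
has_whitespace = any(ch.isspace() for ch in encoded)
print(len(words), has_whitespace)
```

The split yields two words even though the string contains no
whitespace, which is the point being made above.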

> But more to the point, I would argue that "use" and "interpretation" are and
> should be distinct. An encoding should provide a semantics, not usage
> guidelines.

Unicode has historically provided usage guidance, not semantics (still
less formal semantics).

> And to complicate things even further, in some cases Arabic ligated forms
> can be interrupted by line breaks or quote marks (guillemets usually).
> Haven't worked out how that would be encoded, but I think ZWJ would be
> required on either side of the interrupting char.

I think not. Guillemets are non-joining characters, and so are LS/CR/LF/NL,
so preventing a join across the boundary would be automatic.

> Example: li-al-HayAt, "to Al-HayAt", as in "Write to Al-HayAt for more
> info" (it's a newspaper). To indicate that al-HayAt is a proper name, you
> enclose it in guillemets or some other typographic quoting figures; since
> "li-al" is ligated, this means you have to break the join. "li-" then takes
> initial form, as does the following alif of "-al". "li" and "al" also
> happen to be distinct lexemes, so we want them both demarcated as such. How
> would you encode that, both with and without the guillemets?

So you want a ZWNBSP between "al" and "HayAt" in any case, and between
"li" and "al" if there is no punctuation. Inserting the guillemets
should provoke the correct shaping results. No need for ZWJ or ZWNJ here.

> I suspect I could come up with examples where ZWNBSP could divide a single
> lexigraphic word into two parts, both of which could be interpreted as
> distinct lexigraphic words, in which case an implementation could either
> join or not join and still get readable Arabic.

To get non-joining behavior, you need both ZWNJ (for non-joining) and ZWNBSP
(for word separation).

> > > In this example, ZWJ falls between two characters of the joining class;
> > > it has no effect on their form, and the ligation is formed.
> >
> > Then there is no point in it, at least not according to the
> > standard definitions.
>
> See above; semantics doesn't (shouldn't) address issues of utility.

It occurs to me at this point that we may be on different tracks.
ZWJ and ZWNJ are about *shaping* stricto sensu: they have to do
with whether initial, medial, final, or isolated forms are chosen.
Arabic ligatures as such are *not* affected.
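
A minimal sketch of shaping in this narrow sense may help. It is a
simplified model, not the full Unicode algorithm: the joining types
below are hand-assigned for a handful of characters, ZWNBSP is assumed
to be transparent to joining (per the discussion in this message), and
the function names are hypothetical.

```python
# Simplified model of Arabic positional shaping, for illustration only.
DUAL, RIGHT, NONJOIN, CAUSING, TRANSPARENT = "D", "R", "U", "C", "T"

# Hand-assigned joining types (an assumption of this sketch).
JOINING_TYPE = {
    "\u0643": DUAL,         # KAF
    "\u062A": DUAL,         # TEH
    "\u0627": RIGHT,        # ALEF: joins only to the preceding letter
    "\u0628": DUAL,         # BEH
    "\u0647": DUAL,         # HEH
    "\u200D": CAUSING,      # ZWJ
    "\u200C": NONJOIN,      # ZWNJ
    "\uFEFF": TRANSPARENT,  # ZWNBSP, assumed to have no effect on joining
    "\u00AB": NONJOIN,      # left guillemet
    "\u00BB": NONJOIN,      # right guillemet
}

def joining_type(ch):
    """Anything not in the table is treated as non-joining."""
    return JOINING_TYPE.get(ch, NONJOIN)

def neighbour_type(text, i, step):
    """Joining type of the nearest non-transparent neighbour of text[i]."""
    j = i + step
    while 0 <= j < len(text):
        t = joining_type(text[j])
        if t != TRANSPARENT:
            return t
        j += step
    return NONJOIN

def forms(text):
    """Return the positional form chosen for each letter, in logical order."""
    result = []
    for i, ch in enumerate(text):
        t = joining_type(ch)
        if t not in (DUAL, RIGHT):
            continue  # only letters take a positional form
        # The previous character must be able to connect forward.
        joins_prev = neighbour_type(text, i, -1) in (DUAL, CAUSING)
        # A right-joining letter never connects to what follows it.
        joins_next = t == DUAL and neighbour_type(text, i, +1) in (DUAL, RIGHT, CAUSING)
        if joins_prev and joins_next:
            result.append("medial")
        elif joins_prev:
            result.append("final")
        elif joins_next:
            result.append("initial")
        else:
            result.append("isolated")
    return result

KITABU, HA = "\u0643\u062A\u0627\u0628", "\u0647\u0627"
plain = forms(KITABU + HA)
with_zwnbsp = forms(KITABU + "\uFEFF" + HA)
with_zwnj = forms(KITABU + "\u200C" + "\uFEFF" + HA)
with_guillemet = forms(KITABU + "\u00AB" + HA)
with_zwj = forms(KITABU + "\u200D" + HA)
```

In this model ZWNBSP and ZWJ between beh and heh leave every form
unchanged, while ZWNJ or a guillemet turns beh into an isolated form
and heh into an initial form. Ligature selection is not modelled at
all, since it is a separate matter from shaping.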

To provoke an optional ligature (distinct from
simple shaping) would require something like the new ZWL.

> > > While we're at it, we also need a way to stretch the space between two
> > > adjacent Arabic letterforms that don't join, but without introducing word
> > > separation. Tatweel would work just fine if marking semantics were made
> > > dependent on syntactic context - i.e. it should not be considered
> > > "join-causing"; its semantics should simply be "stretch whatever's there,
> > > be it whitespace or a ligating stroke."
> >
> > That is the function of NBSP.
> Same problem with the relation of joining, spacing, and word boundaries.
> Might be just an issue of making what's implicit explicit: place NBSP (and
> ZWNBSP) in the dual-join category. Then it has the same semantics as ZWJ,
> with NB added.

Every non-Arabic character (except ZWJ itself) is non-joining. The whole
notion of joining *across* visible whitespace makes little sense to me.
The function of NBSP is to create visible whitespace without a word boundary.
If you want a connecting line, use TATWEEL.

> What does it mean to put a space of any kind between two ligated letterforms?

SP is also a non-joining character. I thought you were asking about isolated
diacritics, which are represented by SP+diacritic.
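
For concreteness, a small sketch of the SP+diacritic convention,
using FATHA as the example mark. The helper name is hypothetical and
the check is deliberately crude: it only tests for a space followed by
one character with a non-zero combining class.

```python
import unicodedata

FATHA = "\u064E"  # ARABIC FATHA, a combining vowel mark

# An isolated diacritic: the mark is carried by SP instead of a letter.
isolated_fatha = " " + FATHA

def is_isolated_diacritic(pair):
    """True if pair is SPACE followed by exactly one combining mark."""
    return (
        len(pair) == 2
        and pair[0] == " "
        and unicodedata.combining(pair[1]) != 0
    )

print(is_isolated_diacritic(isolated_fatha))
print(is_isolated_diacritic("\u0628" + FATHA))  # BEH + FATHA: not isolated
```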


Schlingt dreifach einen Kreis vom dies! || John Cowan
Schliesst euer Aug vor heiliger Schau,  ||
Denn er genoss vom Honig-Tau,           ||
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:57 EDT