RE: Latin ligatures and Unicode

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Wed Dec 22 1999 - 12:05:23 EST


> -----Original Message-----
> From: John Cowan [mailto:jcowan@reutershealth.com]
> Sent: Wednesday, December 22, 1999 9:22 AM
>
> "Reynolds, Gregg" wrote:
>
> > But is that an entirely accurate description of the semantics of ZWJ in
> > Arabic? All ZWJ means with respect to joining is "thou shalt join";
>
> Actually, not. See below.

Right; it doesn't force a join where none would occur in its absence. How
about "Thou shalt join, unless you object to that sort of thing"?

> > As an example of how ZWJ could be put to good use with no notion of
> > "deceiving the renderer", consider that lexigraphic words in Arabic
> > frequently contain multiple lexemes. For example, kitAbuhA = kitAbu, book,
> > + suffix hA, of her = her book. In the absence of a proper codepoint with
> > "LEXEME DELIMITER" semantics, I can use ZWJ to provide such semantics
> > without affecting the rendering and search/sort behavior of standard Unicode
> > software:
>
> You *can* do so, but that is not the standard use of ZWJ. ZWNBSP would
> probably serve you better.

But with ZWNBSP, we have no semantics with respect to joining behavior, or
if we do it's well-hidden. Page 6-131 says it "behaves" like U+00A0
NO-BREAK SPACE "in that it indicates the absence of word boundaries;
however, the former has no width." But unless I'm mistaken the relation of
word boundaries and joining behavior is not addressed; nor could it be,
without a more refined notion of "word". "No word boundary" in English will
probably mean no extra space is added, and line-breaking algorithms will not
try to break there. Joining doesn't enter the picture. One could do the
same thing in an Arabic string, but interpret ZWNBSP as forcing terminal
forms on its neighbors. It's not clear to me at least that such an
interpretation would violate the definitions in Unicode; indeed that could
be a legitimate use of it, e.g. for purposes of illustrating joining
behavior. If that's the case, then I can't be sure that conformant software
will agree on the joining behavior.
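
For reference, the characters under discussion can be listed with Python's
standard unicodedata module. This is only a sketch: the module reports the
names and general categories from whatever Unicode version ships with the
interpreter (later than the edition cited above), and it does not expose
joining type at all, so it says nothing about the joining question itself.

```python
import unicodedata

# Characters mentioned in this thread, written as escapes so they stay visible.
CHARS = ["\u200d", "\u200c", "\ufeff", "\u00a0", "\u0640"]


def describe(ch):
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in CHARS:
    print(*describe(ch))
```

The general category separates format characters (Cf) from space separators
(Zs), but nothing in this output ties word boundaries to joining.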

But more to the point, I would argue that "use" and "interpretation" are and
should be distinct. An encoding should provide a semantics, not usage
guidelines. So long as one's usage of a codepoint harmonizes with its
formal semantics, or at least does not violate it (them?), everything should
be hunky-dory (that's 'ok' in Standard English).
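
A minimal sketch of the lexeme-marking usage described in the quoted text,
assuming the unvocalized spelling of kitAbuhA (kitAb + hA). The helper names
mark_lexemes, split_lexemes and strip_marks are hypothetical, and this is the
usage proposed in this thread, not a standard use of ZWJ.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER, used here as an ad hoc lexeme marker


def mark_lexemes(lexemes):
    """Concatenate lexemes with ZWJ between them (usage assumed in this thread)."""
    return ZWJ.join(lexemes)


def split_lexemes(word):
    """Recover the lexemes from a word marked by mark_lexemes."""
    return word.split(ZWJ)


def strip_marks(word):
    """Drop the markers, e.g. before a naive code-point comparison."""
    return word.replace(ZWJ, "")


kitab = "\u0643\u062a\u0627\u0628"  # kaf, teh, alef, beh (kitAb, unvocalized)
ha = "\u0647\u0627"                 # heh, alef (suffix hA)

marked = mark_lexemes([kitab, ha])
print(split_lexemes(marked) == [kitab, ha])
print(strip_marks(marked) == kitab + ha)
```

A plain code-point comparison does see the extra character, so the sketch
strips it before comparing; whether rendering and collation really ignore it
depends on the software in question.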

And to complicate things even further, in some cases Arabic ligated forms
can be interrupted by line breaks or quote marks (guillemets usually).
Haven't worked out how that would be encoded, but I think ZWJ would be
required on either side of the interrupting char. What is the proper
interpretation of ZWNBSP in such a case? Its purpose appears to be the
inhibition of line breaks, but this also involves ligature breaks (using
"ligature" in the strict sense of "tie").

Example: li-al-HayAt, "to Al-HayAt", as in "Write to Al-HayAt for more
info" (it's a newspaper). To indicate that al-HayAt is a proper name, you
enclose it in guillemets or some other typographic quoting figures; since
"li-al" is ligated, this means you have to break the join. "li-" then takes
initial form, as does the following alif of "-al". "li" and "al" also
happen to be distinct lexemes, so we want them both demarcated as such. How
would you encode that, both with and without the guillemets?

I suspect I could come up with examples where ZWNBSP could divide a single
lexigraphic word into two parts, both of which could be interpreted as
distinct lexigraphic words, in which case an implementation could either
join or not join and still get readable Arabic.

> > In this example, ZWJ falls between two characters of the joining class; it
> > has no effect on their form, and the ligation is formed.
>
> Then there is no point in it, at least not according to the standard
> definitions.

See above; semantics doesn't (shouldn't) address issues of utility.

> > Or, define it as a non-printing character of the
> > dual-joining class.
>
> This is probably the best definition. ZWNJ, then is an invisible character
> of the non-joining class.
>
> > I guess I'd have to differ with you on this interpretation. Seems
> > reasonable to me to talk of joining forms of just about anything _within a
> > local context_. Where 'fi' ligatures exist, the 'f' of the ligated form is
> > not the same form as isolated 'f'.
>
> Agreed. I think the trouble comes from the word JOINER in the names of ZWJ
> and ZWNJ. These characters do not "join" anything; rather, they provoke
> the shaping of surrounding characters by creating a pseudo-context.

I think I agree with this, but my inner hair-splitter is telling me that
they do in fact join (where appropriate); but since they're typographically
null, we just can't see it. ;) And actually, I think a formal semantics
might well express it something like that.
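
To make that concrete, here is a toy model of the definition discussed above:
ZWJ as an invisible character of the dual-joining class, ZWNJ as an invisible
character of the non-joining class. It is a sketch under stated assumptions,
not an implementation of the standard's shaping rules: the class table covers
only a handful of letters, marks and mandatory ligatures such as lam-alef are
ignored, and Unicode itself classes ZWJ as join-causing rather than
dual-joining. The names JOINING and contextual_forms are hypothetical.

```python
ZWJ, ZWNJ = "\u200d", "\u200c"

# Toy joining classes: D = dual-joining, R = right-joining, U = non-joining.
JOINING = {
    "\u0643": "D",  # kaf
    "\u062a": "D",  # teh
    "\u0627": "R",  # alef
    "\u0628": "D",  # beh
    "\u0647": "D",  # heh
    ZWJ: "D",       # assumption from this thread: invisible dual-joining
    ZWNJ: "U",      # invisible non-joining
}

FORMS = {
    (False, False): "isolated",
    (False, True): "initial",
    (True, True): "medial",
    (True, False): "final",
}


def joining_class(ch):
    """Joining class of ch; anything not in the toy table is non-joining."""
    return JOINING.get(ch, "U")


def contextual_forms(text):
    """Return the contextual form of each visible letter, in logical order."""
    forms = []
    for i, ch in enumerate(text):
        if ch in (ZWJ, ZWNJ):
            continue  # invisible: supplies context but takes no form itself
        cls = joining_class(ch)
        prev_cls = joining_class(text[i - 1]) if i > 0 else "U"
        next_cls = joining_class(text[i + 1]) if i + 1 < len(text) else "U"
        joins_prev = cls in ("D", "R") and prev_cls == "D"
        joins_next = cls == "D" and next_cls in ("D", "R")
        forms.append(FORMS[(joins_prev, joins_next)])
    return forms


WORD = "\u0643\u062a\u0627\u0628\u0647\u0627"  # kitAb + hA, unvocalized
print(contextual_forms(WORD))
print(contextual_forms(WORD[:4] + ZWJ + WORD[4:]))
print(contextual_forms(WORD[:4] + ZWNJ + WORD[4:]))
```

In this model a ZWJ between beh and heh leaves every form unchanged, while a
ZWNJ in the same position breaks the join, and a ZWJ after a lone heh yields
the initial form. That is the "pseudo-context" reading and the "they do join,
we just can't see it" reading giving the same answer.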

> > While we're at it, we also need a way to stretch the space between two
> > adjacent Arabic letterforms that don't join, but without introducing word
> > separation. Tatweel would work just fine if marking semantics were made
> > dependent on syntactic context - i.e. it should not be considered
> > "join-causing"; its semantics should simply be "stretch whatever's there,
> > be it whitespace or a ligating stroke."
>
> That is the function of NBSP.

Same problem with the relation of joining, spacing, and word boundaries.
Might be just an issue of making what's implicit explicit: place NBSP (and
ZWNBSP) in the dual-join category. Then it has the same semantics as ZWJ,
with NB added.

> > Also needed: a means of placing diacritics over null space - e.g. over space
> > or a ligating stroke. ZWJ would be good for this, except for the part about
> > zero width. Anyway, that's a subject for a different thread and I gotta get
> > back to the grindstone.
>
> The convention is to put the diacritic following SP; NBSP would work equally
> well, sometimes better.

Same observation as above. What does it mean to put a space of any kind
between two ligated letterforms? And how does that relate to tatweel?

In the end I suspect we need a few more codepoints specifically designed to
handle such issues for Arabiform text.

Thanks,

-gregg


