RE: Latin ligatures and Unicode

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Mon Dec 20 1999 - 17:47:39 EST


> -----Original Message-----
> From: John Cowan [mailto:jcowan@reutershealth.com]
> Sent: Monday, December 20, 1999 2:47 PM
>
> Marco.Cimarosti@icl.com wrote:
>
> > Yes, I think it's likely. "f+ZWL+i" would be an explicitly required
> > ligature, "f+ZWNL+i" would be an explicitly forbidden ligature, "f+i"
> > would be the programmers' favorite expression: "the default".
>
> Sounds good to me.
>

Wouldn't it be more accurate to define these in terms of "here's what to do
in the context of local convention"? I.e., "ZWL" means "where local
convention supports this, ligate the two chars immediately adjacent to ZWL;
otherwise don't." Then ZWNL means "where local convention provides for two
distinct forms, use 'em; otherwise, use the ligated form or collocated form
or whatever"; and adjacency (no ZW*L) means "do whatever feels right." IOW,
an explicit ligature mark supports communicating intentions that otherwise
cannot be communicated. Still seems like plaintext to me.
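
To make the three cases concrete, here is a minimal sketch of that
reading. Everything in it is an assumption for illustration: ZWL and ZWNL
are only proposals, so two private-use code points stand in for them, and
the names ligature_requests, resolve and the "convention" table are
hypothetical.

```python
from enum import Enum

# Assumption: ZWL and ZWNL are only proposals in this thread, so two
# private-use code points stand in for them here.
ZWL = "\uE000"   # hypothetical "zero width ligator"
ZWNL = "\uE001"  # hypothetical "zero width non-ligator"


class Request(Enum):
    LIGATE = "ligate"      # sender asked for a ligature
    SEPARATE = "separate"  # sender asked for distinct forms
    DEFAULT = "default"    # plain adjacency: renderer decides


def ligature_requests(text):
    """Yield (left, right, request) for each pair of adjacent visible characters."""
    marks = {ZWL: Request.LIGATE, ZWNL: Request.SEPARATE}
    prev, pending = None, Request.DEFAULT
    for ch in text:
        if ch in marks:
            # Remember the mark; it applies to the pair it sits between.
            pending = marks[ch]
            continue
        if prev is not None:
            yield prev, ch, pending
        prev, pending = ch, Request.DEFAULT


def resolve(left, right, request, convention, preference="separate"):
    """Pick "ligated" or "separate" for one pair.

    convention maps a pair such as "fi" to the non-empty set of forms the
    local convention offers; an unlisted pair has only the separate form.
    """
    forms = convention.get(left + right, {"separate"})
    if request is Request.LIGATE:
        wanted = "ligated"
    elif request is Request.SEPARATE:
        wanted = "separate"
    else:
        wanted = preference  # "do whatever feels right"
    if wanted in forms:
        return wanted
    # The requested form does not exist locally: fall back to what does.
    return next(iter(forms))
```

With a table such as {"fi": {"ligated", "separate"}}, "f" + ZWL + "i"
resolves to the ligated form, while the same request against a convention
that has no 'fi' ligature quietly falls back to separate letters.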

> > I don't understand the cursiveness example that Michael does above;
> > what other subtle reasons are there not to unify ZWL with ZWJ?
>
> Primarily that ZWJ and ZWNJ are essentially related to the idea of
> context-sensitive letterforms. They function as letters which deceive
> the shaping process. In the Arabic context, which is the paradigmatic
> one for these letters, the sequence letter+ZWJ deceives the renderer
> into believing that the letter is initial rather than isolated.
> Similarly, in Indic scripts

But is that an entirely accurate description of the semantics of ZWJ in
Arabic? All ZWJ means with respect to joining is "thou shalt join"; it
doesn't say anything about which joining form to use; that is determined
syntactically. That's important in Arabic, because e.g. initial forms
indicate, well, the initiation of a lexigraphic word. In other words, the
"contextual" forms might be better construed as syntactic or lexical forms;
they play a central role in word recognition, along with ligation.

As an example of how ZWJ could be put to good use with no notion of
"deceiving the renderer", consider that lexigraphic words in Arabic
frequently contain multiple lexemes. For example, kitAbuhA = kitAbu, book,
+ suffix hA, of her = her book. In the absence of a proper codepoint with
"LEXEME DELIMITER" semantics, I can use ZWJ to provide such semantics
without affecting the rendering and search/sort behavior of standard Unicode
software:

        kitAbu+hA = kitAbu+ZWJ+hA.

In this example, ZWJ falls between two characters of the joining class; it
has no effect on their form, and the ligation is formed. Same thing could
be done in a very common case where two distinct lexemes are by convention
treated typographically as a single lexigraphic form, even though they are
not joined. This occurs with the particle "wa", meaning roughly "and":

        wakitAbuhA = wa+ZWJ+kitAbu+ZWJ+hA

Here ZWJ again has no effect on the joining form, but does serve to join the
two strings semantically. (But of course it would be better to have a few
more delimiter tokens in Unicode.)
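
A rough sketch of the idea, using the same transliterations as above. The
helper names are hypothetical, and the sketch strips ZWJ explicitly before
comparing; whether a particular search or sort implementation ignores ZWJ
on its own is an assumption this code does not rely on.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER


def mark_lexemes(lexemes):
    """Join lexemes into one lexigraphic word, with ZWJ at each boundary."""
    return ZWJ.join(lexemes)


def split_lexemes(word):
    """Recover the lexemes from a ZWJ-marked word."""
    return word.split(ZWJ)


def strip_marks(word):
    """Drop the ZWJ marks, e.g. before a plain string comparison."""
    return word.replace(ZWJ, "")


marked = mark_lexemes(["wa", "kitAbu", "hA"])
assert split_lexemes(marked) == ["wa", "kitAbu", "hA"]
assert strip_marks(marked) == "wakitAbuhA"
```

The marked string carries the lexeme boundaries for software that wants
them, and anything that drops the marks sees the ordinary word.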

In other words, it is better to say that ZWJ establishes the syntactic
structure of the lexigraph than that it "change[s] the context of a
particular character occurrence" (p. 6-70 of version 2). Or, define it as a
non-printing character of the dual-joining class. Or: a null baa. ZWNJ
could be defined as a null space.

The importance of this is rather subtle; it depends on the understanding
that "Arabic is cursive" is a mischaracterization, insofar as it leads to
the inaccurate notion that spaces in Arabic text have the same role as
spaces in modern European latinate text. Not so. Do the numbers on a page
of Arabic text and you'll find most lexigraphic words are broken by spaces,
some by two or three. This means that initial and final forms do not always
mean word start/end, respectively. It also means that you'll find great
variety in whitespace distribution in actual texts; intra-word whitespace is
not necessarily smaller than inter-word whitespace.
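
One way to "do the numbers", reading "spaces" here as the visual gaps left
inside a word by letters that do not join to the letter after them. That
reading, the deliberately partial letter list, and the function name are
all assumptions of this sketch.

```python
# Assumption: a partial list of Arabic letters that join only to the
# preceding letter, so the cursive stroke breaks after them.
RIGHT_JOINING_ONLY = {
    "\u0627",  # ALEF
    "\u062F",  # DAL
    "\u0630",  # THAL
    "\u0631",  # REH
    "\u0632",  # ZAIN
    "\u0648",  # WAW
}


def internal_breaks(word):
    """Count positions inside word where the cursive stroke is interrupted."""
    # The last letter is skipped: a break after it is the word boundary.
    return sum(1 for ch in word[:-1] if ch in RIGHT_JOINING_ONLY)


kitabuha = "\u0643\u062A\u0627\u0628\u0647\u0627"  # unvowelled k-t-A-b-h-A
wa_kitabuha = "\u0648" + kitabuha                  # with the particle "wa"

assert internal_breaks(kitabuha) == 1     # gap after the first alef
assert internal_breaks(wa_kitabuha) == 2  # gaps after waw and alef
```

Run over a page of unvowelled text split on space characters, a count like
this gives a rough measure of how often a word is visually broken.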

> the relevant forms are "normal", "normal with explicit virama" and
> "half form", and ZW[N]J deceives the normal rendering process here as
> well.
>
> ZWL, though, does not cause "f" to become "the f-form used with i
> following", nor "i" to become "the i-form used with f preceding",
> because there are no such things, and it would be intolerably ad hoc
> to make them so.
>

I guess I'd have to differ with you on this interpretation. Seems
reasonable to me to talk of joining forms of just about anything _within a
local context_. Where 'fi' ligatures exist, the 'f' of the ligated form is
not the same form as isolated 'f'.

I'm not sure ZWL is the right solution, but I think Michael is essentially
right in arguing that it should be possible to encode the sender's intention
with respect to ligatures in latinate text. I'm inclined to think ZWL is a
good idea, based on the notion that ZWJ is for languages/scripts where
ligation performs a semantic function related to word recognition in a way
that latinate ligatures generally do not.

ZWL might even be defined to mean "use first alternate collocational form"
in Arabiform contexts; ZWNL could mean "use ordinary side-by-side ligation,
even if the default font prefers a fancy collocational glyph for this pair."
Absence of ZW*L would mean the font or stylesheet gets to choose, as is the
case now.

While we're at it, we also need a way to stretch the space between two
adjacent Arabic letterforms that don't join, but without introducing word
separation. Tatweel would work just fine if marking semantics were made
dependent on syntactic context - i.e. it should not be considered
"join-causing"; its semantics should simply be "stretch whatever's there,
be it whitespace or a ligating stroke." (Apologies if this has already been
corrected; I'm looking at v. 2 of the book.)

Also needed: a means of placing diacritics over null space - e.g. over space
or a ligating stroke. ZWJ would be good for this, except for the part about
zero width. Anyway, that's a subject for a different thread and I gotta get
back to the grindstone.

-gregg


