Re: Latin ligatures and Unicode

From: John Cowan (
Date: Wed Dec 22 1999 - 15:07:28 EST

"Reynolds, Gregg" wrote:

> If the standard says, as
> Mark just noted in a message, that they are to be ignored for the purposes
> of join analysis, then I stand corrected; but I haven't been able to find
> anything (admittedly I'm looking at v. 2) that says this.

I think this was a post-2.0 clarification. Probably on the Web site.

> Well, sort of, but "word" isn't sufficient. You'd still want to be able to
> distinguish between distinct lexemes packaged as a single lexigraphic word
> and lexigraphic words - ordinarily whitespace delimited, but then again,
> because Arabic encodes word boundaries in the letterforms themselves, one
> could also remove SP word boundaries and use ZWSP (I think ;).

Probably. If you need to be more refined than that, you are looking at
tagging text rather than writing it, and you need either Private Zone codes
or higher-level markup.

> Still, decomposing such a form into its constituent root
> (k,t,b) and theme (ma-prefix, internal shape) is utterly elementary for
> anybody with a little Arabic. That's how it would be entered and looked up
> in dictionaries, for example.

What that shows is that brute-force string search isn't very useful for
Arabic, but we knew that anyhow.
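
A minimal sketch of why, assuming unvocalized spellings of "maktab" and
"kitab" and a deliberately naive in-order match (the names ROOT, MAKTAB,
KITAB and contains_in_order are made up for illustration, not taken from any
library):

```python
# Hypothetical illustration: the root k-t-b and two unvocalized words.
ROOT = "\u0643\u062a\u0628"          # kaf, teh, beh
MAKTAB = "\u0645\u0643\u062a\u0628"  # meem, kaf, teh, beh
KITAB = "\u0643\u062a\u0627\u0628"   # kaf, teh, alef, beh


def contains_in_order(word: str, root: str) -> bool:
    """Return True if the letters of root occur in word in the same order,
    not necessarily adjacent (a plain subsequence test)."""
    remaining = iter(word)
    # "ch in remaining" consumes the iterator up to the first match,
    # so each root letter must be found after the previous one.
    return all(ch in remaining for ch in root)


for word in (MAKTAB, KITAB):
    print(ROOT in word, contains_in_order(word, ROOT))
```

The substring test finds the root in the first word only because the three
letters happen to be adjacent there; the alef in the second word breaks it.
The in-order test finds both, but it would also accept unrelated words that
merely contain those letters in that order, so it is no substitute for real
morphological analysis.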

> Seems a shame; a little formal semantics would go a long way.

Feel free to contribute it.

> But the "li-" in "li-<<-al-..." must be lam-initial, so I think ZWJ would be
> the thing for it. Otherwise wouldn't the guillemets send it into isolate
> form?

Quite right, my mistake.
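
For concreteness, a small sketch of the sequence in question: lam, then
ZERO WIDTH JOINER, then the opening guillemet. The names LAM, ZWJ, GUILLEMET
and describe are hypothetical; the code only builds and lists the code
points, and whether the lam is actually shown in initial form depends on the
rendering engine and font.

```python
import unicodedata

LAM = "\u0644"        # ARABIC LETTER LAM
ZWJ = "\u200d"        # ZERO WIDTH JOINER
GUILLEMET = "\u00ab"  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK


def describe(text: str) -> list[str]:
    """List each code point in text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


plain = LAM + GUILLEMET         # nothing after the lam asks for joining
joined = LAM + ZWJ + GUILLEMET  # ZWJ after the lam asks for a joining form

for line in describe(joined):
    print(line)
```

The two strings differ by exactly one invisible character, so the
distinction survives in plain text without any higher-level markup.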

> In some cases one may want to place diacritics over some whitespace or a
> tatweel stroke, within a word.

Ah, in the category of "whitespace" you include the whitespace between
non-ligated characters, for example the space between "a" and "b" in
this example: "ab". Whereas when I talk of whitespace, I mean whitespace
that is wider than normal inter-letter spacing (absent ties between
letters), as in "a b". Does this difference simply make no sense in
Arabic script?


