RE: Latin ligatures and Unicode

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Wed Dec 22 1999 - 16:17:01 EST


> -----Original Message-----
> From: John Cowan [mailto:jcowan@reutershealth.com]
> Sent: Wednesday, December 22, 1999 2:07 PM
>
> > Still, decomposing such a form into its constituent root
> > (k,t,b) and theme (ma-prefix, internal shape) is utterly elementary for
> > anybody with a little Arabic. That's how it would be entered and looked up
> > in dictionaries, for example.
>
> What that shows is that brute-force string search isn't very useful for
> Arabic, but we knew that anyhow.

Correction: "isn't very useful for Arabic" *as we now encode it*. ;)

Not to be a tease or anything, but I'm transcribing a bunch of Arabic text
using an ASCII-based scheme of my devising that goes much further than
current encodings toward the kind of full modeling that I'd like to see.
I've found it's possible to explicitly encode much more of the implicit
information embedded in Arabic text than I expected. After several
iterations I'm at the point where I think I can do a complete encoding of
even Quranic text using only ASCII characters in a "flat" encoding. I.e.,
it only takes appropriately defined combining characters. With a text
encoded in this way I can use standard tools like grep and Perl to do
root-based searches, detect various grammatical nuances, etc. I've got a
bit more tweaking to do on it, have to re-edit the texts, and then write
some scripts to generate PDF and do various analyses, and then I'll post it.
Probably not till late Jan. though, at the earliest.
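
To make the idea of a root-based search over a flat ASCII encoding concrete,
here is a minimal sketch in Python. It is not the scheme described above,
whose details are not given here. It assumes a purely hypothetical
convention in which root consonants are written in uppercase and everything
belonging to the theme (prefixes, vowels) in lowercase, so that "maKTaB"
carries the root K-T-B. The function names and the sample words are
likewise invented for illustration.

```python
import re

# ASSUMPTION (hypothetical convention, not the author's actual scheme):
# uppercase letters are root consonants, lowercase letters are theme material.

def root_of(word: str) -> str:
    """Return the root consonants of an encoded word, e.g. 'maKTaB' -> 'KTB'."""
    return "".join(ch for ch in word if ch.isupper())

def theme_of(word: str) -> str:
    """Return the theme with each root consonant replaced by the slot marker 'C'."""
    return "".join("C" if ch.isupper() else ch for ch in word)

def root_regex(root: str) -> str:
    """Build a regex matching any word whose root consonants are exactly `root`.

    The same pattern could be handed to grep -E or Perl, since only
    lowercase theme letters are allowed between the root consonants.
    """
    gap = "[a-z]*"
    return r"\b" + gap + gap.join(re.escape(c) for c in root) + gap + r"\b"

def find_by_root(text: str, root: str) -> list[str]:
    """Return every encoded word in `text` built on `root`, in order."""
    return re.findall(root_regex(root), text)

if __name__ == "__main__":
    sample = "KaTaBa KiTaaB maKTaB maDRaSa KaaTiB DaRaSa"
    print(find_by_root(sample, "KTB"))
    print(root_of("maKTaB"), theme_of("maKTaB"))
```

The point of the sketch is only that once root and theme are distinguished
in the encoding itself, a plain regular expression suffices to search by
root; a realistic scheme would also have to handle weak roots, doubled
consonants, and diacritics, none of which this toy convention covers.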

> > Seems a shame; a little formal semantics would go a long way.
>
> Feel free to contribute it.
>

Oky Doky. Hope you're not allergic to Z; it's perfect for something like
this. (http://archive.comlab.ox.ac.uk/z.html; intro at
http://spivey.oriel.ox.ac.uk/~mike/zrm/). One of the things I'd really like
to do is provide a formal semantics of bidirectional text, so we can get rid
of the algorithm.

> > In some cases one may want to place diacritics over some whitespace or a
> > tatweel stroke, within a word.
>
> Ah, in the category of "whitespace" you include the whitespace between
> non-ligated characters, for example the space between "a" and "b" in
> this example: "ab". Whereas when I talk of whitespace, I mean whitespace
> that is wider than normal inter-letter spacing (absent ties between
> letters), as in "a b". Does this difference simply make no sense in
> Arabic script?
>

I guess I'm thinking in terms of "where there ain't no ink". But I think
this is one of those areas where we could use some more refined terminology.
How's about "liminal space"? I'm not sure what it means, but I remember it
from an Anthropology class I took 150 years ago; I think it describes
various kinds of cultic in-betweenness, or threshold space, something like
that. I propose we use "liminal space" to mean the space between the
semiotically significant parts of letterforms, irrespective of ink. So both
joined and non-joined sequences of letterforms are punctuated by liminal
space, which is crossed by tie strokes in the former case and not in the
latter. So getting back to the original issue: one might want to place a
diacritic in liminal space.

And to answer your last question, I'm not sure, not having thought of it in
those terms. I'm inclined to say that your definition makes sense *if* we
substitute "liminal space" for "whitespace", but I need to think about it
some more.

Thanks,

-gregg


