RE: Arabic - Alef Maqsurah

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Thu Jul 15 1999 - 09:41:44 EDT


Dear Ken,

Thanks very much for your thoughtful reply. A few points before I head back
into the salt mines:

> -----Original Message-----
> From: kenw@sybase.com [mailto:kenw@sybase.com]
> Sent: Wednesday, July 14, 1999 8:07 PM
> To: greynolds@datalogics.com
> Cc: unicode@unicode.org; kenw@sybase.com
> Subject: RE: Arabic - Alef Maqsurah
>
>
> > this discussion. My personal project is to model the working of Arabic
> > texts, so my loyalties are to the language, not to legacy software.
>
> Here, "legacy" software includes, of course, Office 2000, which is only
> just now becoming available, with Unicode-based Arabic as part of the
> package. That's pretty new to already be scorned as "legacy software."

It's not that I scorn legacy software; that would be like scorning gravity
(or God, where a certain software maker is concerned). I just think the
natural language, and not legacy software ("encoding designs" would be a
better term here), should be the yardstick.

> > misunderstand it. Much of the confusion (IMHO) is due simply to loose
> > terminology.
>
> We keep working on the terminology, and have tightened up a lot in
> the new version 3.0 (forthcoming). --Although, unfortunately, this area
> of input methods is not scheduled for any new additions or
> clarifications at the moment.
>
> But in my opinion, most of the confusion about such issues and the
> Unicode Standard are not really the result of loose terminology, but

Yes; I should have said "unfinished" or the like instead of loose; I don't
mean to imply the editors are slackers.

> >
> > I think it probably does turn up for many languages - remember my
> > concern is with encoding texts in the language, not the script. It's
> > not a question of essentialism (whatever that is) but peculiarlism.
> > (In two words: clitics and non-concatenative morphology.)
>
> Ah, so it *is* an issue of Arabic essentialism. The morphological (or
> whatever--fill in your list of attributes here) essence of Arabic is
> different from that of other languages; therefore it must be treated in
> an essentially different way in encoding (or whatever--fill in your list
> here) to be handled correctly.
>

One request: please let's not resort to such labels. "Ism-ism" in my
opinion almost always obscures more than it enlightens. As to the specifics
of your comment, I am emphatically not making the case that Arabic has some
sort of mystical essence that deserves some kind of special treatment. On
the contrary, my point is precisely that it and many other languages that do
not share the linguistic features that make e.g. English amenable to digital
representations already receive a kind of special treatment, in that they
must be encoded using a strategy designed for one class of languages. I
think this situation could be remedied to a certain extent without breaking
Unicode.

>
> Here you are talking about the lemmatizing problem for search algorithms.
> This is, indeed, very sensitive to the morphology and morphosyntactic
> structures of particular languages. Implementers of multilingual search
> engines are well aware of this problem and must tailor their algorithms
> to deal with the particular morphologies they encounter.

But this begs the question. They don't encounter particular morphologies;
they encounter particular encodings. Encodings, natural and artificial,
always reflect some theory of language. Change the encoding and you change
the problem.

>
> Yes and yes. You just cannot build morphological structure into a
> practical character encoding -- especially one which has to be universal,
> and applicable to representation of text in any language, living or dead,
> in any script.

On the contrary, you cannot *not* build morphological structure into an
encoding. Unicode already does: lexemes are built by concatenating text
atoms. Works great for English, not so great for e.g. Arabic. How else can
one explain the space "character" as a positive element? Even for Arabic,
Unicode accommodates some level of morphological intelligence: "contextual
shaping" encodes morphology (prosodic word boundary). Every "natural"
encoding of language into visual form does the same to some extent. It's
not a question of whether, but of how much.
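
To make the contextual-shaping point concrete, here is a minimal sketch of
how a letter's form can be derived from its neighbours. It is not the
Unicode shaping algorithm: the two joining tables are a hypothetical,
drastically reduced stand-in for the real character data, and all names in
it are invented for illustration.

```python
# Hypothetical, heavily reduced joining classes; the real data lives in the
# Unicode Character Database and covers far more letters than these.
DUAL_JOINING = set("\u0628\u062A\u0643\u0644\u0645\u0647")  # beh teh kaf lam meem heh
RIGHT_JOINING = set("\u0627\u062F\u0631\u0648")             # alef dal reh waw


def joins_to_next(ch):
    """True if the letter can connect to the letter that follows it."""
    return ch in DUAL_JOINING


def joins_to_previous(ch):
    """True if the letter can connect to the letter that precedes it."""
    return ch in DUAL_JOINING or ch in RIGHT_JOINING


def contextual_forms(text):
    """Name the form each character takes; anything not in the tables gets 'none'."""
    forms = []
    for i, ch in enumerate(text):
        if not joins_to_previous(ch):
            forms.append("none")  # space, punctuation, letters not in the tables
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        linked_before = joins_to_next(prev_ch)
        linked_after = joins_to_next(ch) and joins_to_previous(next_ch)
        if linked_before and linked_after:
            forms.append("medial")
        elif linked_before:
            forms.append("final")
        elif linked_after:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms
```

Calling contextual_forms on kaf, teh, beh, beh gives the third letter a
medial form; put a space before the last beh and the third letter becomes
final. In that sense the rendered shape carries the word boundary.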

>
> Ah, but here is where your basic approach, as it applies to the Unicode
> Standard, breaks down. The Arabic *script* is what is encoded in the
> standard. The Arabic script is used to represent text in hundreds of
> non-Semitic languages, from Urdu, to Malay, to Uighur, to Persian, to
> Pashto, to Swahili, as well as the Semitic core languages. Those
> languages run the complete gamut of morphological types. You can't just
> reconstruct the encoding of the Arabic script in Unicode to tailor it to
> the Arabic *language* morphology, when it can and is used to represent
> text in all the other languages, including many Indo-European languages,
> for that matter.

Understood, but my view is that this is where Unicode itself gets a little
confused. Does it or does it not encode presentational (visual) form?
Arabic presentational forms (by which I mean all letterforms used in
writing) are indeed used in many languages from different families, but do
these presentational forms share the same character semantics across
languages? I sincerely doubt it. So an encoding that works across
languages must sharply distinguish between character semantics and
presentational form. Which gets us back to grammatical encoding. BTW, in
one of your earlier notes you pointed out that handwriting sequence is the
preferable guide to implementing input methods. This is the alternative:
grammatical sequence.
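
As a small illustration of the character versus presentational-form split
as Unicode already records it (this is not the grammatical encoding argued
for here), the sketch below reads the compatibility decomposition of an
Arabic presentation-form code point with Python's standard unicodedata
module. The helper name underlying is made up for the example.

```python
import unicodedata


def underlying(ch):
    """Split a presentation form into (form tag, nominal character).

    Returns (None, ch) when ch has no tagged single-character decomposition.
    """
    parts = unicodedata.decomposition(ch).split()
    if len(parts) == 2 and parts[0].startswith("<"):
        return parts[0].strip("<>"), chr(int(parts[1], 16))
    return None, ch


# U+FEEF and U+FEF0 are the isolated and final presentation forms of
# U+0649 ARABIC LETTER ALEF MAKSURA.
isolated_form = underlying("\uFEEF")
final_form = underlying("\uFEF0")

# Compatibility normalization folds both shapes back to the one character.
folded = unicodedata.normalize("NFKC", "\uFEEF\uFEF0")
```

Both shapes map back to the same nominal code point, which says nothing
about whether that code point means the same thing in every language
written with the script.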

I'll put together some examples of what I mean this weekend.

>
> > The argument I will make (eventually; it's quitting time just now) is
> > that such structural information is rightfully part of the standard
> > encoding; the intelligence should be moved from specialized logic in
> > software and embedded in the text.
>
> Nope.
>
> You can always embed it in specialized text devoted to the Arabic
> language in particular (either through markup or your own
> morphologically-based encoding in private use space), but that is not
> the design point of the Unicode Standard for plain text representation.
>

Not to be provocative, but isn't it interesting how "plain text" just seems
to work for some languages and not for others?

I don't want you to misconstrue my remarks as a mere whine about the woeful
state of the world; I've actually got some concrete suggestions that I'll
post this weekend along with some more background info. I think they're
technically feasible, which probably dooms them ;.)

Thanks again,

Gregg


