RE: Arabic - Alef Maqsurah

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 14 1999 - 21:06:53 EDT


Gregg,

> >
> > But engaging in an exercise of how to spell Arabic if you could invent
> > Unicode from the ground up might turn out to just be confusing. You should
>
> I hope not; the idea is to clarify. This doesn't exclude complexity; Arabic
> writing practice is complex.

Granted. We hint at that in the book, while admittedly only scratching
the surface of what needs to be covered for Arabic rendering. Much
fuller discussions can be found in Thomas Milo's and Kamal Mansour's
papers in the Unicode Conference Proceedings, among others.

>
> > keep this discussion in the context of existing decisions that have been
> > made about all this in Arabic implementations that predate the Unicode
> > Standard.
> ...
> > itself. So how Arabic is spelled in computer implementations is the result
> > of a long history of practice. Unicode didn't just invent that out of
> > thin air.
> >
>
> No doubt. But personally I've never accepted "that's the way it's always
> been done" as sufficient reason to accept the status quo. On the contrary, I
> think we're obligated to point out what we perceive to be problems. Who
> knows, maybe the powers that be in Unicode will get something useful out of
> this discussion. My personal project is to model the working of Arabic
> texts, so my loyalties are to the language, not to legacy software.

Here, "legacy" software includes, of course, Office 2000, which is only
just now becoming available, with Unicode-based Arabic as part of the
package. That's pretty new to already be scorned as "legacy software."
 
> Of course I understand people with money invested in current solutions
> also have an interest in the discussion. But a clear (mathematical) model
> would be in everybody's interest. (OTOH, if people on the list find this
> excruciating and obnoxious just tell me and I'll go mutter in a corner.)

A clear mathematical model will be in everybody's interest if it does
in fact clarify Arabic implementations. But my sense of the level at which
you are dealing with Arabic text is that it might in fact confuse people
about what the basic relations are between the characters encoded in
the standard and the rendering rules required to get correct minimal
display, and the more complex rendering rules required to get quality
typography in various styles. It is already hard enough to keep explaining
to people not to use any of the FBXX ligature characters, and to only
use the FEXX positional form compatibility characters when interoperating
with certain older code-page-based implementations.
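
(To make that concrete: the presentation forms in those blocks carry
compatibility decompositions back onto the ordinary U+06xx letters, which is
exactly why they add nothing for newly created text. A quick Python sketch,
purely for illustration, showing NFKC folding two arbitrarily chosen
presentation forms back onto the nominal characters:)

    import unicodedata

    # Two arbitrarily chosen presentation forms from the compatibility blocks.
    samples = ["\uFE8D",   # ARABIC LETTER ALEF ISOLATED FORM
               "\uFEFB"]   # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

    for ch in samples:
        folded = unicodedata.normalize("NFKC", ch)   # compatibility folding
        print(unicodedata.name(ch), "->",
              " + ".join(unicodedata.name(c) for c in folded))
    # ARABIC LETTER ALEF ISOLATED FORM -> ARABIC LETTER ALEF
    # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM ->
    #     ARABIC LETTER LAM + ARABIC LETTER ALEF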

I'd rather we get everybody straight on the relationship between the
existing encoding and expected presentation (including all the complexity
of bidirectional rendering), before embarking on a speculation about
how it might have been encoded to better take Arabic morphology into account.
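
(The division of labor there is worth spelling out: the backing store keeps
the characters in logical order, and each character carries a bidirectional
category that the display process uses, per the Unicode Bidirectional
Algorithm, to reorder and then shape the line. A rough Python look at the
relevant property, with an arbitrary mixed sample string:)

    import unicodedata

    # Logical-order backing store: Latin letters, an Arabic word, digits.
    text = "abc \u0627\u0644\u0643\u062A\u0627\u0628 123"

    for ch in text:
        # 'L' = left-to-right letter, 'AL' = Arabic (right-to-left) letter,
        # 'EN' = European number, 'WS' = whitespace.
        print("U+%04X" % ord(ch), unicodedata.bidirectional(ch))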

> >
> > I agree that the Unicode Standard could make a more prominent
> > statement
> > that input methods are distinct from text representation --
> > but up until
> > now, most people in the field have just assumed that. Maybe
> > we are missing
> > stating the obvious more clearly.
>
> Indeed I think some explicit discussion of the relation between input
> method, text syntax, and output representation would improve the standard.
> Lots of people are looking at this unicode thing, and in my experience,
> people outside of the technical field and even many within it completely
> misunderstand it. Much of the confusion (IMHO) is due simply to loose
> terminology.

We keep working on the terminology, and have tightened up a lot in
the new version 3.0 (forthcoming). Unfortunately, though, this area
of input methods is not scheduled for any new additions or
clarifications at the moment.

But in my opinion, most of the confusion about such issues and the
Unicode Standard is not really the result of loose terminology; it
results from the inherent complexity of the topic and of writing
systems, combined with a failure to engage and study the topic
long enough to break free of simplistic assumptions going in.
It is incredibly hard to break people of the assumption that it ought
to be easier and more consistent to just provide a character encoding
for each glyph, so that they can keep a simple rendering
model of character=glyph for display -- as we saw in the recent
spate of discussions about dotless-j.

>
> >
> > > But literacy in Arabic is rather different than literacy in, say,
> > > English (to put it mildly). It requires a much greater degree of
> > > theoretical grammatical knowledge. So for a computer to behave
> > > intelligently with respect to Arabic texts, the mere recording of
> > > visual shapes is insufficient.
> >
> > But I don't really *see* your point here. For a computer to behave
> > intelligently with respect to text in *any* language, the mere
> > recording of visual shapes is insufficient. Are we dealing with
> > some Arabic essentialism here? Why is this a particular problem for
> > the Arabic script that wouldn't equally as well turn up in the Latin
> > script or any other?
> >
>
> I think it probably does turn up for many languages - remember my concern is
> with encoding texts in the language, not the script. It's not a question of
> essentialism (whatever that is) but peculiarism. (In two words: clitics
> and non-concatenative morphology.)

Ah, so it *is* an issue of Arabic essentialism. The morphological (or whatever--
fill in your list of attributes here) essence of Arabic is different from
that of other languages; therefore it must be treated in an essentially
different way in encoding (or whatever--fill in your list here) to be
handled correctly.

> It goes back to the question of what a
> reasonable literate should be entitled to expect out of digital text. For
> example, in Arabic (or any Semitic language for that matter) this means,
> among other things, intelligence with respect to word structure. One could
> argue that it is unreasonable for an English reader to expect to be able to
> search for all words related to "sing", for example, and find "sung",
> "song", etc.

Here you are talking about the lemmatizing problem for search algorithms.
This is, indeed, very sensitive to the morphology and morphosyntactic
structures of particular languages. Implementers of multilingual search
engines are well aware of this problem and must tailor their algorithms
to deal with the particular morphologies they encounter. Perhaps that is
the kind of forum where this discussion re Arabic belongs.
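
(In other words, the matching has to run through a surface-form-to-lemma
mapping -- or, for Semitic languages, a surface-form-to-root mapping -- that
sits entirely outside the character encoding. A toy Python sketch, with a
hard-coded table standing in for real morphological analysis; the English
words and the Arabic forms on the root k-t-b are just illustrative:)

    # Toy lemmatizer: a hard-coded table stands in for real morphological
    # analysis. Keys are surface forms; values are the lemma or root.
    LEMMA = {
        "sung": "sing", "song": "sing", "sings": "sing",
        # Arabic surface forms sharing the root k-t-b:
        "\u0643\u062A\u0628": "k-t-b",              # kataba  'he wrote'
        "\u0643\u0627\u062A\u0628": "k-t-b",        # kaatib  'writer'
        "\u0645\u0643\u062A\u0648\u0628": "k-t-b",  # maktuub 'written'
    }

    def matches(query, word):
        # Compare lemmas/roots, not raw code point sequences.
        return LEMMA.get(query, query) == LEMMA.get(word, word)

    print(matches("sing", "sung"))    # True, via the table
    print(matches("sing", "song"))    # True
    print(matches("sing", "singer"))  # False: the toy table doesn't know 'singer'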

> I mean the encoding of such information in the data could
> reasonably be considered beyond the scope of the encoding definition, so the
> capability would be a matter of specialized software.

Yes and yes. You just cannot build morphological structure into a practical
character encoding -- especially one which has to be universal, and applicable
to representation of text in any language, living or dead, in any script.

> But the sequential
> nature of text encoding matches up well with the structure of some languages
> (e.g. English and presumably most Indo-European languages; also Japanese);
> you can do a lot just by manipulating sequential strings of "characters",
> and you don't need much in the way of metalinguistic codes that are not
> already present as inking characters (e.g. punctuation).
>
> But it doesn't work that way in Arabic and many other languages. It's not
> only perfectly natural to think in terms of word roots abstracted away from
> particular word forms, it would be positively un-Arabic to think about the
> language in any other way.

Ah, but here is where your basic approach, as it applies to the Unicode
Standard, breaks down. The Arabic *script* is what is encoded in the
standard. The Arabic script is used to represent text in hundreds of
non-Semitic languages, from Urdu to Malay to Uighur to Persian to Pashto to
Swahili, as well as the Semitic core languages. Those languages run the
complete gamut of morphological types. You can't just reconstruct the
encoding of the Arabic script in Unicode to tailor it to the Arabic
*language* morphology, when it can be and is used to represent text in all
the other languages, including many Indo-European languages, for that matter.

> The argument I will make (eventually; it's
> quitting time just now) is that such structural information is rightfully
> part of the standard encoding; the intelligence should be moved from
> specialized logic in software and embedded in the text.

Nope.

You can always embed it in specialized text devoted to the Arabic language
in particular (either through markup or your own morphologically-based
encoding in private use space), but that is not the design point of
the Unicode Standard for plain text representation.
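
(For instance, a specialized Arabic corpus could carry the morphology in
markup wrapped around the plain text, along these lines; the element and
attribute names in this Python sketch are invented purely for the example:)

    # Hypothetical: morphology carried in markup around the plain text,
    # not inside the character encoding. Names are invented for the example.
    import xml.etree.ElementTree as ET

    w = ET.Element("w", attrib={"root": "k-t-b", "pos": "noun"})
    w.text = "\u0643\u0627\u062A\u0628"   # plain-text surface form: kaatib 'writer'

    print(ET.tostring(w, encoding="unicode"))
    # e.g. <w root="k-t-b" pos="noun">كاتب</w>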

--Ken

> (I hope you'll
> forgive me if this has all been done to death previously.) But of course
> this is the kind of argument that must be backed up by numerous detailed
> examples, and right now I've got to run.
>
> g'night,
>
> gregg
>


