RE: Arabic - Alef Maqsurah

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 14 1999 - 18:49:43 EDT


Gregg Reynolds responded on this thread:

>
> One of the points I hope to make is precisely that the semantics of these
> characters is too fuzzy. But the only way to do that is with very specific
> examples, and the more we get from across the spectrum of languages using
> the Arabic script, the better. With that in mind, do you think you could
> find the time to provide an "abstract lexis" of the example words you cite
> below? That is, how would you spell them using unicode, and how would you
> spell them if you could invent unicode from the ground up?

The first part of this question is useful: it is important to specify
how various problematical distinctions are spelled using the characters
in the Unicode Standard. This is because implementers will need to make
such distinctions, and it is best if the standard provides guidance as
to how it should be done. That is the intent of having explanatory sections
in the standard describing script-specific behavior. The Devanagari section
is perhaps more explicit along the lines you are expecting for Arabic,
in providing particular rules for the proper combinations of Devanagari
characters -- thus being more precise about the "semantics" of the characters
than a mere list of characters and representative glyphs can be.

But engaging in an exercise of how to spell Arabic if you could invent
Unicode from the ground up might turn out to just be confusing. You should
keep this discussion in the context of existing decisions that have been
made about all this in Arabic implementations that predate the Unicode
Standard. U+0649 ARABIC LETTER ALEF MAKSURA (which, by the way, *was* spelled
"MAQSURAH" in Unicode 1.0, but whose name was changed at the insistence of
WG2 during the merger of Unicode with 10646) is in the standard because
0xE9 ARABIC LETTER ALEF MAKSURA is in ISO/IEC 8859-6. That, in turn, was
based on ECMA-114, which was based on ASMO 449, as noted in the standard
itself. So how Arabic is spelled in computer implementations is the result
of a long history of practice. Unicode didn't just invent that out of
thin air.
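
That inheritance is easy to check for yourself. Here is a minimal sketch,
assuming Python's standard "iso8859_6" codec and unicodedata module (neither
is part of the original discussion, and the variable names are only
illustrative). It decodes the legacy byte 0xE9 and reports the code point
and character name it lands on:

```python
import unicodedata

# 0xE9 is the byte cited above for ALEF MAKSURA in ISO/IEC 8859-6.
legacy_byte = b"\xe9"

# Decode with Python's bundled ISO 8859-6 codec (an assumption of this
# sketch; any conforming 8859-6 mapping table would serve equally well).
ch = legacy_byte.decode("iso8859_6")

code_point = f"U+{ord(ch):04X}"
name = unicodedata.name(ch)
print(code_point, name)

# The mapping round-trips: encoding the character gives the legacy byte back.
assert ch.encode("iso8859_6") == legacy_byte
```

The round trip in the last line is the practical point: text encoded under
the older standard converts to Unicode and back without loss.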

>
> Your point about keyboarding is of course right on the nose. This is
> another place where I think the Unicode standard could be improved
> dramatically. Input methods, like presentation methods, should in my
> opinion be treated quite distinctly from encoding design.

They absolutely are. See any number of papers presented at the International
Unicode Conferences on input methods. It is established practice, and
neither the UTC nor WG2 designs the encodings for various scripts
based on input methods. (Although they do occasionally have to deal
with petitioners for characters to be encoded that have no use except
as intermediate states for some input method or another.)

> In particular, I
> would argue that it is a mistake to associate the structure of text with
> keyboard input, as the Unicode book does.

It does not.

The text on "Logical Order" on p. 2-7, which might be taken as implying what
you state, is presented in the context of bidirectional rendering, where it
has long been understood to be in contrast to "Visual Order" -- the practice
of storing text in reversed order in the memory representation.
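
To make the contrast concrete, here is a rough sketch in Python. The sample
word and the variable names are my own illustrative choices, not anything
taken from the standard. It builds the same Arabic word in logical order
(the order in which it is read) and in the reversed "visual order" that some
older implementations stored:

```python
import unicodedata

# "kitab" (book) in logical order: KAF, TEH, ALEF, BEH, first letter first.
logical = "\u0643\u062A\u0627\u0628"

# Legacy "visual order" storage: the same letters reversed, so that a
# left-to-right display engine paints them in right-to-left reading order.
visual = logical[::-1]

for ch in logical:
    # Arabic letters carry bidirectional class "AL", which is the property
    # a bidirectional renderer uses to lay out logically ordered text.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))

# Searching, sorting and editing want the logical form; the visual form
# puts the first letter of the word at the end of the stored string.
first_logical = logical[0]
last_visual = visual[-1]
```

Nothing in this comparison involves a keyboard. The difference lies entirely
in how the text is held in memory.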

The section on "Keyboard Input" on p. 5-12 is directed at the problem of
input order of diacritics. The fact that either deadkey order or handwriting
sequence input order is implementable in Unicode is an example of input
method being distinct from logical design of the encoding. The point of
that section is to recommend handwriting sequence as the better input
method to implement for combining diacritics.
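
As a sketch of that independence, consider two hypothetical input methods
written in Python. The function names and the buffering strategy are
assumptions made for illustration, not anything the standard prescribes.
One accepts keystrokes in handwriting order (base letter, then mark), the
other in deadkey order (mark, then base letter), and both store the same
character sequence:

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def handwriting_order(keys):
    """Keystrokes arrive base-first, mark-second; store them as typed."""
    return "".join(keys)


def deadkey_order(keys):
    """Keystrokes arrive mark-first; hold each mark until its base arrives."""
    out = []
    pending = []
    for key in keys:
        if unicodedata.combining(key):
            # A combining mark: buffer it until the base letter is typed.
            pending.append(key)
        else:
            # A base letter: emit it, followed by any buffered marks.
            out.append(key)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # marks typed with no following base letter
    return "".join(out)


stored_a = handwriting_order(["e", ACUTE])
stored_b = deadkey_order([ACUTE, "e"])

# Different keystroke orders, identical stored text.
assert stored_a == stored_b
```

Either way the stored text is the base letter followed by the combining
mark. The keystroke order is a property of the input method alone.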

I agree that the Unicode Standard could make a more prominent statement
that input methods are distinct from text representation -- but up until
now, most people in the field have just assumed that. Maybe we are missing
stating the obvious more clearly.

> But literacy in Arabic is
> rather different than literacy in, say, English (to put it mildly). It
> requires a much greater degree of theoretical grammatical knowledge. So for
> a computer to behave intelligently with respect to Arabic texts, the mere
> recording of visual shapes is insufficient.

But I don't really *see* your point here. For a computer to behave
intelligently with respect to text in *any* language, the mere
recording of visual shapes is insufficient. Are we dealing with
some Arabic essentialism here? Why is this a particular problem for
the Arabic script that wouldn't equally well turn up in the Latin
script or any other?

--Ken


