Re: Arabic - Alef Maqsurah

From: Roozbeh Pournader (roozbeh@sina.sharif.ac.ir)
Date: Wed Jul 14 1999 - 20:44:12 EDT


On Wed, 14 Jul 1999, Reynolds, Gregg wrote:

> The perspective I start with is this: if the computer had been invented in
> an Arabic-speaking culture, what would character encoding look like? What
> would have been expected as ordinary?

A correction: "an Arabic-writing culture" would be a better phrase. As for
the second question, as an Arabic writer I should first say that an
Arabic-writing culture would not have allowed the invention of computers
able to work with anything other than numbers. One of the reasons the
Western world developed a computer that could render human script is that
the Latin script is more mathematical and simpler than other scripts.

> In other words, an encoding that wants to work across a multitude of
> languages must support the encoding of grammatical information, and
> abstract character semantics is insufficient
> for this purpose. Or another way to think about it is that an encoding
> should capture the active knowledge used by the reader and absent from the
> presentational form. My reading of Unicode in its current form is that it
> does this to some extent but needs to do more. Arabic provides some good
> examples of where this is needed.

OK, but don't forget that even when something like HTML really was a
generic markup, people tried their best to use it as a visual one. The
outcome was very bad: you have to test your pages with all the browsers,
or at least with IE, Netscape Communicator, and Lynx.

> This is a case where Unicode could be improved by a sharper distinction
> between abstract character semantics and (abstract) presentational
> semantics. If the character semantics of U+0649 are to be "alef maqsurah",
> then the text should make clear that the presentation of the character may
> use either of two forms, "dotless ya" or "alef" (alef=U+0627). In this
> case, we need another codepoint for dotless ya. The alternative would be to
> change the semantics of U+0649 to "dotless yah" (a purely presentational
> semantics) and add a codepoint for alef maqsurah. I would prefer the
> former, myself.

I have problems with this, especially because in the Persian-speaking,
Arabic-writing world there are many kinds of "YEH":

Normal Yeh, used in words like "Doosti" (friendship) with two
different pronunciations: like "y" in "yes" and like "ee" in "sheep".
This is encoded as U+06CC (FARSI YEH), with dots in initial and medial
forms and without dots in final and isolated forms. The semantics are
somewhat similar to Arabic Yeh.

Dotted Yeh, which some people use to distinguish the words "Doosti"
(friendship) and "Doosti" (a friend). They write the latter using dotted
yeh. This is only used in final and isolated forms, and is written like
Arabic Yeh (U+064A).

Both Arabic ones: U+064A when quoting Arabic phrases, and U+0649 for
Arabic names like "Kubra" that are used in Persian. The latter is
sometimes used with SUPERSCRIPT ALEF (U+0670) and sometimes without it,
with no difference in meaning. SUPERSCRIPT ALEF is used to emphasize that
the letter is pronounced "a" (as in "park") and not "ee". (It is also
interesting to know that the Persian name for U+0670 is Alef Magsureh,
and not something meaning SUPERSCRIPT ALEF.)
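
For readers who want to check the code points named above, here is a
minimal Python sketch that prints the standard Unicode character name of
each one using the standard-library unicodedata module. The helper name
describe is an illustration only and does not come from this thread.

```python
import unicodedata

# Code points mentioned in this message, in the order they appear.
YEH_RELATED = ["\u06CC", "\u064A", "\u0649", "\u0670"]

def describe(ch):
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

for ch in YEH_RELATED:
    print(describe(ch))
```

Note that the names reported are the formal Unicode character names, which
may differ from the names used by Persian or Arabic writers.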

In this weird world, what should a typist do? Should we have four YEHs on
our keyboard? Will she use them when she sees no difference in the shape?
Will she be clever enough? Are the cases well-separated and unambiguous?
The answer is obvious. So a Persian Unicode-capable editor will have
only two YEHs: a dotless yeh, encoded as U+06CC, and a dotted one,
encoded as U+064A (or a new code which may be named SINGULARITY YEH or
INDEFINITY YEH).
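
One practical consequence of typists not distinguishing the variants is
that software comparing Persian text may want to treat them as equal. The
following Python sketch shows one possible way to do that, under the
assumption that every yeh-like letter is folded to U+06CC and SUPERSCRIPT
ALEF is dropped before comparison. The table YEH_FOLD and the functions
fold_yeh and same_word are hypothetical names, and the folding policy is
an assumption for illustration, not something Unicode prescribes.

```python
# Hypothetical folding table (an assumption, not a Unicode rule):
# every yeh-like letter maps to U+06CC, and U+0670 is removed.
YEH_FOLD = {
    0x0649: 0x06CC,  # ARABIC LETTER ALEF MAKSURA -> FARSI YEH
    0x064A: 0x06CC,  # ARABIC LETTER YEH -> FARSI YEH
    0x0670: None,    # ARABIC LETTER SUPERSCRIPT ALEF is dropped
}

def fold_yeh(text):
    """Return text with the yeh variants folded for comparison only."""
    return text.translate(YEH_FOLD)

def same_word(a, b):
    """True if two strings differ only in which yeh variant was typed."""
    return fold_yeh(a) == fold_yeh(b)

# "Kubra" typed with U+0649 plus superscript alef, and with U+06CC.
with_alef = "\u06A9\u0628\u0631\u0649\u0670"
with_farsi_yeh = "\u06A9\u0628\u0631\u06CC"
print(same_word(with_alef, with_farsi_yeh))
```

Folding like this loses the distinction between the variants, so it suits
searching and matching rather than storage.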

If we allow many ways of encoding a word without considering all the
ambiguities, something weird will arise, as happened with HTML. Things
like an <author> tag are almost never used there; instead, people use
different combinations of other tags.

--Roozbeh


