RE: Arabic - Alef Maqsurah

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Wed Jul 14 1999 - 17:10:57 EDT


Dear Roozbeh

Terrific! Thanks very much for responding; this is exactly what I hoped
would happen. I'm pretty clear on Arabic, but have very little idea of how
the script is used by other literate communities.

I'm going to study your response tonight; for now I just wanted to
acknowledge it and blather a little.

One of the points I hope to make is precisely that the semantics of these
characters is too fuzzy. But the only way to do that is with very specific
examples, and the more we get from across the spectrum of languages using
the Arabic script, the better. With that in mind, do you think you could
find the time to provide an "abstract lexis" of the example words you cite
below? That is, how would you spell them using Unicode, and how would you
spell them if you could invent Unicode from the ground up?

Your point about keyboarding is of course right on the nose. This is
another place where I think the Unicode standard could be improved
dramatically. Input methods, like presentation methods, should in my
opinion be treated quite distinctly from encoding design. In particular, I
would argue that it is a mistake to associate the structure of text with
keyboard input, as the Unicode book does. More on this later (I'm supposed
to be working on something here at work, so I can only write a brief note).

Also, you're right that people will tend to think visually. I would add that
this is a case of people not being aware of the extent to which they "see"
abstract non-visual information in written text. Not because they're
stupid, but because language is so transparent. But literacy in Arabic is
rather different than literacy in, say, English (to put it mildly). It
requires a much greater degree of theoretical grammatical knowledge. So for
a computer to behave intelligently with respect to Arabic texts, the mere
recording of visual shapes is insufficient.

Gotta go now, but I look forward to further echat on this topic.

gregg

> -----Original Message-----
> From: Roozbeh Pournader [mailto:roozbeh@sina.sharif.ac.ir]
> Sent: Wednesday, July 14, 1999 7:43 PM
> To: Reynolds, Gregg
> Subject: Re: Arabic - Alef Maqsurah
>
>
>
>
> On Wed, 14 Jul 1999, Reynolds, Gregg wrote:
>
> > The perspective I start with is this: if the computer had been invented in
> > an Arabic-speaking culture, what would character encoding look like? What
> > would have been expected as ordinary?
>
> I shall correct you: an Arabic-writing culture would be a better phrase.
> For the second question, as an Arabic-writer, I should first say that
> something like an Arabic-writing culture does not allow the invention of
> computers able to work with anything other than numbers. One of the
> reasons the Western world developed a computer able to render human
> script was that the Latin script is more mathematical and simpler than
> other scripts.
>
> > In other words, an encoding that wants to work across a multitude of
> > languages must support the encoding of grammatical information, and
> > abstract character semantics is insufficient for this purpose. Or another
> > way to think about it is that an encoding should capture the active
> > knowledge used by the reader and absent from the presentational form. My
> > reading of Unicode in its current form is that it does this to some extent
> > but needs to do more. Arabic provides some good examples of where this is
> > needed.
>
> Ok, but don't forget that even when things like HTML were really a
> generic markup, people tried their best to use it like a visual one. The
> outcome was very bad: you should test your pages with all the browsers, at
> least with IE, Netscape Communicator, and lynx.
>
> > This is a case where Unicode could be improved by a sharper distinction
> > between abstract character semantics and (abstract) presentational
> > semantics. If the character semantics of U+0649 are to be "alef maqsurah",
> > then the text should make clear that the presentation of the character may
> > use either of two forms, "dotless ya" or "alef" (alef=U+0627). In this
> > case, we need another codepoint for dotless ya. The alternative would be to
> > change the semantics of U+0649 to "dotless yah" (a purely presentational
> > semantics) and add a codepoint for alef maqsurah. I would prefer the
> > former, myself.
>
> I have problems with this, especially because in the Persian-speaking,
> Arabic-writing world, there are many kinds of "YEH":
>
> Normal Yeh, used in words like "Doosti" (friendship) with two
> different pronunciations: like "y" in "yes" and like "ee" in "sheep".
> This is encoded U+06CC (FARSI YEH) with dots in initial and medial forms,
> and without dots in final and isolated forms. The semantics is somewhat
> similar to Arabic Yeh.
>
> Dotted Yeh, that some people use to distinguish the words "Doosti"
> (friendship) and "Doosti" (a friend). They write the latter using dotted
> yeh. This is only used in final and isolated forms, and is written like
> Arabic Yeh (U+064A).
>
> Both Arabic ones, U+064A when quoting Arabic phrases, and U+0649 for
> Arabic names like "Kubra" that are used in Persian. The latter is
> sometimes used with SUPERSCRIPT ALEF (U+0670) and sometimes without it,
> with no difference in meaning. SUPERSCRIPT ALEF is used to emphasize that
> the letter is pronounced "a" (like in "park") and not "ee". (It is also
> interesting to know that the Persian name for U+0670 is Alef Magsureh,
> and not something meaning SUPERSCRIPT ALEF.)
>
> In this weird world, what should a typist do? Should we have four YEHs on
> our keyboard? Will she use them when she sees no difference in the shape?
> Will she be clever enough? Are the cases well-separated and unambiguous?
> The answer is obvious. So, a Persian Unicode-capable computer editor will
> have only two YEHs: a dotless yeh, encoded U+06CC, and a dotted one
> encoded U+064A (or a new code which may be named SINGULARITY YEH or
> INDEFINITY YEH).
>
> If we provide many ways of encoding a word, without considering all the
> ambiguities, something weird will arise. Like the situation with HTML.
> Things like the <author> tag are almost never used there, and instead
> people use different combinations of other tags.
>
> --Roozbeh
>
>
>
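
For readers who want to poke at the code points discussed in the quoted
message, here is a minimal Python sketch. It only assumes the standard
unicodedata module. The helper names (describe, fold_yeh, FOLD) are
hypothetical, the spelling used for "Doosti" is an assumed transcription
(dal, waw, seen, teh, then a yeh), and the folding policy is just one
possible choice for comparison purposes, not something either correspondent
proposes.

```python
import unicodedata

# Code points mentioned in the quoted message.
ALEF_MAKSURA = "\u0649"
ARABIC_YEH = "\u064A"
FARSI_YEH = "\u06CC"
SUPERSCRIPT_ALEF = "\u0670"


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' using the names in Python's Unicode database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Hypothetical folding table, for comparison only: map the yeh-like letters
# to one code point and drop the optional superscript alef.
FOLD = {
    ord(ALEF_MAKSURA): ord(FARSI_YEH),
    ord(ARABIC_YEH): ord(FARSI_YEH),
    ord(SUPERSCRIPT_ALEF): None,
}


def fold_yeh(text: str) -> str:
    """Fold yeh variants so visually similar spellings compare equal."""
    return text.translate(FOLD)


# "Doosti" typed with two different final yehs (assumed spelling).
STEM = "\u062F\u0648\u0633\u062A"  # dal, waw, seen, teh
doosti_farsi = STEM + FARSI_YEH
doosti_arabic = STEM + ARABIC_YEH

if __name__ == "__main__":
    for ch in (ALEF_MAKSURA, ARABIC_YEH, FARSI_YEH, SUPERSCRIPT_ALEF):
        print(describe(ch))
    # The stored strings differ even where the final shapes may look alike.
    print(doosti_farsi == doosti_arabic)                      # False
    print(fold_yeh(doosti_farsi) == fold_yeh(doosti_arabic))  # True
```

Note that folding like this throws away exactly the distinction some writers
use the dotted yeh for ("friendship" versus "a friend"), so it would suit
searching or matching rather than storage.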


