Re: Help with some Arabic letters

From: Gregg Reynolds (greynolds@greynolds.com)
Date: Wed Dec 15 1999 - 13:56:23 EST


Patrick Andries wrote:

> I wondered whether anyone could help me determine the usage and
> pronunciation of two Arabic script letters. I seem to be lacking
> proper reference material.
>
> For instance : U+06C7 LETTER U, is this a Uyghur letter pronounced
> [u] ? (Unicode 2.0 mentions only Kirghiz) My weak eyes cannot see
> properly what Daniels reproduced on p. 760. U+06C8 LETTER YU, is
> this Uyghur letter pronounced [y] ? Any help greatly appreciated, P.
> Andries

1. Unicode "letters" have no pronunciation. Unicode is openly hostile
to language-specific interpretations of its letterforms ("glyphs"?).

2. Unicode is especially weak in the area of arabiform "characters".
Your query illustrates this weakness quite nicely. U+06C7, "ARABIC
LETTER U", as Unicode 2.x calls it, looks suspiciously like U+0648,
U+064F. In "Arabic", this is perfectly acceptable and indeed occurs
frequently. A user would be perfectly justified in spelling a word like
"wujuwd" with U+06C7 as the first codepoint. How many implementations
would be able to handle this? By "handle", I mean not just render it,
but properly interpret it as a waw with damma.

3. So U+06C7, in written Arabic, would denote something like "woo".

4. U+06C8, in written Arabic, would be pronounced more or less as a
lengthened 'ah' sound (I don't know the IPA name for it). In
English-language grammars of Arabic, it is often referred to as a "defective"
spelling. That means that the waw (U+0648) is a radical of the word,
but for phonotactic reasons in actual pronunciation it is transformed
into an ah phoneme. In Modern Standard Arabic, an alif (U+0627) would
be used in spelling; but in many texts in the tradition, it is spelled
U+06C8, although the "ah", and not the waw, is pronounced. The classic
example of this is "SalAt", prayer, where the A may be written either as
an alif or as a waw with a superfixed alif U+0670 (which may be called
fatha, depending on whom you ask).

5. It's probably best to ignore the Unicode "letter" names; most of the
"Arabic" letter names are not used in Arabic. Arabic, unlike Latin, is
a living language, so calling a letter "Arabic" is not quite the same as
calling a letter "Latin". I suggest "Arabiform" as a suitably
language-neutral term ("Latinate" would also be an improvement).

6. Beware of Bright and Daniels. Just because it's outrageously
expensive doesn't mean it's right.

7. Consider that "Arabic is to the local language as Latin is to modern
European languages" is not such a bad formulation. Arabic, like Latin
(and Greek) in Europe, provides the vocabulary of philosophy, religion,
law, etc. for cultures that have adopted it and its associated
religion. I can't say for certain, but I would not be at all surprised
to find that spellings like U+06C8 were adopted into local languages as
part of a technical religious vocabulary, and not used elsewhere in the
written language.
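
The points above about U+06C7 and U+06C8 can be checked against
whatever Unicode data a given runtime ships. What follows is only a
minimal sketch, assuming Python and its standard unicodedata module;
the helper name same_after_normalizing is made up for illustration and
is not part of any library.

```python
import unicodedata

LETTER_U = "\u06C7"              # ARABIC LETTER U
WAW_DAMMA = "\u0648\u064F"       # waw followed by damma
LETTER_YU = "\u06C8"             # ARABIC LETTER YU
WAW_SUPER_ALEF = "\u0648\u0670"  # waw followed by superscript alef


def same_after_normalizing(a: str, b: str, form: str = "NFD") -> bool:
    """Return True if a and b are identical after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Show each precomposed-looking letter and whatever decomposition the
# character database records for it (an empty string means none).
for ch in (LETTER_U, LETTER_YU):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          repr(unicodedata.decomposition(ch)))

# Does normalization treat the single letter and the letter + mark
# sequence as the same text?
print(same_after_normalizing(LETTER_U, WAW_DAMMA))
print(same_after_normalizing(LETTER_YU, WAW_SUPER_ALEF))
```

If the two comparisons come out False, normalization alone will not
make a search for waw + damma find U+06C7; any such equivalence would
have to be supplied by the application.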

As for Uighur, I'm afraid I can't be of much assistance. I do have some
Uighur reference material, but a) I'm too lazy to try to figure it out
at the moment (sorry); and b) if it's anything like Arabic, such
orthographic considerations are deeply embedded in the fabric of the
grammar. You won't find much information on the more remote reaches of
Arabic orthography in the average "Arabic for Khawagas" textbook; you
need a pretty good grasp of the language to be able to understand the
orthography. ("Khawaga" being Egyptian Arabic for "furrener").

Lest you think I've just conceived an irrational hatred of Unicode, let
me just say this about that: it's entirely rational. Ha, ha, just
kidding! Nobody can deny that the people who have put Unicode together
have done us all a great service; but also nobody could deny that they
didn't get everything perfectly right (how could they?). I'm still
working more-or-less diligently on a rigorous model of written Arabic
that will be accessible even to those unfortunates cursed with ignorance
of the language, hoping that it will be of use in improving Unicode. But
other stuff 'n stuff keeps getting in the way. If you're doing Uighur
or some other Arabiform w-language, maybe we can combine forces.

Oh what the hell. Since you got me going, here's another small part of
Unicode that looks questionable to me. Tanween is modeled as fathatan,
dammatan, and kasratan (U+064B, U+064C, and U+064D, respectively). The sample
glyphs show a pair of strokes in each case (dammatan is actually a pair
of dammas.) Ok. Only problem is, tanween (meaning, roughly "enning",
or adding an "n" sound) is distinct from vowelization. (It signifies
indefiniteness in a noun.) Originally it was actually indicated by
writing the full noon (U+0646) after the vowel ending. Then a doubling
of the vowel mark came to be used. In the Quran, the second mark is
sometimes written a little to the side instead of directly over the
first, to indicate elision ("idghAm" is the technical term) with the
following consonant; or, it may be replaced entirely by a little meem
(U+0645), to indicate that the "n" sound shades over into an "m" sound
because of the following consonant ("nb" --> "mb").

Also, the names should be changed to reflect the composed semantics:
fatha with tanween, damma with tanween, kasra with tanween.
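
To make the naming point concrete, here is a small sketch that prints
the current character names next to the plain vowel mark each one
contains. It assumes Python's standard unicodedata module; the PAIRS
table and the describe helper are illustrative names of my own, and the
pairing itself is the interpretation argued for here, not something the
standard records.

```python
import unicodedata

# Composed tanween character paired with the plain vowel mark it contains
# (the pairing is the interpretation argued in the text, not Unicode data).
PAIRS = [
    ("\u064B", "\u064E"),  # fathatan / fatha
    ("\u064C", "\u064F"),  # dammatan / damma
    ("\u064D", "\u0650"),  # kasratan / kasra
]


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME' using the runtime's Unicode data."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for tanween, vowel in PAIRS:
    # An empty decomposition means the standard does not relate the two.
    print(describe(tanween), "|", describe(vowel),
          "| decomposition:", repr(unicodedata.decomposition(tanween)))
```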

So a search for, e.g., kitAbu, should find kitAb(un), where
(un) symbolizes U+064C; that is, 'u' modified by tanween (noonation).
And a search for kitAb(un) should arguably find kitAbuN, where 'N'
symbolizes the as-yet undefined tanween codepoint, as well as kitAbuu,
where two consecutive damma marks make a damma+tanween. But unless
I've misread the standard (entirely possible), there is nothing in
Unicode that provides for this. Furthermore, it isn't clear how one
should encode tanween in a text. Does e.g. U+064C suffice? Or should
one inscribe the vowel followed by the tanween mark - U+064E, U+064B? I
submit there should be a mapping from each tanween codepoint to the
combination of vowel mark and proper tanween mark. Which means Unicode
needs a new "tanween" codepoint, and the current composed tanween
"characters" should be defined as compositions of vowel mark + tanween.
One could also argue that a pair of (identical) vowel marks should be
interpreted as vowel+tanween, since that is after all the semantics.
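
The mapping and search behaviour described above could be sketched
roughly as follows. This is an illustration under stated assumptions,
not anything Unicode defines: the TANWEEN value is a private-use
codepoint standing in for the as-yet undefined tanween mark ('N'
above), and decompose_tanween and search_key are hypothetical names.
The sample word is written without its internal vowel marks to keep
the example short.

```python
import re

FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"

# Hypothetical stand-in for the undefined tanween mark (private-use area).
TANWEEN = "\uE000"

TANWEEN_TO_VOWEL = {FATHATAN: FATHA, DAMMATAN: DAMMA, KASRATAN: KASRA}


def decompose_tanween(text: str) -> str:
    """Rewrite every spelling of vowel+tanween as vowel mark + TANWEEN."""
    # A pair of identical vowel marks is read as vowel + tanween.
    text = re.sub(f"([{FATHA}{DAMMA}{KASRA}])\\1",
                  lambda m: m.group(1) + TANWEEN, text)
    # Each composed tanween character becomes vowel + tanween.
    for composed, vowel in TANWEEN_TO_VOWEL.items():
        text = text.replace(composed, vowel + TANWEEN)
    return text


def search_key(text: str) -> str:
    """Loose matching key: keep the vowel, ignore the tanween."""
    return decompose_tanween(text).replace(TANWEEN, "")


KITAB = "\u0643\u062A\u0627\u0628"     # kaf, teh, alef, beh
KITAB_U = KITAB + DAMMA                # kitAbu
KITAB_UN = KITAB + DAMMATAN            # kitAb(un)
KITAB_UU = KITAB + DAMMA + DAMMA       # kitAbuu
KITAB_U_N = KITAB + DAMMA + TANWEEN    # kitAbuN

# All three spellings of damma + tanween reduce to the same sequence.
print(decompose_tanween(KITAB_UN) == decompose_tanween(KITAB_UU)
      == decompose_tanween(KITAB_U_N))
# A search for kitAbu also finds kitAb(un) when tanween is ignored.
print(search_key(KITAB_U) == search_key(KITAB_UN))
```

Keeping the exact form (decompose_tanween) separate from the loose
form (search_key) lets an application choose whether indefiniteness
should count in a match.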

I've transcribed a short passage from a reference grammar explaining all
this, but haven't gotten around to translating it. When I do I'll put
it on a web page, along with scanned images illustrating the concepts,
and then make a formal proposal to The Consortium.

Hope that helps.

-gregg


