Re: Help with some Arabic letters

From: Patrick Andries (pandries@iti.qc.ca)
Date: Wed Dec 15 1999 - 10:22:06 EST


Greg,

> Greg Reynolds wrote :
> Patrick Andries wrote:
>
> > I wondered whether anyone could help me determine the usage and
> > pronunciation of two Arabic script letters. I seem to be lacking
> > proper reference material.
> >
> > For instance : U+06C7 LETTER U, is this a Uyghur letter pronounced
> > [u] ? (Unicode 2.0 mentions only Kirghiz) My weak eyes cannot see
> > properly what Daniels reproduced on p. 760. U+06C8 LETTER YU, is
> > this Uyghur letter pronounced [y] ? Any help greatly appreciated, P.
> > Andries
>
> 1. Unicode "letters" have no pronunciation. Unicode is openly hostile
> to language-specific interpretations of its letterforms ("glyphs"? ).

An aversion which is well grounded when a letter is shared by several
languages, or when a language uses letters no American can pronounce properly
(let's say Q in Arabic).

But my question is slightly different: how is the Uyghur letter represented
by the should-not-be-pronounced Unicode letter U+06C7 pronounced?
Do I thus escape the stake of Unicode orthodoxy?

> 2. Unicode is especially weak in the area of arabiform "characters".
> Your query illustrates this weakness quite nicely. U+06C7, "ARABIC
> LETTER U", as Unicode 2.x calls it, looks suspiciously like U+0648,
> U+064F.

True.

> In "Arabic", this is perfectly acceptable and indeed occurs
> frequently. A user would be perfectly justified in spelling a word like
> "wujuwd" with U+06C7 as the first codepoint.

Hmm. Interesting problem. I suppose in practice this will not happen since,
at input, an Arabic user will key in a waw followed by a damma, while a
Uyghur will use U+06C7 (a single key on a traditional keyboard?).
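
As a rough check of that assumption, here is a minimal Python sketch (the
constant and function names are mine, purely illustrative). It uses only the
standard unicodedata module to test whether any normalization form would
treat the single letter and the waw + damma sequence as the same text:

```python
import unicodedata

# The two competing spellings of the same visual shape.
ARABIC_LETTER_U = "\u06C7"      # single code point, used for Uyghur/Kirghiz
WAW_PLUS_DAMMA = "\u0648\u064F"  # ARABIC LETTER WAW followed by ARABIC DAMMA


def same_after_normalization(a, b, form):
    """Return True if a and b are identical once normalized to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        equal = same_after_normalization(ARABIC_LETTER_U, WAW_PLUS_DAMMA, form)
        print(form, "equal" if equal else "different")
```

U+06C7 carries no decomposition, so every form reports the two spellings as
different. A text keyed in one way will therefore not match a search keyed
in the other unless the application adds its own equivalence.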

> 3. So U+06C7, in written Arabic, would denote something like "woo".

But this is then a glyph representation, since there is no single letter in
Arabic resembling U+06C7, whereas in Uyghur it appears to represent a
single sound and letter.

> 3. It's probably best to ignore the Unicode "letter" names; most of the
> "Arabic" letter names are not used in Arabic. Arabic, unlike Latin, is
> a living language, so calling a letter "Arabic" is not quite the same as
> calling a letter "Latin". I suggest "Arabiform" as a suitably
> language-neutral form ("Latinate" would also be an improvement).

I would tend to agree with a change in names.

> 4. Beware of Bright and Daniels. Just because it's outrageously
> expensive doesn't mean it's right.

Yes, I have already come across omissions and errors in the Khmer part.

> 4. Consider that "Arabic is to the local language as Latin is to modern
> European languages" is not such a bad formulation. Arabic, like Latin
> (and Greek) in Europe, provides the vocabulary of philosophy, religion,
> law, etc. for cultures that have adopted it and its associated
> religion. I can't say for certain, but I would not be at all surprised
> to find that spellings like U+06C8 were adopted into local languages as
> part of a technical religious vocabulary, and not used elsewhere in the
> written language.

Except that U+06C8 as [y] (ü in German) would seem to fill a need in Turkic
languages: a way to write "ü", a sound that does not exist in Arabic or Farsi.

> Oh what the hell. Since you got me going, here's another small part of
> Unicode that looks questionable to me. Tanween is modeled as fathatan,
> dammatan, and kasratan (U+064B, C, and D, respectively). The sample
> glyphs show a pair of strokes in each case (dammatan is actually a pair
> of dammas.) Ok. Only problem is, tanween (meaning, roughly "enning",
> or adding an "n" sound) is distinct from vowelization. (it signifies
> indefiniteness in a noun.) Originally it was actually indicated by
> writing the full noon (U+0646) after the vowel ending. Then a doubling
> of the vowel mark came to be used. In the Quran, the second mark is
> sometimes written a little to the side instead of directly over the
> first, to indicate elision ("idghAm" is the technical term) with the
> following consonant; or, it may be replaced entirely by a little meem
> (U+0645), to indicate that the "n" sound shades over into an "m" sound
> because of the following consonant ("nb" --> "mb").
>
> Also, the names should be changed to reflect the composed semantics:
> fatha with tanween, damma with tanween, kasra with tanween.
>
> So a search for, e.g., kitAbu, should find kitAb(un), where
> (un) symbolizes U+064C; that is, 'u' modified by tanween (noonation).
> And a search for kitAb(un) should arguably find kitAbuN, where 'N'
> symbolizes the as-yet undefined tanween codepoint, as well as kitAbuu,
> where two consecutive damma marks makes a damma+tanween. But unless
> I've misread the standard (entirely possible), there is nothing in
> Unicode that provides for this.

I'm not sure how this could be solved at the level of Unicode.
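
It could perhaps be handled one level up, in the searching application. The
following Python sketch is only an assumption about how such folding might
look, not anything Unicode defines; fold_tanween and loose_find are
hypothetical names. It maps each tanween mark (U+064B, U+064C, U+064D) to its
plain vowel mark and collapses a doubled vowel mark into one, so that kitAbu,
kitAb(un) and kitAbuu all compare equal:

```python
import re

# Tanween mark -> corresponding plain vowel mark (illustrative folding only).
TANWEEN_TO_VOWEL = str.maketrans({
    "\u064B": "\u064E",  # ARABIC FATHATAN -> ARABIC FATHA
    "\u064C": "\u064F",  # ARABIC DAMMATAN -> ARABIC DAMMA
    "\u064D": "\u0650",  # ARABIC KASRATAN -> ARABIC KASRA
})

# Two identical consecutive vowel marks (fatha, damma or kasra).
_DOUBLED_VOWEL = re.compile("([\u064E\u064F\u0650])\\1")


def fold_tanween(text):
    """Reduce tanween, however it was keyed, to the bare vowel mark."""
    folded = text.translate(TANWEEN_TO_VOWEL)
    return _DOUBLED_VOWEL.sub(r"\1", folded)


def loose_find(needle, haystack):
    """Substring search that ignores the presence or absence of tanween."""
    return fold_tanween(needle) in fold_tanween(haystack)


if __name__ == "__main__":
    kitab = "\u0643\u062A\u0627\u0628"   # kaf, teh, alef, beh
    kitabu = kitab + "\u064F"            # ends in damma
    kitabun = kitab + "\u064C"           # ends in dammatan
    kitabuu = kitab + "\u064F\u064F"     # ends in two dammas
    print(loose_find(kitabu, kitabun), loose_find(kitabun, kitabuu))
```

The folding is symmetric: a search for kitAb(un) will also find plain kitAbu,
which may or may not be wanted. A stricter search would have to fold only the
text being searched, or keep the tanween as a separate attribute.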

> Furthermore, it isn't clear how one
> should encode tanween in a text. Does e.g. U+064C suffice? Or should
> one inscribe the vowel followed by the tanween mark - U+064E, U+064B? I
> submit there should be a mapping from each tanween codepoint to the
> combination of vowel mark and proper tanween mark. Which means Unicode
> needs a new "tanween" codepoint, and the current composed tanween
> "characters" should be defined as compositions of vowel mark + tanween.
> One could also argue that a pair of (identical) vowel marks should be
> interpreted as vowel+tanween, since that is after all the semantics.
>
> I've transcribed a short passage from a reference grammar explaining all
> this, but haven't gotten around to translating it. When I do I'll put
> it on a web page, along with scanned images illustrating the concepts,
> and then make a formal proposal to The Consortium.

I'll be very interested to read the translation.

> Hope that helps.

Thanks a lot for your explanations.

Patrick


