RE: Arabic Script.

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Mar 13 2001 - 05:51:45 EST


Nidhal Zarrad wrote:
> Dear Madam/Sir
> I hope you would help me with the following problem:
> I am trying to write a program that displays Arabic words in DOS mode
> (Win32 independent). To achieve this task I am using C++ (non-visual).
> I converted Arabic TTF fonts into CHR fonts. When I run the program I get
> unlinked Arabic letters instead of whole words,
> e.g. (ن ض ا ل) instead of (نضال).
> Do you know the cause of this problem? And is it possible to write in
> Arabic in DOS mode where Windows is absent?

Oh, oh. I am afraid that this won't happen magically.

Properly displaying text is the task of the so-called "Unicode Rendering
Engine" and, in this case, the Unicode Rendering Engine is YOUR program.

Arabic is one of the so-called "complex scripts" of Unicode. The complexity
lies in the fact that displaying Arabic text requires something more than
simply drawing one picture (or "glyph") for each character.

Displaying Unicode Arabic requires two "complex" steps:

1) The "Bi-directional Algorithm".
2) The "Arabic Shaping Algorithm".

((1))
The bi-directional algorithm is needed for any text that contains words in
any right-to-left script (Arabic, Hebrew, Syriac or Thaana, the script of
Maldivian).

This is needed because, of course, you cannot draw Arabic from left to
right. But you cannot simply draw it right to left either, because some
parts of the text (numbers and words in other languages) must remain
left-to-right even within Arabic text.

If you have not already implemented the Bi-directional Algorithm in your
program, you can find a description of it in
<http://www.unicode.org/unicode/reports/tr9/tr9-8.html> and a C/C++
reference implementation in
<http://www.unicode.org/unicode/reports/tr9/BidiReferenceCpp/>.
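
To make the reordering problem concrete, here is a deliberately simplified
sketch. It is NOT the Bi-directional Algorithm of TR9: it assumes the whole
line is right-to-left, treats only ASCII digits and ASCII letters as
left-to-right, and ignores neutrals, embedding levels and mirroring. The
names isLeftToRight and toVisualOrder are made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: true for the characters this sketch treats as
// left-to-right (ASCII digits and ASCII letters only).
bool isLeftToRight(char32_t c) {
    return (c >= U'0' && c <= U'9') ||
           (c >= U'A' && c <= U'Z') ||
           (c >= U'a' && c <= U'z');
}

// Converts one right-to-left line from logical (typing) order to visual
// order, i.e. the order in which glyphs are drawn starting at the left edge.
std::vector<char32_t> toVisualOrder(std::vector<char32_t> line) {
    // Step 1: reverse everything, as if the whole line were right-to-left.
    std::reverse(line.begin(), line.end());

    // Step 2: every left-to-right run is now backwards, so reverse each
    // such run once more to restore its internal order.
    std::size_t i = 0;
    while (i < line.size()) {
        if (!isLeftToRight(line[i])) {
            ++i;
            continue;
        }
        std::size_t j = i;
        while (j < line.size() && isLeftToRight(line[j])) {
            ++j;
        }
        std::reverse(line.begin() + i, line.begin() + j);
        i = j;
    }
    return line;
}
```

With this sketch, a number typed after an Arabic word ends up to the left of
the word on screen, but its digits still read left to right. A real program
should follow TR9 or the reference implementation instead.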

((2))
The second step, the Arabic Shaping Algorithm, is needed to select the
proper shape (isolated, initial, medial, final) for each Arabic letter.

This is needed because Unicode Arabic should normally be encoded with
characters in range 0x0600 to 0x06FF
(<http://charts.unicode.org/Web/U0600.html>). These characters are called
"logical" Arabic characters because they do not have a defined shape.

For instance, the "n" in Nidhal should always be encoded with character
0x0646, regardless of whether it is isolated, initial, medial, or final.

Don't be confused by the fact that the Unicode chart shows a final form for
character 0x0646: that is only an EXAMPLE!

(By the way, in your mail, you encoded your name using characters in the
range 0xFB50 to 0xFEFC. Beware that those characters, like most characters
in range 0xF900 to 0xFFFD, should NOT be used to encode text. They are
called "compatibility characters" and are there for a variety of reasons, but
are not to be used in files, e-mails, etc.)
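
A program can check stored text for such characters with a small test like
the one sketched below. The two ranges are those of the Arabic Presentation
Forms-A and Presentation Forms-B blocks; the function names are invented for
this example.

```cpp
#include <vector>

// True if c lies in Arabic Presentation Forms-A (U+FB50..U+FDFF) or
// Arabic Presentation Forms-B (U+FE70..U+FEFF).
bool isArabicPresentationForm(char32_t c) {
    return (c >= 0xFB50 && c <= 0xFDFF) || (c >= 0xFE70 && c <= 0xFEFF);
}

// True if any character of the text is a presentation form, which suggests
// the text was stored in "visual" rather than "logical" encoding.
bool containsPresentationForms(const std::vector<char32_t>& text) {
    for (char32_t c : text) {
        if (isArabicPresentationForm(c)) {
            return true;
        }
    }
    return false;
}
```

For example, the name in the question stored logically is the sequence
0x0646 0x0636 0x0627 0x0644, and none of these triggers the check.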

Unicode does not specify exactly how Arabic shaping has to be done. The
reason for this vagueness is that there are many ways of displaying Arabic,
ranging from extremely simple (e.g. à la typewriter: with only two forms per
letter) to extremely complex (e.g. the beautiful Riq'ah cursive). For this
reason, Unicode just decided to encode the LOGIC of the text, and leave the
graphic details to the implementers of programs and fonts.

However, Unicode does provide a simple default way of displaying Arabic, and
this is where characters 0xFB50 to 0xFEFC come in.

Your program can make a temporary copy of the string to be displayed and, in
that temporary copy, replace each character in range 0x0600 to 0x06FF with
one of the corresponding characters in range 0xFB50 to 0xFEFC.

As each character 0x0600-to-0x06FF corresponds to SEVERAL characters
0xFB50-to-0xFEFC, your program has to choose the correct shape according to
the position of the character in the word or, in other words, by analyzing
the neighboring characters.
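
The one-to-several correspondence can be held in a small table. The sketch
below covers only the four letters of the name in the question; a real table
would be generated from the Unicode data files. The names Form, ShapeEntry,
kShapes and presentationForm are made up for this example.

```cpp
// The four possible shapes, in the order in which the compatibility
// characters for a letter appear in the 0xFE70-to-0xFEFC area.
enum Form { Isolated = 0, Final = 1, Initial = 2, Medial = 3 };

struct ShapeEntry {
    char32_t logical;   // character in range 0x0600 to 0x06FF
    char32_t forms[4];  // indexed by Form; 0 means "no such form"
};

// Hand-written subset: alif has only two forms, the others have four.
static const ShapeEntry kShapes[] = {
    {0x0627, {0xFE8D, 0xFE8E, 0, 0}},            // alif
    {0x0636, {0xFEBD, 0xFEBE, 0xFEBF, 0xFEC0}},  // dad
    {0x0644, {0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0}},  // laam
    {0x0646, {0xFEE5, 0xFEE6, 0xFEE7, 0xFEE8}},  // noon
};

// Returns the compatibility character for the requested form, or the
// logical character unchanged if the table has no such entry.
char32_t presentationForm(char32_t logical, Form form) {
    for (const ShapeEntry& e : kShapes) {
        if (e.logical == logical && e.forms[form] != 0) {
            return e.forms[form];
        }
    }
    return logical;
}
```

Deciding WHICH form to ask for is the part that depends on the neighboring
characters; this table only answers the lookup once that decision is made.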

Also notice that some pairs of logical characters 0x0600-to-0x06FF (e.g. a
laam followed by an alif) correspond to a single visual character
0xFB50-to-0xFEFC.
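
As an illustration, the laam-alif case could be handled by a pass like the
following before the shapes are chosen. This is only a sketch: it always
emits the isolated ligature (the final form is the next code point), and it
does not look for diacritics sitting between the laam and the alif. The
function names are invented for this example.

```cpp
#include <cstddef>
#include <vector>

// Returns the isolated-form ligature for laam followed by the given alif
// variant, or 0 if the character is not one of the four alif variants.
char32_t lamAlefLigature(char32_t alef) {
    switch (alef) {
        case 0x0622: return 0xFEF5;  // alif with madda above
        case 0x0623: return 0xFEF7;  // alif with hamza above
        case 0x0625: return 0xFEF9;  // alif with hamza below
        case 0x0627: return 0xFEFB;  // plain alif
        default:     return 0;
    }
}

// Replaces every laam (0x0644) that is immediately followed by an alif
// variant with the single ligature character; other characters are copied.
std::vector<char32_t> mergeLamAlef(const std::vector<char32_t>& text) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == 0x0644 && i + 1 < text.size()) {
            char32_t ligature = lamAlefLigature(text[i + 1]);
            if (ligature != 0) {
                out.push_back(ligature);
                ++i;  // the alif has been consumed by the ligature
                continue;
            }
        }
        out.push_back(text[i]);
    }
    return out;
}
```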

Unicode does not have a reference implementation for the default Arabic
shaping. However, some basic data about it are available.

<http://www.unicode.org/Public/UNIDATA/ArabicShaping.txt> is useful to
extract the "Joining Type" for each character:

- R ("Right linking"): all 2-form letters that can only link to the letter
on their right side (e.g., alif, waw, etc.)

- D ("Dual linking"): all 4-form letters that can link to the letters on
both sides.

- C ("Link Causing"): characters that do not change form themselves, but
cause neighboring letters to take the linked form (e.g.: "____", the
tatweel).

- U ("Not linking"): characters that do not link to others (e.g., hamza,
punctuation characters, and all non-Arabic characters).

- Transparent characters: these don't influence linking, so they should be
ignored when analyzing neighboring characters (e.g., shadda or diacritic
vowels).
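
The joining types above are enough to pick a form for each letter. The
following is one possible sketch, not a reference implementation: the
joiningType function is a hand-written stand-in for data that would really
be read from ArabicShaping.txt, it knows only a handful of characters, and
all names are invented for this example. The text is in logical order, so
"previous" means the letter drawn to the right.

```cpp
#include <cstddef>
#include <vector>

enum class Joining { R, D, C, U, T };  // T = transparent
enum class Form { Isolated, Final, Initial, Medial };

// Hypothetical subset of the joining types; everything else counts as U.
Joining joiningType(char32_t c) {
    switch (c) {
        case 0x0627: case 0x062F: case 0x0631: case 0x0648:
            return Joining::R;  // alif, dal, reh, waw
        case 0x0636: case 0x0644: case 0x0646:
            return Joining::D;  // dad, laam, noon
        case 0x0640:
            return Joining::C;  // tatweel
        case 0x064E: case 0x0651:
            return Joining::T;  // fatha, shadda
        default:
            return Joining::U;
    }
}

// Chooses a form for every character. Characters that are not R or D keep
// Form::Isolated, meaning "draw as is".
std::vector<Form> chooseForms(const std::vector<char32_t>& text) {
    std::vector<Form> forms(text.size(), Form::Isolated);
    for (std::size_t i = 0; i < text.size(); ++i) {
        Joining self = joiningType(text[i]);
        if (self != Joining::R && self != Joining::D) {
            continue;
        }
        // Link to the previous letter: the nearest non-transparent
        // character before this one must be able to link forward (D or C).
        bool joinsPrev = false;
        for (std::size_t p = i; p-- > 0; ) {
            Joining j = joiningType(text[p]);
            if (j == Joining::T) continue;
            joinsPrev = (j == Joining::D || j == Joining::C);
            break;
        }
        // Link to the next letter: only D letters can do this, and the
        // nearest non-transparent character after must accept a link.
        bool joinsNext = false;
        if (self == Joining::D) {
            for (std::size_t n = i + 1; n < text.size(); ++n) {
                Joining j = joiningType(text[n]);
                if (j == Joining::T) continue;
                joinsNext = (j != Joining::U);
                break;
            }
        }
        forms[i] = joinsPrev ? (joinsNext ? Form::Medial : Form::Final)
                             : (joinsNext ? Form::Initial : Form::Isolated);
    }
    return forms;
}
```

For the name in the question (0x0646 0x0636 0x0627 0x0644) this gives
initial, medial, final, isolated: the last laam stays isolated because the
alif before it can only link to the letter on its right.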

<http://www.unicode.org/Public/UNIDATA/UnicodeData.txt> is useful to extract
the correspondence between logical characters (0x0600 to 0x06FF) and
compatibility graphic characters (0xFB50 to 0xFEFC).
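
A sketch of how that correspondence might be extracted follows. It assumes
the usual layout of UnicodeData.txt: semicolon-separated fields, with the
code point in the first field and, in the sixth field, a decomposition such
as "<initial> 0646". The struct and function names are invented for this
example, and error handling is minimal.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct FormMapping {
    char32_t presentation;          // compatibility character (first field)
    std::string form;               // "isolated", "final", "initial", "medial"
    std::vector<char32_t> logical;  // one logical character, or two for a ligature
};

// Parses one line of UnicodeData.txt. Returns true and fills `out` only if
// the line describes an isolated, final, initial or medial form.
bool parseUnicodeDataLine(const std::string& line, FormMapping& out) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ';')) {
        fields.push_back(field);
    }
    if (fields.size() < 6) return false;

    const std::string& decomp = fields[5];
    if (decomp.empty() || decomp[0] != '<') return false;
    std::size_t close = decomp.find('>');
    if (close == std::string::npos) return false;

    std::string tag = decomp.substr(1, close - 1);
    if (tag != "isolated" && tag != "final" &&
        tag != "initial" && tag != "medial") {
        return false;
    }

    out.presentation = static_cast<char32_t>(std::stoul(fields[0], nullptr, 16));
    out.form = tag;
    out.logical.clear();
    // The rest of the field is a space-separated list of hex code points.
    std::istringstream codePoints(decomp.substr(close + 1));
    std::string hex;
    while (codePoints >> hex) {
        out.logical.push_back(static_cast<char32_t>(std::stoul(hex, nullptr, 16)));
    }
    return !out.logical.empty();
}
```

Running this over every line and keeping the successful results yields both
the per-letter forms and the ligatures, since a ligature line simply lists
two logical characters.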

Roman Czyborra has a sample Perl implementation of the algorithm on his
site: <http://czyborra.com/arabjoin/arabjoin>.

Ciao.
_ Marco


