Serious problems with Arabic

From: Roozbeh Pournader (roozbeh@sharif.edu)
Date: Thu Nov 16 2000 - 07:13:02 EST


Dear All,
 
I have serious problems with Unicode Arabic. The main problem is with the
Arabic shaping rules in TUS 3.0, pages 192--197. I think these should be
changed in the ways suggested below. Would someone please guide me on how
I should prepare an official proposal?
 
1. "Bidi and Cursive Joining". Page 192 mentions:
 
        "An implementation may choose to restate the following rules
         according to logical order so as to apply before the bidirectional
         algorithm's reordering phase. In this case, the words right and
         left as used in this section would become preceding and
         following."
 
But the effect is not the same! Consider the sequence
 
        U+0628 U+202D U+0627 U+0631 U+202C
        BEH LRO ALEF REH PDF
 
If you apply bidi to this, you'll obtain
 
        ALEF REH BEH
 
which will then become
 
        ALEF<isolated> REH<final> BEH<initial>
 
after cursive joining. But now reverse the order: first apply joining and
then bidi. Keep in mind that LRO is transparent with regard to joining
(page 192, Table 8-2 lists all format marks as transparent; RLM is given
as an example, so we can deduce that by format marks TUS means the
characters in the character class Cf, "Other, Format"). First you'll have
 
        BEH<initial> LRO ALEF<final> REH<isolated> PDF
 
and after bidi,
 
        ALEF<final> REH<isolated> BEH<initial>
 
The former case is unacceptable because BEH and REH, which are not adjacent
in logical order (the order in which one reads the text aloud), have joined
together, and the reader cannot tell that they were not adjacent. The latter
form is also unacceptable, since you have a final ALEF that joins to nothing
(you have not requested this, because there is no ZWJ in the text). This
situation may well occur in Arabic-enabled editors while the user is editing
the text, and both solutions seem problematic. UAX #9, in Reordering
Resolved Levels, recommends the latter case.
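
The two orders can be compared with a small Python sketch. It is only a toy
model, not an implementation of TUS 3.0 or UAX #9: the joining classes are
the ones assumed in this message, the reordering handles a single LRO ... PDF
span in a right-to-left paragraph, and the names JOINING, shape, visual_order,
bidi_then_join and join_then_bidi are made up for illustration.

```python
# Joining classes assumed for this example only:
# D = dual-joining, R = right-joining, T = transparent.
JOINING = {"BEH": "D", "ALEF": "R", "REH": "R", "LRO": "T", "PDF": "T"}

FORMS = {(True, True): "medial", (True, False): "final",
         (False, True): "initial", (False, False): "isolated"}


def shape(names):
    """Shape a right-to-left sequence given in reading order."""
    def neighbour(i, step):
        # Nearest non-transparent character before (-1) or after (+1).
        j = i + step
        while 0 <= j < len(names) and JOINING[names[j]] == "T":
            j += step
        return names[j] if 0 <= j < len(names) else None

    shaped = []
    for i, name in enumerate(names):
        cls = JOINING[name]
        if cls == "T":
            shaped.append((name, None))  # controls take no positional form
            continue
        prev, nxt = neighbour(i, -1), neighbour(i, +1)
        joins_prev = cls in ("D", "R") and prev is not None and JOINING[prev] == "D"
        joins_next = cls == "D" and nxt is not None and JOINING[nxt] in ("D", "R")
        shaped.append((name, FORMS[(joins_prev, joins_next)]))
    return shaped


def visual_order(names):
    """Indices of the visible characters, left to right, for a right-to-left
    paragraph containing at most one LRO ... PDF span (a simplification)."""
    level, pairs = 1, []
    for i, name in enumerate(names):
        if name == "LRO":
            level = 2
        elif name == "PDF":
            level = 1
        else:
            pairs.append((level, i))
    # Reverse runs from the highest level down to the lowest odd level.
    for lvl in (2, 1):
        start = 0
        while start < len(pairs):
            if pairs[start][0] < lvl:
                start += 1
                continue
            end = start
            while end < len(pairs) and pairs[end][0] >= lvl:
                end += 1
            pairs[start:end] = pairs[start:end][::-1]
            start = end
    return [i for _, i in pairs]


def bidi_then_join(names):
    visual = [names[i] for i in visual_order(names)]
    # "Right" in visual order plays the role of "preceding", so shape the
    # reversed list and flip the result back.
    return shape(visual[::-1])[::-1]


def join_then_bidi(names):
    shaped = shape(names)
    return [shaped[i] for i in visual_order(names)]


SEQ = ["BEH", "LRO", "ALEF", "REH", "PDF"]
print(bidi_then_join(SEQ))
print(join_then_bidi(SEQ))
```

Under these assumptions the first call gives ALEF isolated, REH final, BEH
initial, and the second gives ALEF final, REH isolated, BEH initial, which
are the two results discussed above.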
 
My suggestion is to make the five controls RLE, LRE, RLO, LRO, and PDF
non-joining rather than transparent, which will solve the problem. First,
someone who uses the explicit marks wants the text rendered at different
levels; second, applications may then apply joining either before or
after bidi (they should consider the Retaining Format Codes part of
UAX #9 if they want to do joining after bidi).
 
2. "Transparency of Canonical Decomposition". The standard claims
transparency under canonical decomposition: text should behave the same
when it is decomposed. But this is not true for the shaping of U+06C0,
ARABIC LETTER HEH WITH YEH ABOVE. It decomposes to U+06D5 U+0654, which is
ARABIC LETTER AE + ARABIC HAMZA ABOVE, while HEH WITH YEH ABOVE is in the
right-joining class and AE is in the non-joining class. This creates
problems, for example, with ordinary Persian texts that use HEH WITH YEH
ABOVE. If one has the very common
 
   <KHAH> <ALEF> <NOON> <HEH WITH YEH ABOVE>
 
(I'll follow the logical order; bidi is of no importance here), and then
shapes it, he will get
 
   <KHAH-initial> <ALEF-final> <NOON-initial> <HEH WITH YEH ABOVE-final>
 
but if he decomposes that and then applies the shaping, he will get
 
   <KHAH> <ALEF> <NOON> <AE> <HAMZA ABOVE>
 
and then
 
   <KHAH-initial> <ALEF-final> <NOON-isolated> <AE-isolated> <HAMZA ABOVE>
 
The last two are visually identical to <HEH WITH YEH ABOVE-isolated>. Compare
the two results: NOON is now isolated instead of initial, and the last letter
is isolated instead of final. This is unacceptable.
 
My suggestion would be decomposing U+06C0 to
 
        U+0647 U+0654 U+200C
        <ARABIC LETTER HEH> <ARABIC HAMZA ABOVE> <ZERO WIDTH NON-JOINER>
 
which seems to be the only solution. I stress again that this case occurs
very frequently in Persian, where HEH WITH YEH ABOVE is very common.
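
The mismatch in point 2, and the effect of the proposed sequence, can be
checked with a short Python sketch. The decomposition comes from the standard
unicodedata module; the joining classes are an assumption, copied from what
this message says about TUS 3.0, and shape is a hypothetical toy shaper that
only reports positional forms in logical order.

```python
import unicodedata

# Joining classes as described in this message (assumed, not read from the
# Unicode database): D = dual-joining, R = right-joining,
# U = non-joining, T = transparent.
JOINING = {
    "\u062E": "D",  # KHAH
    "\u0627": "R",  # ALEF
    "\u0646": "D",  # NOON
    "\u0647": "D",  # HEH
    "\u06C0": "R",  # HEH WITH YEH ABOVE
    "\u06D5": "U",  # AE
    "\u0654": "T",  # HAMZA ABOVE
    "\u200C": "U",  # ZERO WIDTH NON-JOINER
}
ZWNJ = "\u200C"

FORMS = {(True, True): "medial", (True, False): "final",
         (False, True): "initial", (False, False): "isolated"}


def shape(text):
    """Positional form of each character of `text`, in logical order."""
    def neighbour(i, step):
        # Nearest non-transparent character before (-1) or after (+1).
        j = i + step
        while 0 <= j < len(text) and JOINING[text[j]] == "T":
            j += step
        return text[j] if 0 <= j < len(text) else None

    forms = []
    for i, ch in enumerate(text):
        cls = JOINING[ch]
        if cls == "T" or ch == ZWNJ:
            forms.append(None)  # marks and ZWNJ have no positional form
            continue
        prev, nxt = neighbour(i, -1), neighbour(i, +1)
        joins_prev = cls in ("D", "R") and prev is not None and JOINING[prev] == "D"
        joins_next = cls == "D" and nxt is not None and JOINING[nxt] in ("D", "R")
        forms.append(FORMS[(joins_prev, joins_next)])
    return forms


precomposed = "\u062E\u0627\u0646\u06C0"
decomposed = unicodedata.normalize("NFD", precomposed)
proposed = "\u062E\u0627\u0646\u0647\u0654\u200C"

print(shape(precomposed))
print(shape(decomposed))
print(shape(proposed))
```

With these classes, NOON is initial in the precomposed and proposed sequences
but isolated after canonical decomposition, and the letter after it is final
in the first and third cases but isolated in the second.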


