L2/01-214

From: Roozbeh Pournader [roozbeh@sharif.edu]
Sent: Saturday, May 19, 2001 2:52 PM

Proposal for Clarification of Arabic Cursive Joining Behaviour

Due to existing ambiguities in Arabic joining and some problems created
by adding canonical decompositions for some of the Arabic characters in
Unicode 3.0, I propose these additions to Chapter 8.2, Section "Cursive
Joining" of the standard.

1. Specify that canonical decompositions are NOT transparent with regard
to Arabic joining. Require the character stream to be in "Normalization
Form C" before Arabic joining is done, or require implementations to act
as if it were. (The problem is due to the character U+06C0, ARABIC LETTER
HEH WITH YEH ABOVE, a right-joining character which decomposes to
U+06D5 U+0654, where the first character is a non-joining one. Using NFD
instead of NFC therefore breaks joining in existing text.)
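
The decomposition described in item 1 can be checked with a short
Python sketch. It assumes the standard 'unicodedata' module; the helper
name 'code_points' is only illustrative.

```python
import unicodedata

HEH_WITH_YEH_ABOVE = "\u06C0"  # ARABIC LETTER HEH WITH YEH ABOVE

# Canonical decomposition splits the letter into a base plus a combining mark.
nfd = unicodedata.normalize("NFD", HEH_WITH_YEH_ABOVE)
# Recomposing restores the single precomposed code point.
nfc = unicodedata.normalize("NFC", nfd)

def code_points(text):
    """Format each character of text as a U+XXXX string (illustrative helper)."""
    return ["U+%04X" % ord(ch) for ch in text]

print(code_points(nfd))
print(code_points(nfc))
```

A joining algorithm that sees the NFD form meets U+06D5 first, so it
would pick a different joining class than it would for U+06C0.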

2. Clarify which characters not in the Arabic block fall into the
Non-joining class and which into the Transparent class, based on their
General Category. I recommend these:

Transparent: Mark
Non-joining: Letter, Number, Separator, Punctuation, Symbol

For the "Other, control" and "Other, format" characters, list their Arabic
joining class explicitly. If the character is in the following list, it
should be considered to be in the specified class:

Join-causing:
	U+200D ZERO WIDTH JOINER

Non-joining:
	U+200C ZERO WIDTH NON-JOINER
	U+202A LEFT-TO-RIGHT EMBEDDING
	U+202B RIGHT-TO-LEFT EMBEDDING
	U+202C POP DIRECTIONAL FORMATTING
	U+202D LEFT-TO-RIGHT OVERRIDE
	U+202E RIGHT-TO-LEFT OVERRIDE
	U+206A INHIBIT SYMMETRIC SWAPPING
	U+206B ACTIVATE SYMMETRIC SWAPPING
	U+206C INHIBIT ARABIC FORM SHAPING
	U+206D ACTIVATE ARABIC FORM SHAPING
	U+206E NATIONAL DIGIT SHAPES
	U+206F NOMINAL DIGIT SHAPES
	U+FEFF ZERO WIDTH NO-BREAK SPACE

Otherwise, look at the Bidirectional Character Type: if it is B
(Paragraph Separator), S (Segment Separator), or WS (Whitespace), the
character must be considered Non-joining; otherwise, it is Transparent.
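
The rules in item 2 could be sketched in Python roughly as follows. This
is only an illustration of the proposal as worded here, not of any
published Unicode data. The function name 'proposed_joining_class' is
hypothetical, and applying the bidi fallback to every remaining "Other"
category (not just control and format) is an assumption.

```python
import unicodedata

# Explicit assignments for "Other, control" / "Other, format" characters,
# copied from the list in item 2.
JOIN_CAUSING = {0x200D}
EXPLICIT_NON_JOINING = (
    {0x200C, 0xFEFF}
    | set(range(0x202A, 0x202F))   # U+202A..U+202E
    | set(range(0x206A, 0x2070))   # U+206A..U+206F
)

def proposed_joining_class(ch):
    """Joining class of a character outside the Arabic block (sketch)."""
    cp = ord(ch)
    if 0x0600 <= cp <= 0x06FF:
        # Arabic block characters are meant to come from ArabicShaping.txt.
        raise ValueError("not applicable to Arabic block characters")
    category = unicodedata.category(ch)
    if category[0] == "M":          # Mark
        return "Transparent"
    if category[0] in "LNZPS":      # Letter, Number, Separator, Punctuation, Symbol
        return "Non-joining"
    if cp in JOIN_CAUSING:
        return "Join-causing"
    if cp in EXPLICIT_NON_JOINING:
        return "Non-joining"
    # Fallback for the remaining characters: use the bidi character type.
    if unicodedata.bidirectional(ch) in ("B", "S", "WS"):
        return "Non-joining"
    return "Transparent"
```

For example, under these rules a TAB (bidi type S) is Non-joining, while
U+200E LEFT-TO-RIGHT MARK (a format character with bidi type L) is
Transparent.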

3. Make the Arabic joining data automatically computable from the Unicode
data files. Include all the characters in the Arabic block in the
'ArabicShaping.txt' file, or at least all the letters (U+0621 ARABIC
LETTER HAMZA is missing). Also, include the characters in the general
categories "Other, control" and "Other, format" in a separate section of
that file.
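
To illustrate what "automatically computable" could mean in item 3,
here is a minimal parser sketch. It assumes a semicolon-separated layout
of code point, name, joining type, and joining group, with '#' starting
a comment. The sample lines are illustrative only and are not a copy of
the real file; 'parse_arabic_shaping' is a hypothetical name.

```python
def parse_arabic_shaping(lines):
    """Map code point -> joining type from ArabicShaping.txt-style lines."""
    joining = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        # Assumed layout: fields[0] = code point (hex), fields[2] = joining type.
        joining[int(fields[0], 16)] = fields[2]
    return joining

SAMPLE = [
    "# illustrative lines only, not a copy of the real file",
    "0627; ALEF; R; ALEF",
    "0628; BEH; D; BEH",
]

table = parse_arabic_shaping(SAMPLE)
print(table)
```

With a complete file, a character missing from the table (as HAMZA is
said to be above) would force implementations to guess its class.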

4. Require Arabic joining to be done after the bidi character levels of
the text have been determined. That information should then be used to
find the left and right neighbours of each character for Arabic joining.
The current specification allows joining to be done before bidi, using
the 'preceding' and 'following' characters instead of the 'right' and
'left' ones (which creates problems when bidi overrides are used for
Arabic text). Require Arabic joining to be done separately on each bidi
level run, so that characters in different bidi levels do not join.
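
The per-run requirement in item 4 might be sketched as below. The sketch
assumes the embedding levels have already been computed by some bidi
implementation and are passed in as a list of integers, one per
character; 'level_runs' is a hypothetical helper, and the joining step
itself is left out.

```python
from itertools import groupby

def level_runs(levels):
    """Split a list of bidi levels into (start, end, level) runs.

    'end' is exclusive. Joining would then be computed inside each run
    only, so characters in different runs never join.
    """
    runs = []
    start = 0
    for level, group in groupby(levels):
        length = len(list(group))
        runs.append((start, start + length, level))
        start += length
    return runs

print(level_runs([1, 1, 2, 2, 2, 1]))
```

Each run can then be handed to the joining logic on its own, with the
run boundaries treated as non-joining edges.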

5. Specify that some Bidi Boundary Neutral characters, including ZERO
WIDTH JOINER and ZERO WIDTH NON-JOINER, should be retained for Arabic
joining. (Unicode should also change their Bidi class from BN to something
else, or refine the definition of BNs in the Bidi specification, so that
all implementations treat character sequences like
	Meem ZWNJ ZWJ Noon
in the same way. Currently, they are allowed to render the above sequence as either
	Noon[isolated] Meem[initial]
or
	Noon[final]    Meem[isolated]
visually.)
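
The concern in item 5 can be illustrated with a small sketch. It assumes
a hypothetical implementation that discards every Boundary Neutral
character before joining; 'strip_boundary_neutrals' is an illustrative
name, and the bidi types come from Python's 'unicodedata' module.

```python
import unicodedata

MEEM, NOON = "\u0645", "\u0646"
ZWNJ, ZWJ = "\u200C", "\u200D"

def strip_boundary_neutrals(text):
    """Drop every character whose bidi type is BN (illustrative only)."""
    return "".join(ch for ch in text if unicodedata.bidirectional(ch) != "BN")

sequence = MEEM + ZWNJ + ZWJ + NOON
stripped = strip_boundary_neutrals(sequence)

print([unicodedata.bidirectional(ch) for ch in sequence])
print(["U+%04X" % ord(ch) for ch in stripped])
```

Once both controls are removed, only Meem and Noon remain, and the
joining behaviour the author asked for can no longer be recovered.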

6. Rename the "ALEF.LAM" ligatures to LAM-ALEF ligatures. Nobody calls
them ALEF.LAM ligatures in practice, nor does the name help developers
reason about Arabic ligatures.