L2/01-214

From: Roozbeh Pournader [roozbeh@sharif.edu]
Sent: Saturday, May 19, 2001 2:52 PM

Proposal for Clarification of Arabic Cursive Joining Behaviour

Due to existing ambiguities in Arabic joining and some problems created by adding canonical decompositions for some of the Arabic characters in Unicode 3.0, I propose these additions to Chapter 8.2, Section "Cursive Joining" of the standard.

1. Specify that canonical decompositions are NOT transparent with regard to Arabic joining. Require the character stream to be in "Normalization Form C" before Arabic joining is done, or require implementations to act as if it is. (The problem is due to the character U+06C0, ARABIC LETTER HEH WITH YEH ABOVE, a right-joining character which decomposes to U+06D5 U+0654, where the first character is a non-joining one. Using NFD instead of NFC makes the existing text problematic.)

2. Clarify which characters not in the Arabic block fall into the Non-joining class and which into the Transparent class, based on their General Category. I recommend these:

   Transparent: Mark
   Non-joining: Letter, Number, Separator, Punctuation, Symbol

   For the "Other, control" and "Other, format" characters, list their Arabic joining class explicitly. If the character is in the following list, it should be considered in the specified class:

   Join-causing:
      U+200D ZERO WIDTH JOINER

   Non-joining:
      U+200C ZERO WIDTH NON-JOINER
      U+202A LEFT-TO-RIGHT EMBEDDING
      U+202B RIGHT-TO-LEFT EMBEDDING
      U+202C POP DIRECTIONAL FORMATTING
      U+202D LEFT-TO-RIGHT OVERRIDE
      U+202E RIGHT-TO-LEFT OVERRIDE
      U+206A INHIBIT SYMMETRIC SWAPPING
      U+206B ACTIVATE SYMMETRIC SWAPPING
      U+206C INHIBIT ARABIC FORM SHAPING
      U+206D ACTIVATE ARABIC FORM SHAPING
      U+206E NATIONAL DIGIT SHAPES
      U+206F NOMINAL DIGIT SHAPES
      U+FEFF ZERO WIDTH NO-BREAK SPACE

   Otherwise, one should look at the Bidirectional Character Type: if it is B (Paragraph Separator), S (Segment Separator), or WS (Whitespace), the character must be considered Non-joining; otherwise, it is Transparent.

3. Make the Arabic joining data automatically computable from the Unicode data files. Include all the characters in the Arabic block in the 'ArabicShaping.txt' file, or at least all the letters (U+0621 ARABIC LETTER HAMZA is missing). Also, include the characters in the general categories "Other, control" and "Other, format" in a separate section of that file.

4. Require Arabic joining to be done after the bidi character levels of the text have been determined. That information should then be used to find the left and right characters for Arabic joining. The current specification allows doing joining before bidi, using the 'preceding' and 'following' characters instead of the 'right' and 'left' ones (which will create problems when bidi overrides are used for Arabic text). Require Arabic joining to be done separately on each bidi level run, so that characters in different bidi levels do not join.

5. Specify that some Bidi Boundary Neutral characters, including ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER, should be retained for Arabic joining. (Unicode should also change their Bidi class from BN to something else, or refine the definition of BNs in the Bidi specification, so that implementations treat character sequences like Meem ZWNJ ZWJ Noon the same way. Currently, they are allowed to render the above sequence either as Noon[isolated] Meem[initial] or as Noon[final] Meem[isolated] visually.)

6. Rename the "ALEF.LAM" ligatures to LAM-ALEF ligatures. Nobody calls them ALEF.LAM ligatures in practice, nor does the name help developers reason about Arabic ligatures.
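
As an illustration only, items 1 and 2 above can be sketched in Python using the standard unicodedata module. The function names (prepare_for_joining, proposed_joining_class) and the class labels are hypothetical and do not come from the standard. The module reports properties for whatever Unicode version ships with the interpreter, not necessarily Unicode 3.0. The proposal does not say how to treat "Other" categories besides "Other, control" and "Other, format", so the sketch reports those as "unspecified".

```python
import unicodedata

# Explicit classes from item 2 of the proposal.
JOIN_CAUSING = {0x200D}  # ZERO WIDTH JOINER
EXPLICIT_NON_JOINING = {
    0x200C,                                  # ZERO WIDTH NON-JOINER
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,  # bidi embeddings and overrides
    0x206A, 0x206B, 0x206C, 0x206D, 0x206E, 0x206F,  # deprecated format controls
    0xFEFF,                                  # ZERO WIDTH NO-BREAK SPACE
}


def prepare_for_joining(text):
    """Item 1: put the text in NFC before computing joining behaviour."""
    return unicodedata.normalize("NFC", text)


def proposed_joining_class(ch):
    """Item 2: joining class for a character outside the Arabic block.

    Characters in the Arabic block are expected to come from
    ArabicShaping.txt instead, so they are rejected here.
    """
    cp = ord(ch)
    if 0x0600 <= cp <= 0x06FF:
        raise ValueError("Arabic block: look the character up in ArabicShaping.txt")
    if cp in JOIN_CAUSING:
        return "join-causing"
    if cp in EXPLICIT_NON_JOINING:
        return "non-joining"

    category = unicodedata.category(ch)  # two-letter General Category, e.g. "Mn"
    if category[0] == "M":               # Mark
        return "transparent"
    if category[0] in "LNZPS":           # Letter, Number, Separator, Punctuation, Symbol
        return "non-joining"
    if category in ("Cc", "Cf"):         # Other, control / Other, format
        if unicodedata.bidirectional(ch) in ("B", "S", "WS"):
            return "non-joining"
        return "transparent"
    # Assumption: the proposal is silent on Cs, Co and Cn.
    return "unspecified"
```

With this sketch, prepare_for_joining("\u06D5\u0654") yields the single character U+06C0, so the joining logic sees the right-joining letter rather than a non-joining base followed by a mark. A combining mark such as U+0301 is classified as transparent, while a Latin letter or a tab character is classified as non-joining.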