L2/99-038

From: medavis2@us.ibm.com
Sent: Monday, February 01, 1999 7:56 PM
To: Winkler, Arnold F
Cc: book@unicode.org
Subject: BIDI: Agreed Actions
===============================================================================

Here is my list of normative items that were agreed upon at the meeting
today. A reminder: these are normative in that they will change the results
of a conformant implementation. Overall, the effects on the display by
conformant implementations should be quite small--basically edge conditions
are cleared up.

Items marked by * are actually editorial (meaning that they will improve the
statement of the algorithm, but not change results of conformant
implementations), but are difficult to disentangle from the normative
changes.

Other editorial changes such as separating out definitions, making sure the
same defined terms were used uniformly, renumbering rules for clarity,
adding more examples, etc. are not included here, since those do not need
decision by the UTC at the next meeting. We will discuss them once the UTC
has approved the following (earlier, schedule permitting).

These changes are deltas to TR#9 and Unicode Data 2.1.8. I will not
generally repeat the arguments for making the change, since those are in
previous notes.

============BIDI Character Properties============

A. New properties*
   a. BIDI properties AL, CM, LRO, RLO, LRE, RLE, PDF will be created*
   b. All characters with general category Me, Mn will be given BIDI
      property CM.*
   c. All characters of type R in the Arabic, Thana, Syriac ranges
      (0600-07BF, FB50-FDFF, FE70-FEFF) will be given BIDI property AL*
   d. The explicit embedding characters LRO, RLO, LRE, RLE, PDF will be
      given the corresponding property.*

B. Related Algorithm Changes
   a. Unassigned Hebrew characters (0590-05FF, FB1D-FB4F) will be given
      type R.*
   b. Unassigned Arabic, Thana, Syriac characters (0600-07BF, FB50-FDFF,
      FE70-FEFF) will be given type AL.*
   c. All other unassigned characters will be given type L.
   d.
      We will add notes that as characters are assigned, these values
      might change, and that private use characters can be assigned
      different values by a conformant implementation.*
   e. Rules referring to combining marks will refer instead to CM.
   f. Rules referring to characters in the Arabic Block will refer instead
      to AL.*

C. Reset the following individual characters to a new type (in parens):

   Cat  Type (New)  Code  Name
   Lm   ON   (L)    3005  IDEOGRAPHIC ITERATION MARK
   Lm   ON   (L)    3031  VERTICAL KANA REPEAT MARK
   Lm   ON   (L)    3032  VERTICAL KANA REPEAT WITH VOICED SOUND MARK
   Lm   ON   (L)    3033  VERTICAL KANA REPEAT MARK UPPER HALF
   Lm   ON   (L)    3034  VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF
   Lm   ON   (L)    3035  VERTICAL KANA REPEAT MARK LOWER HALF
   Lm   ON   (L)    FF9E  HALFWIDTH KATAKANA VOICED SOUND MARK
   Lm   ON   (L)    FF9F  HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
   Mc   ON   (L)    0F3E  TIBETAN SIGN YAR TSHES
   Mc   ON   (L)    0F3F  TIBETAN SIGN MAR TSHES
   Zs   CS   (WS)   2007  FIGURE SPACE
   Zs   WS   (CS)   00A0  NO-BREAK SPACE
   Zs   WS   (CS)   202F  NARROW NO-BREAK SPACE
   Zl   B    (WS)   2028  LINE SEPARATOR

D. Add "For the purpose of the BIDI algorithm, inline objects (such as
   graphics) are treated as if they are a U+FFFC OBJECT REPLACEMENT
   CHARACTER."

E. Fix Table 3-1 to correspond to the new data tables.

F. Fix Table 3-2 to add explanations, and change sot/eot to start of
   block/end of block. Make clearer that blocks are treated separately.*

============Algorithm============

G. Change the maximum embedding level set by explicit controls to 61
   (e.g. 6-bit limit).

H. At the end of rules E2a, E3a, O1a, O2a, set RLE => R, LRE => L,
   RLO => R, LRO => L, respectively.*

I. Change T6 to read as follows, and change the example as appropriate.

   "T6. If the character after a PDF is the same as the matching code for
   that PDF, set the PDF and that next character to BN.* Otherwise, if the
   PDF is immediately preceded by an embedding code, set that previous
   character and the PDF to BN."

J. Incorporate new types by changing the following rules. (P0a has a fix
   for ET).

   "C0.
   A sequence of CM is given the type of the preceding character; at the
   start of a block, they are given the type ON."*

   "P0. Search backwards from each instance of a European number until the
   first strong character (or block boundary) is found. If that first
   character is AL, change the type of the European number to Arabic
   number:"*

   "P0+. Change all ALs to R."*

   "P0a. Change any Boundary Neutrals adjacent to a European Number to a
   European Number; otherwise change any Boundary Neutrals adjacent to a
   European Terminator to a European Terminator; otherwise change any
   remaining Boundary Neutrals adjacent to an Arabic Number to an Arabic
   Number."

K. Change N3 to the following. The rules will be reordered or commented to
   make clear that N3 must be applied before N1.

   "N3. For the purpose of resolving neutrals,
   (a) European numbers are treated as though they were the type of the
       previous strong character. If this type is L, change the EN to L.
   (b) If there is no previous strong character, European numbers are
       treated as though they had the base direction. If this type is L,
       change the EN to L.
   (c) Arabic numbers are treated as though they were R.

   The following are examples

   R  N EN -> R  R EN
   L  N EN -> L  L EN
   EN N R  -> EN e R
   EN N L  -> EN e L
   R  N AN -> R  R AN
   L  N AN -> L  e AN
   AN N R  -> AN R R
   AN N L  -> AN e L"

L. Change I1 to drop the "unless" clause (handled by additions to N3a and
   N3b).*

   I1. If the embedding direction is even (left-to-right), then the
   right-to-left text goes up one level. Numeric text (AN) goes up two
   levels. A sequence of one or more numeric types (EN) goes up two
   levels.*

M. Change in 3.1 the following to add "should":

   The directional formatting codes are used only to influence the display
   ordering of text. In all other respects they are ignored--they should
   have no effect on the comparison of text, nor on word breaks, parsing,
   or numeric analysis.

N.
   In "Terminating Embeddings and Overrides", delete:

   "Higher level protocols may choose to interpret PDFs that occur when
   there is no pushed state. For example, a presentation engine may receive
   blocks of processed Unicode text divided into lines. If the complexity
   of the text is limited by the higher-level protocol, then PDF can be
   interpreted significantly."

O. In "Higher Level Protocols", change to something like:

   Override the number handling to use information provided by a broader
   context. For example, information from other paragraphs in a document
   could be used to conclude that the document was fundamentally Arabic,
   and that EN should generally be converted to AN.

P. There was one open issue which the committee was split on. That was the
   proposal to add a new rule just before C0 that would change the behavior
   of text on both sides of TAB characters. More research will be done
   before the UTC, since committee members were concerned about the impact,
   and had questions as to the benefit.

   TAB0: Set Segment characters (S) to the base embedding type (R or L).

   Example of effect:

   Memory:  abc_ABC?_DEF
   Display: abc_?CBA_FED (current)
   Display: abc_CBA?_FED (proposed)

R. We agreed to not accept any more normative changes.

Mark
___
Mark Davis, Program Director
IBM Centre for Java Technology SV
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014
(408) 777-5850 [fax: 5891]
medavis2@us.ibm.com
president@unicode.org
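
Illustration (not part of the agreed text): a minimal sketch, in Python, of
how the revised P0 and P0+ rules in item J might be applied to one block. It
assumes the block has already been reduced to a list of bidi type codes; the
function name apply_p0, the STRONG set, and the use of plain strings for
types are hypothetical choices made for this example, not anything defined
by TR#9.

```python
# Hypothetical helper: strong types as named in the items above.
STRONG = {"L", "R", "AL"}


def apply_p0(types):
    """Apply P0, then P0+, to a list of bidi type codes for one block."""
    out = list(types)
    for i, t in enumerate(types):
        if t != "EN":
            continue
        # P0: search backwards from the European number until the first
        # strong character (or the block boundary) is found.
        for j in range(i - 1, -1, -1):
            if types[j] in STRONG:
                # If that first strong character is AL, EN becomes AN.
                if types[j] == "AL":
                    out[i] = "AN"
                break
    # P0+: change all ALs to R.
    return ["R" if t == "AL" else t for t in out]
```

For example, apply_p0(["AL", "ON", "EN"]) yields ["R", "ON", "AN"], while a
European number that follows an R or L, or that has no preceding strong
character, is left as EN by this step.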