L2/02-194

To:	UTC
Re:	BIDI subcommittee recommendations on Atkins paper and related issues
From:	Mark Davis
Date:	2002-05-02

(DRAFT!)

The following are recommendations for clarifying BIDI reordering and joining, and making two fixes to the process.

1. Clarify L2 in http://www.unicode.org/unicode/reports/tr9/#L2

Change L2 to:

L2. From the highest level found in the text to the lowest odd level on each line, including intermediate levels not actually present in the text, reverse any contiguous sequence of characters that are at that level or higher.
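
As an illustration only, the reworded rule can be sketched in Python. The function name reorder_line and its signature are hypothetical and are not taken from the reference code; the sketch assumes the resolved embedding levels for one line are already available, one per character.

```python
def reorder_line(chars, levels):
    """Apply rule L2 to one line, given one resolved level per character."""
    if len(chars) != len(levels):
        raise ValueError("chars and levels must have the same length")
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        # No odd level on this line, so nothing is reversed.
        return list(chars)
    # Keep each character together with its level while reversing.
    cells = list(zip(levels, chars))
    highest = max(levels)
    lowest_odd = min(odd_levels)
    # range() visits every intermediate level, present in the text or not.
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(cells):
            if cells[i][0] >= level:
                j = i
                while j < len(cells) and cells[j][0] >= level:
                    j += 1
                # Reverse the contiguous run at this level or higher.
                cells[i:j] = reversed(cells[i:j])
                i = j
            else:
                i += 1
    return [ch for _, ch in cells]


# Uppercase stands in for right-to-left letters.
print("".join(reorder_line("car means CAR.", [0] * 10 + [1] * 3 + [0])))
```

With levels 1, 3, 3, 1 the loop also visits level 2, which no character has. That is the case the phrase "including intermediate levels not actually present in the text" is meant to cover.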

2. Make the BIDI algorithm obey Canonical Equivalence

There are only a few edge cases in which canonically equivalent sequences behave differently. In each pair below, the precomposed character has BIDI class L, but its NFD form begins with a nonspacing mark (NSM):

U+0CC0 KANNADA VOWEL SIGN II: L
NFD: U+0CBF KANNADA VOWEL SIGN I, U+0CD5 KANNADA LENGTH MARK: NsmL
U+0CC7 KANNADA VOWEL SIGN EE: L
NFD: U+0CC6 KANNADA VOWEL SIGN E, U+0CD5 KANNADA LENGTH MARK: NsmL
U+0CC8 KANNADA VOWEL SIGN AI: L
NFD: U+0CC6 KANNADA VOWEL SIGN E, U+0CD6 KANNADA AI LENGTH MARK: NsmL
U+0CCA KANNADA VOWEL SIGN O: L
NFD: U+0CC6 KANNADA VOWEL SIGN E, U+0CC2 KANNADA VOWEL SIGN UU: NsmL
U+0CCB KANNADA VOWEL SIGN OO: L
NFD: U+0CC6 KANNADA VOWEL SIGN E, U+0CC2 KANNADA VOWEL SIGN UU, U+0CD5 KANNADA LENGTH MARK: NsmLL

To fix this, the proposal is to add a rule that requires the BIDI algorithm to behave as if the text were normalized to NFC. A note should clarify that this affects only a few characters (such as the above), so an optimized implementation does not need to normalize, as long as it handles those few exceptional characters properly.
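
The comparison above can be reproduced with Python's standard unicodedata module. This is a sketch, not part of the proposal: the helper names are hypothetical, and the classes reported depend on the Unicode version bundled with the Python build, so they may differ from the values listed above.

```python
import unicodedata


def bidi_classes(text):
    """Return the BIDI class of each character, in order."""
    return [unicodedata.bidirectional(ch) for ch in text]


def bidi_classes_as_if_nfc(text):
    """Classes as seen by an implementation that behaves as if text were NFC."""
    return bidi_classes(unicodedata.normalize("NFC", text))


# The five Kannada vowel signs from the list above.
for cp in (0x0CC0, 0x0CC7, 0x0CC8, 0x0CCA, 0x0CCB):
    composed = chr(cp)
    decomposed = unicodedata.normalize("NFD", composed)
    print(f"U+{cp:04X}", bidi_classes(composed), "NFD:", bidi_classes(decomposed))
```

Because bidi_classes_as_if_nfc normalizes first, the precomposed and decomposed spellings of these vowel signs yield the same class sequence, which is the behavior the proposed rule asks for.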

3. Joining/Nonjoining

Joining is done after the BIDI algorithm, yet the algorithm explicitly deletes ZWJ and ZWNJ (both have BIDI class BN) in X9:

X9. Remove all RLE, LRE, RLO, LRO, PDF, and BN codes.

(http://www.unicode.org/unicode/reports/tr9/#X9)

However, ZWJ and ZWNJ cannot simply be retained in the character stream, because their effect is on the adjacent characters (in the original backing-store order). If the adjacent characters were rearranged so that they were no longer adjacent, ZWJ and ZWNJ would affect the wrong characters. We could add something like the following text, or make it an explicit rule:

The Zero Width Joiner and Zero Width Non-Joiner have an effect on adjacent characters (in the original backing-store order), but those characters may end up being rearranged to be non-adjacent by the BIDI algorithm. Thus, in order to determine the joining behavior of a particular character after applying the BIDI algorithm, an implementation must refer back to the original backing store to see if there were adjacent ZWNJ or ZWJ characters.

The above text describes the logical process. An implementation could get the same results by replacing ZWJ and ZWNJ with an out-of-band character property associated with the adjacent characters, so that the information does not interfere with the BIDI algorithm and is preserved when those characters are rearranged. Once the BIDI algorithm has been applied, that out-of-band information can then be used for proper display.
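
One possible shape for such an out-of-band property is sketched below in Python. The Cell type, its field names, and extract_joiners are hypothetical and serve only as illustration. As a simplification, when several joiners occur in a row only the last one is recorded.

```python
from dataclasses import dataclass

ZWNJ = "\u200C"
ZWJ = "\u200D"


@dataclass
class Cell:
    char: str
    joiner_before: str = ""  # "", "ZWJ" or "ZWNJ" preceding in backing store
    joiner_after: str = ""   # "", "ZWJ" or "ZWNJ" following in backing store


def extract_joiners(text):
    """Drop ZWJ/ZWNJ from text, recording them on the neighboring characters."""
    cells = []
    pending = ""
    for ch in text:
        if ch in (ZWJ, ZWNJ):
            name = "ZWJ" if ch == ZWJ else "ZWNJ"
            if cells:
                cells[-1].joiner_after = name
            pending = name
        else:
            cells.append(Cell(ch, joiner_before=pending))
            pending = ""
    return cells


cells = extract_joiners("a" + ZWJ + "b" + ZWNJ + "c")
print(cells)
```

The flags travel with each Cell, so reordering the list of cells keeps the joining information attached to the characters it applied to in the backing store.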

We could do the same with the language tags, although note that the language tags are discouraged, so we have not put a lot of time into worrying about their interference with other features.

4. Joining Class T

Clarify the contents of class T. Its definition is currently given in a note in http://www.unicode.org/Public/UNIDATA/ArabicShaping.txt, and its members are explicitly listed in http://www.unicode.org/Public/UNIDATA/extracted/DerivedJoiningType.txt.

However, additional clarifying text should be added to the standard, since some people interpret T as only being Arabic non-spacing marks.
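
To make the breadth of class T concrete, here is a rough Python sketch of the derivation as the ArabicShaping.txt note is commonly read: code points not explicitly listed in that file are T if their General Category is Mn, Me, or Cf, and U otherwise. This reading is an assumption and should be checked against the data file. The function name and the explicit parameter are hypothetical.

```python
import unicodedata


def default_joining_type(ch, explicit=None):
    """Joining type for ch: the explicit entry if given, else the default.

    explicit is a hypothetical mapping for code points that the data file
    lists explicitly; it is not populated here.
    """
    explicit = explicit or {}
    if ch in explicit:
        return explicit[ch]
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return "T"  # transparent
    return "U"  # non-joining


# A non-Arabic combining mark is also transparent under this reading.
print(default_joining_type("\u0301"))  # U+0301 COMBINING ACUTE ACCENT
```

Under this reading T is not limited to Arabic non-spacing marks: it also covers marks from other scripts and format characters.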

5. Order of Joining and BIDI

The standard needs to make it clearer that joining is logically done after BIDI. An implementation could do it beforehand, but only if it produces the same results as if joining were done afterwards.

6. Test cases

Use a uniform ASCII binding for test cases in the C and Java reference code, and make this binding more explicit in documentation.
Add more explicit BIDI test cases.