Bidi Issues

L2/09-073

Subject: Bidi Issues
Date: 2009-02-02
From: Mark Davis
To: UTC

There are three issues that have come up regarding the BIDI algorithm. In these cases, the language in the specification is not completely clear, or the rules that are provided differ from the textual gloss on them.

The first issue has to do with X9:

X9 Remove all RLE, LRE, RLO, LRO, PDF, and BN codes.

The issue was raised as to whether this applies to the characters on input to the BIDI algorithm, or the characters that have these classes at this point in the algorithm. The impact would be on characters that were BN, but became something else (L or R) as a result of rule X6.

In the entire BIDI algorithm, each rule applies to characters according to the bidi types they have after any transformations made by previous rules. In this particular case, if a BN character was changed by X6 then it would not be removed. My recommendation is that we have a PRI recommending this course of action.

The only characters affected by this would be BN characters, which are Control and Format characters minus specific characters: tab, newlines, some Arabic subtending marks (like U+0600 ( ؀ ) ARABIC NUMBER SIGN), bidi controls, and the Interlinear annotation characters. And these characters would only be affected when they are in BIDI overrides.

We do need to have a PRI, however, because if it turns out that it causes implementation difficulties to change it, we may want to revamp the way the rules are done. For example, we may want to remove the BN characters before X6.
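
The difference between the two readings of X9 can be illustrated with a small sketch. This is a simplified model, not the full X1-X8 logic: it tracks only the directional override status and ignores embedding levels and depth limits. The function names (apply_x6, x9_by_original, x9_by_current) are hypothetical and chosen only for this illustration.

```python
EXPLICIT = {"RLE", "LRE", "RLO", "LRO", "PDF"}

def apply_x6(types):
    """Simplified X2-X7: track only the override status, not levels."""
    stack = []      # one entry per open embedding or override
    resolved = []
    for t in types:
        if t in ("RLO", "LRO"):
            stack.append("R" if t == "RLO" else "L")
            resolved.append(t)
        elif t in ("RLE", "LRE"):
            stack.append(None)      # embedding: neutral override status
            resolved.append(t)
        elif t == "PDF":
            if stack:
                stack.pop()
            resolved.append(t)
        else:
            override = stack[-1] if stack else None
            # X6b: inside an override, the type is reset to the override
            resolved.append(override if override else t)
    return resolved

def x9_by_original(types):
    """Reading A: remove characters whose input class is explicit or BN."""
    resolved = apply_x6(types)
    return [r for t, r in zip(types, resolved)
            if t not in EXPLICIT and t != "BN"]

def x9_by_current(types):
    """Reading B: remove characters whose class at X9 is explicit or BN."""
    resolved = apply_x6(types)
    return [r for r in resolved if r not in EXPLICIT and r != "BN"]

sample = ["L", "RLO", "BN", "L", "PDF", "BN"]
print(x9_by_original(sample))   # the BN inside the override is removed
print(x9_by_current(sample))    # the BN inside the override survives as R
```

Under reading B, the BN inside the override has already become R by the time X9 runs, so it is kept; the BN outside the override is removed under both readings.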

The next two issues have to do with N1.

N1 A sequence of neutrals takes the direction of the surrounding strong text if the text on both sides has the same direction. European and Arabic numbers act as if they were R in terms of their influence on neutrals. Start-of-level-run (sor) and end-of-level-run (eor) are used at level run boundaries.

R  N  R  → R  R  R

L  N  L  → L  L  L

R  N  AN → R  R  AN

AN N  R  → AN R  R

R  N  EN → R  R  EN

EN N  R  → EN R  R

Note that any AN or EN remaining after W7 will be in an right-to-left context.
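
Read literally, the six patterns listed above amount to a lookup on the types on either side of a run of neutrals. The following is a minimal sketch under that reading, assuming the neutral types at this stage are B, S, WS, and ON and that the input is a single level run; the names N1_RULES and resolve_n1 are hypothetical.

```python
NEUTRALS = {"B", "S", "WS", "ON"}

# The six N1 patterns as listed: (left, right) -> type given to the neutrals
N1_RULES = {
    ("R", "R"): "R",
    ("L", "L"): "L",
    ("R", "AN"): "R",
    ("AN", "R"): "R",
    ("R", "EN"): "R",
    ("EN", "R"): "R",
}

def resolve_n1(types, sor, eor):
    """Apply the listed N1 patterns to one level run.

    sor and eor stand in for the missing neighbor at the run boundaries.
    Neutral runs that match no pattern are left for N2.
    """
    out = list(types)
    n = len(types)
    i = 0
    while i < n:
        if types[i] not in NEUTRALS:
            i += 1
            continue
        j = i
        while j < n and types[j] in NEUTRALS:
            j += 1                       # j ends just past the neutral run
        left = types[i - 1] if i > 0 else sor
        right = types[j] if j < n else eor
        direction = N1_RULES.get((left, right))
        if direction is not None:
            out[i:j] = [direction] * (j - i)
        i = j
    return out

print(resolve_n1(["R", "ON", "EN"], sor="L", eor="L"))
print(resolve_n1(["EN", "WS", "EN"], sor="L", eor="L"))
```

In this sketch "R N EN" resolves the neutral to R, while "EN N EN" matches none of the listed patterns and is left unresolved.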

Here is text from the original bug report (from Behdad):

Bug 1:

The text of the first paragraph says "European and Arabic numbers act as if
they were R in terms of their influence on neutrals." It is not clear what
this means. There are at least the following two possible interpretations:

* The text is trying to loosely describe the logic behind the six rules that
follow and should not be taken literally. In particular, the sequences "AN N
AN", "EN N EN", "AN N EN", and "EN N AN" are NOT processed as if AN and EN act
like an R. This is most probably what the rule was meant to be. The text
however is definitely wrong. My colleague's testings suggest that this is
what OS X implements.

* Before applying the 6 rules listed, temporarily convert any AN or EN type
to R, then proceed to apply the rules. This reading is what I implemented in
FriBidi years ago. I just checked and the Java reference implementation also
reads it like this. I didn't check the code but I'm fairly sure that the C++
reference implementation does the same. The problems with reading it like
this are numerous:

- It conflicts with the 6 rules listed as there will be no EN and AN
anymore and the rules should be simplified to only:

R N R → R R R
L N L → L L L

- The major problem with this approach however is that it can produce
strongly RTL characters in an otherwise LTR paragraph. This is inconsistent
with the following paragraph from Implementation Notes:

"""
One of the most effective optimizations is to first test for right-to-left
characters and not invoke the Bidirectional Algorithm unless they are present.
"""

Bug 2:

The last line in rule N1 reads: "Note that any AN or EN remaining after W7
will be in an right-to-left context." This is wrong as my example above
shows. The "L,AN" sequence reaches N1 fine and it's NOT in a "right-to-left
context", whatever that means. That sentence should plain be removed.

My analysis is that Bug 2 is uncontroversial, and we should just fix it.

Bug 1 is much more troublesome, and unlike the case with X9, a resolution either way will affect more characters. My initial response was:

The text is supposed to be a gloss on the rules, as Behdad says, and is in error since it is not meant to apply to EN N EN, and so on (the cases he lists). It was never supposed to affect text that had no R (or AL) characters in it. That is, the text should say something like EN or AN act like an R if there is an R on the other side of the N, matching the rules.

However, since people may have misinterpreted this, we need to gather evidence on how the most common implementations have interpreted the rule in shipping products. That evidence will help us reach a consensus on what fix the bidi group should recommend to the UTC.

So we'll have to look carefully at the situation. I'm out of town right now, and will have to do that for my part when I get back.
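
The two readings of N1 described in the bug report can be contrasted on the disputed sequences. This is only a sketch: the function n1_direction and the reading labels "rules" and "convert" are hypothetical names, and it looks only at the types on the two sides of a neutral run.

```python
NUMBERS = {"EN", "AN"}

def n1_direction(left, right, reading):
    """Return the type N1 gives a neutral run, or None if N1 does not apply.

    reading="rules":   a number acts like R only when an R is on the other
                       side of the neutrals (the six listed patterns).
    reading="convert": every EN or AN is treated as R before matching
                       (the literal reading of the opening sentence).
    """
    if reading == "convert":
        l = "R" if left in NUMBERS else left
        r = "R" if right in NUMBERS else right
        return l if l == r and l in ("L", "R") else None
    if left == right and left in ("L", "R"):
        return left
    if {left, right} in ({"R", "EN"}, {"R", "AN"}):
        return "R"
    return None

for pair in [("R", "EN"), ("EN", "EN"), ("AN", "EN"), ("L", "AN")]:
    print(pair, n1_direction(*pair, "rules"), n1_direction(*pair, "convert"))
```

The readings agree on "R N EN" and on "L N AN", but differ on "EN N EN" and "AN N EN": under the "convert" reading the neutrals become R even though no R or AL character is present.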

And Behdad summarized:

Resolution #1:

- Add rules for the missing cases (EN N EN, AN N EN, AN N AN, EN N AN) and
stick to the literal text of the first paragraph,

- The implementation note re optimization when there are no "right-to-left
characters" available should be reworded to "right-to-left or Arabic
characters (R, RLO, RLE, AL, AN)". That is, AN also triggers the case.

Resolution #2:

- Change the wording of the opening paragraph of N1 to what Mark suggests.
That is, something like "EN or AN act like an R if there is an R on the other
side of the N",

- The implementation note re optimization when there are no "right-to-left
characters" available should be clarified to "right-to-left characters (R,
RLO, RLE, AL)",

- Change the reference implementations.

In any case, bug #2 should be resolved as Simon suggests. That is, replacing
"any AN or EN remaining after W7 will be in an right-to-left context" with
"any EN ...".

As for implementation, fribidi takes the word of the paragraph literally so is
in line with resolution #1. But the Pango (the text layout used in GNOME)
adds a fast path before calling into FriBidi and the fast path only checks for
R, RLO, RLE, and AL. So my implementation is confused between the two.

I'll be looking for real-world test cases that may help decide one way or the
other.
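
The fast-path check mentioned above, and the way the two resolutions would change it, can be sketched as follows. This is an illustration only, not Pango's actual code: needs_bidi and the two set names are hypothetical, and the bidi classes come from Python's unicodedata module rather than from any of the implementations discussed.

```python
import unicodedata

# Trigger classes under the wording proposed for Resolution #2
RTL_STRICT = {"R", "RLO", "RLE", "AL"}
# Trigger classes under the wording proposed for Resolution #1 (AN added)
RTL_WITH_AN = RTL_STRICT | {"AN"}

def needs_bidi(text, trigger_types):
    """Return True if any character's bidi class is in trigger_types."""
    return any(unicodedata.bidirectional(ch) in trigger_types for ch in text)

latin_only = "abc 123"
with_arabic_digit = "a \u0661 b"      # U+0661 ARABIC-INDIC DIGIT ONE (AN)
with_hebrew = "a \u05d0 b"            # U+05D0 HEBREW LETTER ALEF (R)

for s in (latin_only, with_arabic_digit, with_hebrew):
    print(needs_bidi(s, RTL_STRICT), needs_bidi(s, RTL_WITH_AN))
```

Text containing an AN character but no R or AL character is the case where the two checks disagree, which is the mismatch between the fast path and the literal reading of N1 described above.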

I also suggested that we create a set of test cases to help with the assessment, and that we create a bidi conformance test file that people can use to check their implementations. I produced a draft of both of these, which proved useful in discussions, and I think we should include them in a PRI.

Test Files:

http://macchiato.com/utc/bidi/