Proposed BIDI Changes

L2/09-179
Subject: Proposed Bidi Issues
Date: 2009-05-01
From: Mark Davis
To: UTC


There are three issues that have come up regarding the BIDI algorithm. In these cases, the language in the specification is not completely clear, or the rules that are provided differ from the textual gloss on them.

FYI: I posted an online demo of the bidi algorithm at http://unicode.org/cldr/utility/bidi.jsp. This demo can be used to see which rules are invoked at which points, and the resulting reordering.

The following reflects the consensus among the bidi@unicode.org participants on these three open issues in the UBA, with some additional editorial changes.

X9: http://www.unicode.org/reports/tr9/tr9-20.html#X9


Option 3. (new) Clarify that all rules apply to current types; add BN to the list of types in X6.

Note: Having any characters of class BN survive rule X9 is simply an unintended consequence of not having drafted rule X6 carefully enough. It is more consistent if these special characters either all survive or none do.

At this point, we appear to have rough consensus on this option.

Suggested Text Changes:
(For clarification, see Editorial changes below.)

X6. For all types besides RLE, LRE, RLO, LRO, and PDF:
=>
X6. For all types besides RLE, LRE, RLO, LRO, PDF, and BN:
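As an illustration only, the effect of the X6 change can be sketched as below. The function name apply_x6 and the skip_bn flag are hypothetical, introduced here to contrast the old and new wording; the two steps inside follow the existing X6 text (set the level to the current embedding level, then apply any directional override).

```python
# Explicit formatting types that X6 already skips.
EXPLICIT = {"RLE", "LRE", "RLO", "LRO", "PDF"}


def apply_x6(bidi_type, current_level, override, skip_bn=True):
    """Sketch of X6 for a single character.

    override is None (neutral override status), "L", or "R".
    Returns (level, type), or None when X6 does not apply to this type.
    skip_bn=True models the proposed wording; False models the old wording.
    """
    excluded = (EXPLICIT | {"BN"}) if skip_bn else EXPLICIT
    if bidi_type in excluded:
        return None
    # X6a: the character takes the current embedding level.
    # X6b: under an active override, its type is reset to the override.
    new_type = override if override is not None else bidi_type
    return current_level, new_type
```

Under the old wording a BN inside an RLO override is retyped to R, so a later rule that looks at current types no longer sees it as BN; under the proposed wording it keeps its type.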

N1. http://www.unicode.org/reports/tr9/tr9-20.html#N1


Option 1. This seems to be what a majority of current implementations do. It is not the technical solution we would choose with a blank slate, but we are constrained by compatibility issues.

At this point, we appear to have rough consensus on this option.

Suggested Text Changes:

R  N  R  → R  R  R

L  N  L  → L  L  L

R  N  AN → R  R  AN

AN N  R  → AN R  R

R  N  EN → R  R  EN

EN N  R  → EN R  R
=>
 L  N   L  →   L  L   L

 R  N   R  →   R  R   R

 R  N  AN  →   R  R  AN

 R  N  EN  →   R  R  EN

AN  N   R  →  AN  R   R

AN  N  AN  →  AN  R  AN

AN  N  EN  →  AN  R  EN

EN  N   R  →  EN  R   R

EN  N  AN  →  EN  R  AN

EN  N  EN  →  EN  R  EN
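The expanded table amounts to treating EN and AN as R on either side of a run of neutrals. A minimal sketch of that reading follows; the names resolve_n1 and strong_direction are hypothetical, and the input is assumed to be the types of one level run after the W rules, with sor and eor given as "L" or "R".

```python
NEUTRALS = {"B", "S", "WS", "ON"}


def strong_direction(bidi_type):
    """Direction a type contributes for N1; EN and AN act as R."""
    if bidi_type == "L":
        return "L"
    if bidi_type in ("R", "EN", "AN"):
        return "R"
    return None


def resolve_n1(types, sor, eor):
    """Resolve runs of neutrals whose two sides agree in direction."""
    result = list(types)
    n = len(result)
    i = 0
    while i < n:
        if result[i] not in NEUTRALS:
            i += 1
            continue
        start = i
        while i < n and result[i] in NEUTRALS:
            i += 1
        # Use sor/eor when the neutral run touches a run boundary.
        before = strong_direction(result[start - 1]) if start > 0 else sor
        after = strong_direction(result[i]) if i < n else eor
        if before is not None and before == after:
            for j in range(start, i):
                result[j] = before
    return result
```

Neutrals whose two sides disagree are left untouched here; they would be handled by N2.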

HL6. http://www.unicode.org/reports/tr9/tr9-20.html#HL6


We were not able to reach a consensus on this issue. Here are some of the options that have been proposed:
  1. Status Quo. Leave L4/HL6 as they are. All and only characters with resolved R can be mirrored.
  2. Discourage Only. Change HL6 to "strongly discourage", but allow mirroring of anything.
  3. Allow if Directionality Changes. Change HL6 to only allow mirroring if the default directionality changes (either R/AL/N* => L, or anything else => R).
    1. Would need a new property to define the N* set, e.g. Default_Right_Directional, and a definition of its contents.
  4. All but specified. Change HL6 to allow all but certain characters to be mirrored (regardless of whether directionality changed).
    1. Would need a new property to define this set, e.g. Bidi_Never_Mirrored, and a definition of its contents.
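For reference, the default behavior that option 1 would leave in place can be sketched as follows. This is only an illustration of rule L4 (resolved direction R and Bidi_Mirrored true), not of any of the proposed alternatives; the function name is hypothetical, and Python's unicodedata module is used as a stand-in for the property data.

```python
import unicodedata


def is_mirrored_by_default(ch, resolved_level):
    """Sketch of L4: mirror only if resolved R and Bidi_Mirrored is true.

    An odd resolved embedding level means the character resolved to R.
    """
    resolved_r = resolved_level % 2 == 1
    return resolved_r and unicodedata.mirrored(ch) == 1
```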

Editorial

1. In editing, I noticed that we don't explicitly associate the Unicode property names, such as Bidi_Class, with the older terms such as BD1.

BD1. The bidirectional character types are values assigned to each Unicode character, including unassigned characters.

We should use the formal property names listed below, and explicitly equate the older terms to these. I suggest the exact wording be left to the Editorial Subcommittee.
  • Bidi_Mirroring_Glyph
  • Bidi_Class
  • Bidi_Control
  • Bidi_Mirrored
2. To clarify the phases, we need to make some changes at the start of Section 3 Basic Display Algorithm.

We discussed this in the bidi subcommittee, and the following reflects some of the discussion, but we didn't come to consensus on the text.

Old:

The Bidirectional Algorithm takes a stream of text as input and proceeds in three main phases:

  • Separation of the input text into paragraphs. The rest of the algorithm affects only the text between paragraph separators.
  • Resolution of the embedding levels of the text. In this phase, the directional character types, plus the explicit format codes, are used to produce resolved embedding levels.
  • Reordering the text for display on a line-by-line basis using the resolved embedding levels, once the text has been broken into lines.

New:

The Unicode Bidirectional Algorithm takes a stream of text as input and proceeds in four main phases:

  • Separation into paragraphs. The rest of the algorithm is applied separately to the text within each paragraph.
  • Initialization. A list of directional character types is initialized, with one entry for each character in the original text. The value of each entry is the Bidi_Class property of the respective character. After this point, the original characters are no longer referenced until the reordering phase. A list of embedding levels, with one level per character, is then initialized.
  • Resolution of the embedding levels. A series of rules is applied to the lists of embedding levels and directional character types. Each rule is based on the current values of those lists, and can modify those values. Each rule is applied to each of the values in sequence before continuing to the next rule. The result of this phase is a modified list of embedding levels; the list of directional character types is no longer needed.
  • Reordering. The text within each paragraph is reordered for display. Once the text in the paragraph is broken into lines, the resolved embedding levels are used to reorder the text of each line for display.
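The first two phases can be sketched roughly as below. The function names are hypothetical. Python's unicodedata.bidirectional is used for Bidi_Class, and it returns an empty string for code points with no value in its data, so it does not cover unassigned characters the way BD1 requires. Seeding every level with the paragraph level (found per P2 and P3) is an assumption made here for illustration; the proposed text does not say what the initial values are.

```python
import unicodedata


def split_paragraphs(text):
    """Phase 1: split at paragraph separators (Bidi_Class B).

    Each separator stays with the paragraph it terminates.
    """
    paragraphs, current = [], []
    for ch in text:
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs


def initialize(paragraph):
    """Phase 2: build the parallel lists of types and levels."""
    # One Bidi_Class entry per character; later rules work on this list,
    # not on the characters themselves.
    types = [unicodedata.bidirectional(ch) for ch in paragraph]
    # P2/P3: the first strong character (L, R, or AL) sets the paragraph level.
    base = 0
    for t in types:
        if t in ("L", "R", "AL"):
            base = 1 if t in ("R", "AL") else 0
            break
    # Assumption: every level starts at the paragraph level.
    levels = [base] * len(types)
    return types, levels
```

Keeping two parallel lists makes the point of the new wording concrete: every subsequent rule reads and writes these lists, and the original text is only consulted again when reordering.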

Note: I'm a bit concerned about drifting too far from the structure in