3.11 Bidirectional Behavior

NOTE: This page has been superseded by the publication of Unicode 3.0, q.v.

The following are corrigenda to the Unicode Bidirectional Behavior Algorithm in the Unicode Standard, Version 2.0, pages 3-14 through 3-23.

Basic Display Algorithm

The description of the scope of the algorithm as being within a block needs clarification. The description also does not make clear how CR and LF are to be handled on those systems that use them.

Corrigendum

p3-16. At the end of the paragraph before the first bullet, add:

"The algorithm only reorders text within a block; characters on one side of a block separator have no effect on characters on the other side. (Also see Section 4.3, Directionality, on the handling of CR, LF, and CRLF.)"

Bidirectional Character Types

The following (together with a change to Reordering Resolved Levels) clarifies how to implement the last paragraph of page 3-16.

Corrigendum

p3-17. Before Table 3-5, add:

"Combining marks are given the type of the preceding letter."

p4-11. After "where there are gaps.", add:

"Combining marks are given the type of the preceding letter, and are not called out in this table either."

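The rule added above can be sketched in a few lines. This is only an illustration, not text from the standard: the function name assign_mark_types and the list-of-types representation are hypothetical, and combining marks are identified here by their General Category (Mn, Mc, Me) through Python's unicodedata module, which is an assumption about how an implementation would detect them.

```python
import unicodedata


def assign_mark_types(text, base_types):
    """Give each combining mark the bidi type of the preceding character.

    base_types[i] is the (hypothetical) type already looked up for text[i];
    entries for combining marks are overwritten. A mark at the very start
    of the text has no preceding character and is left unchanged.
    """
    resolved = list(base_types)
    for i, ch in enumerate(text):
        if i > 0 and unicodedata.category(ch).startswith("M"):
            # resolved[i - 1] is already final, so stacked marks all
            # inherit the type of the same base character.
            resolved[i] = resolved[i - 1]
    return resolved
```

For a Latin letter with an acute accent followed by a Hebrew letter with a vowel point, the marks would take the types L and R respectively.
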
The Base Level

Several of the rules are incorrectly worded to say global direction, when the embedding direction is meant. The latter term is now defined more explicitly.

Corrigendum

p3-18, before "Explicit Levels and Directions", insert:

"The direction of the current embedding level (for a character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd."

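The definition above reduces to a parity test on the embedding level. A minimal sketch, with a hypothetical function name:

```python
def embedding_direction(level: int) -> str:
    """Return 'L' for an even embedding level and 'R' for an odd one."""
    if level < 0:
        raise ValueError("embedding level must be non-negative")
    return "L" if level % 2 == 0 else "R"
```

Rules that previously said "global" direction (N2, I1, I2) would consult this value for the character in question rather than the direction of the whole block.
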
Terminating Embeddings and Overrides

T6 removes the implicit directional codes, RLM and LRM. However, these codes have an effect only if they are processed by P0..P3, N1..N4, and I1..I2, so removing them at this point is clearly a mistake in the wording and conflicts with other statements about the use and effect of RLM and LRM. The explicit codes are used in N4, and so cannot be removed either. The intent of the rule was to allow the use of styles or stylesheets instead of embedding or override codes (see p 3-22), so that all rules could be implemented purely in terms of levels from this point on in the algorithm. This is handled by a change to N4 below.

Corrigendum

p3-19, T6.

Delete T6.

Resolving Weak Types

P1 and P2 are unclear about how they apply. Comparing their wording with that of N1 and N2, a reader would conclude that they apply to single characters.

In addition, P1 is ambiguous as to what happens if the numbers are of different types, and P2 is also subject to a serial-parallel ambiguity: do you serially apply the rules, and take the new types into account as you transform successive characters, or do you apply the rules to all the cases in parallel, only taking the original types into account?

Corrigendum

p3-19. P1

Change to "P1. A single European separator between two European numbers changes to a European number. A single common separator between two numbers of the same type changes to that type."
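
The reworded P1 can be illustrated as follows. This is a sketch under stated assumptions: the function name apply_p1 is hypothetical, the input is a list of type abbreviations (EN, AN, ES, CS, and so on), and the neighbours are read from the original types. For P1 the serial and parallel readings give the same result, because a separator only changes when both of its neighbours are numbers.

```python
def apply_p1(types):
    """Resolve single separators that sit between two numbers (rule P1)."""
    out = list(types)
    for i in range(1, len(types) - 1):
        before, cur, after = types[i - 1], types[i], types[i + 1]
        if cur == "ES" and before == "EN" and after == "EN":
            # European separator between two European numbers.
            out[i] = "EN"
        elif cur == "CS" and before == after and before in ("EN", "AN"):
            # Common separator between two numbers of the same type.
            out[i] = before
    return out
```

A common separator between numbers of different types (EN, CS, AN) is left alone, which resolves the ambiguity noted above.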

p3-19. P2

Change to "P2. A sequence of European terminators adjacent to European numbers changes to all European numbers.

ET, ET, EN -> EN, EN, EN
EN, ET, ET -> EN, EN, EN
AN, ET, EN -> AN, EN, EN"
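
The three examples above can be reproduced by treating each maximal run of European terminators as a unit, which sidesteps the serial-versus-parallel question. The sketch below assumes a list of type abbreviations as input; the name apply_p2 is hypothetical.

```python
def apply_p2(types):
    """Change runs of ET adjacent to an EN into EN (rule P2)."""
    out = list(types)
    n = len(out)
    i = 0
    while i < n:
        if out[i] != "ET":
            i += 1
            continue
        # Find the end of this maximal run of European terminators.
        j = i
        while j < n and out[j] == "ET":
            j += 1
        # The run is out[i:j]; check the character on each side of it.
        left_is_en = i > 0 and out[i - 1] == "EN"
        right_is_en = j < n and out[j] == "EN"
        if left_is_en or right_is_en:
            for k in range(i, j):
                out[k] = "EN"
        i = j
    return out
```

A run that touches only an Arabic number, such as ET, AN, is left unchanged by this step.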

p3-19. P3

Add an example at the end:

"ET, AN -> N, AN"

Resolving Neutral Types (1)

The wording in N2 may also lead an implementor to mistakenly use the base direction instead of the embedding direction. (This also occurs in I1, I2.)

N3 uses the confusing term "letter", and may lead an implementor to mistakenly exclude strong R punctuation.

Corrigendum

p3-19. N2.

Replace "global" by "embedding"

p3-20. N3.

Change "letter" to "character" everywhere.

Resolving Neutral Types (2)

N4 needs some significant clarification. It is stated in terms of embedding codes, which had been removed in T6. However, restoring those codes makes the rest of the processing rules more difficult. It may be unclear how code is to react when it runs into an LRO from the left (inside the embedding).

Moreover, N4 is in the wrong order in the flow of the algorithm, which makes the scope unclear. The wording can also be simplified, since no other rules are affected by characters on the other side of a strong (R or L) character.

Corrigendum

p3-19,20. Move N4 to where T6 was. Change the number to T6, and change the wording and examples to:

"T6. In the following rules, an embedding or override code and its matching PDF act as if they were strong characters of the appropriate type. All unmatched PDFs are ignored. If two embeddings with the same level are adjacent, then the PDF terminating the first embedding and the code initiating the next embedding are ignored.

LRO ... PDF -> L ... L
LRE ... PDF -> L ... L
RLO ... PDF -> R ... R
RLE ... PDF -> R ... R
RLE ... PDF, RLO ... PDF -> RLE ..., ... PDF"
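
The first four examples above can be sketched with a simple stack. This is an illustrative reading, not a full implementation: the function name codes_as_strong and the token representation (strings such as "LRE" or "PDF", with anything else passed through) are hypothetical, an ignored PDF is represented by None, and the merging of adjacent embeddings at the same level (the last example) is deliberately left out.

```python
# Strong type that each explicit code stands in for.
STRONG_TYPE = {"LRE": "L", "LRO": "L", "RLE": "R", "RLO": "R"}


def codes_as_strong(tokens):
    """Replace explicit codes and their matching PDFs with L or R.

    Unmatched PDFs become None, meaning "ignored". Other tokens are
    returned unchanged.
    """
    out = []
    open_codes = []  # strong types of the currently open embeddings
    for tok in tokens:
        if tok in STRONG_TYPE:
            strong = STRONG_TYPE[tok]
            open_codes.append(strong)
            out.append(strong)
        elif tok == "PDF":
            # A matching PDF takes the type of the code it terminates.
            out.append(open_codes.pop() if open_codes else None)
        else:
            out.append(tok)
    return out
```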

Resolving Implicit Levels

I1 and I2 have inaccurate wording, and may lead an implementor to mistakenly use the base direction instead of the embedding direction. (This also occurs in N2.)

Also, although Table 3-7 refers to Sequence Type, the wording does not make it clear that the rules apply to sequences. This is only important in the case of EN, since the others are independent of intermediate characters.

Corrigendum

p3-20. I1.

Replace "global" by "embedding"

Replace "Numeric text (EN) goes up two levels unless preceded by left-to-right text." by
"A sequence of one or more numeric types (EN) goes up two levels unless immediately preceded by left-to-right text."

Change the example from "(L) EN" to "(L) EN...EN"

Reordering Resolved Levels

L1 may be misleading, since there can be only a single block separator. Also, the preceding sentence about "per-line" is rather brief.

Corrigendum

p3-20. L1

Add to the end of the paragraph before L1:

"The process of breaking a paragraph into one or more lines that fit within particular bounds is outside the scope of the bidirectional algorithm. Where character shaping is involved, it can be somewhat more complicated (see pages 6-22 through 6-32). Logically there are the following steps:

The levels of the text are determined according to the bidi algorithm.
The characters are shaped into glyphs according to their context (taking the embedding levels into account).
The accumulated widths of those glyphs (in logical order) are used to determine line breaks.
The glyphs on each line are then separately reordered according to the rules L1 and L2 below."

Change in L1, "trailing white space (including block separators)" to "any trailing white space characters (including those of type B, S, and WS)".

Add after L1, "(Note: since a Block separator breaks lines, there will be at most one per line.)"
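
The last of the four logical steps, the per-line reordering, can be sketched as below. The wording of L2 is not quoted on this page, so the sketch assumes the usual formulation: from the highest level on the line down to the lowest odd level, reverse every contiguous run of items at that level or higher. The function name reorder_line is hypothetical, and the L1 adjustment of trailing white space is assumed to have been applied to the levels already.

```python
def reorder_line(items, levels):
    """Return the items of one line in display order.

    items  -- glyphs (or characters) of a single line, in logical order
    levels -- resolved embedding level of each item, same length
    """
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return list(items)  # purely left-to-right line: nothing to do
    order = list(range(len(items)))
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] < level:
                i += 1
                continue
            # Extend the run of items at this level or higher.
            j = i
            while j < len(order) and levels[order[j]] >= level:
                j += 1
            order[i:j] = order[i:j][::-1]
            i = j
    return [items[k] for k in order]
```

Because the function works on one line at a time, it is applied only after line breaks have been determined from the widths in logical order.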

Before "Bidirectional Conformance", add:

"Combining marks applied to a right-to-left base character will at this point precede their base character. See Section 5.12 Rendering Non-Spacing Marks for an illustration of this. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character will need to be reversed."
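
The reversal described above might look like the following sketch. It is an illustration only: the function name marks_after_base is hypothetical, the input is assumed to be one line already in display order with its levels listed in that same order, and marks are detected by General Category through Python's unicodedata module.

```python
import unicodedata


def is_mark(ch):
    """True for characters whose General Category is Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")


def marks_after_base(visual, levels):
    """Move marks that precede a right-to-left base so they follow it."""
    out = list(visual)
    i = 0
    while i < len(out):
        if levels[i] % 2 == 1 and is_mark(out[i]):
            # Collect the run of marks at odd levels.
            j = i
            while j < len(out) and levels[j] % 2 == 1 and is_mark(out[j]):
                j += 1
            if j < len(out) and levels[j] % 2 == 1:
                # out[j] is the base; reverse marks plus base together.
                out[i:j + 1] = out[i:j + 1][::-1]
                i = j + 1
            else:
                i = j  # no right-to-left base follows; leave as is
        else:
            i += 1
    return out
```

Left-to-right text is untouched, since its marks already follow their base characters after reordering.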