The following are corrigenda to the Unicode Bidirectional Behavior Algorithm in the Unicode Standard, Version 2.0, pages 3-14 through 3-23.
The description of the scope of the algorithm as being within a block needs clarification. The description also does not make clear how CR and LF are to be handled on those systems that use them.
p3-16. At the end of the paragraph before the first bullet, add:
"The algorithm only reorders text within a block; characters on one side of a block separator have no effect on characters on the other side. (Also see Section 4.3, Directionality, on the handling of CR, LF, and CRLF.)"
The following (together with a change to Reordering Resolved Levels) clarifies how to implement the last paragraph of page 3-16.
p3-17. Before Table 3-5, add:
"Combining marks are given the type of the preceding letter."
p4-11. After "where there are gaps.", add:
"Combining marks are given the type of the preceding letter, and are not called out in this table either."
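The rule for combining marks can be sketched as follows. This is only an illustration, not wording from the standard: the function name assign_mark_types and the base_type callback are hypothetical, marks are detected by Unicode general category (Mn or Me), which is an assumption, and a mark with no preceding character simply falls back to base_type.

```python
import unicodedata

def assign_mark_types(text, base_type):
    """Return one directional type per character in text.

    base_type is a caller-supplied (hypothetical) lookup from a
    character to its directional type. A combining mark copies the
    type already assigned to the character before it.
    """
    types = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and types:
            types.append(types[-1])      # inherit from the preceding character
        else:
            types.append(base_type(ch))  # ordinary lookup
    return types
```

Note that this sketch lets a mark inherit from whatever precedes it, including another mark that has already inherited its type.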
Several of the rules are incorrectly worded to say global direction, when the embedding direction is meant. The latter term is more explicitly defined.
p3-18, before "Explicit Levels and Directions", insert:
"The direction of the current embedding level (for a character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd."
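As a minimal sketch of this definition (the function name embedding_direction is illustrative and does not appear in the standard):

```python
def embedding_direction(level: int) -> str:
    """Return 'L' for an even embedding level and 'R' for an odd one."""
    if level < 0:
        raise ValueError("embedding level must be non-negative")
    return "L" if level % 2 == 0 else "R"
```

For example, level 0 and level 2 both give L, while level 1 gives R.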
T6 removes the implicit directional codes, RLM and LRM. However, these codes only have any effect if they are processed by P0..P3, N1..N4, and I1..I2. This is clearly a mistake in the wording, and is in conflict with other statements about the use and effect of RLM and LRM. The explicit codes are used in N4, and can't be removed either. The reason for this rule is to allow the use of styles or stylesheets instead of embedding or override codes (see p 3-22), so that all rules could be implemented purely in terms of levels from this point on in the algorithm. This is handled by a change to N4 below.
P1 and P2 are unclear in terms of their application. Comparing their wording with that of N1 and N2, a reader would conclude that they apply to single characters.
In addition, P1 is ambiguous as to what happens if the numbers are of different types, and P2 is also subject to a serial-parallel ambiguity: do you serially apply the rules, and take the new types into account as you transform successive characters, or do you apply the rules to all the cases in parallel, only taking the original types into account?
Change to "P1. A single European separator between two European numbers changes to a European number. A single common separator between two numbers of the same type changes to that type."
Change to "P2. A sequence of European terminators adjacent to European numbers changes to all European numbers.
ET, ET, EN -> EN, EN, EN
EN, ET, ET -> EN, EN, EN
AN, ET, EN -> AN, EN, EN"
Add example at end.
"ET, AN -> N, AN"
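One possible reading of the reworded P1 and P2 is sketched below. It is an assumption-laden illustration, not the normative algorithm: the function names are hypothetical, types are plain strings (EN, AN, ES, CS, ET), each rule inspects only the types as they stood before that rule ran (the "parallel" reading), and P2 treats a whole run of ET as a unit. The conversion of ET to N shown in the last example above is not modeled here.

```python
def resolve_p1(types):
    """P1 sketch: a single separator between two numbers becomes a number."""
    out = list(types)
    for i in range(1, len(types) - 1):
        prev, cur, nxt = types[i - 1], types[i], types[i + 1]
        if cur == "ES" and prev == "EN" and nxt == "EN":
            out[i] = "EN"
        elif cur == "CS" and prev == nxt and prev in ("EN", "AN"):
            out[i] = prev
    return out

def resolve_p2(types):
    """P2 sketch: a run of ET adjacent to an EN becomes all EN."""
    out = list(types)
    n = len(types)
    i = 0
    while i < n:
        if types[i] != "ET":
            i += 1
            continue
        j = i
        while j < n and types[j] == "ET":
            j += 1                       # j is one past the end of the ET run
        before = types[i - 1] if i > 0 else None
        after = types[j] if j < n else None
        if before == "EN" or after == "EN":
            for k in range(i, j):
                out[k] = "EN"
        i = j
    return out
```

Applied to the examples given for P2, resolve_p2 turns ET, ET, EN into EN, EN, EN and AN, ET, EN into AN, EN, EN, and leaves ET, AN untouched.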
The wording in N2 may also lead an implementor to mistakenly use the base direction instead of the embedding direction. (This also occurs in I1, I2.)
N3 uses the confusing term "letter", and may lead an implementor to mistakenly exclude strong R punctuation.
Replace "global" by "embedding"
Change "letter" to "character" everywhere.
N4 needs some significant clarification. It is stated in terms of embedding codes, which had been removed in T6. However, restoring those codes makes the rest of the processing rules more difficult. It may be unclear how code is to react when it runs into an LRO from the left (inside the embedding).
Moreover, N4 is in the wrong order in the flow of the algorithm, which makes the scope unclear. The wording can also be simplified, since no other rules are affected by characters on the other side of a strong (R or L) character.
p3-19,20. Move N4 to where T6 was. Change the number to T6, and change the wording and examples to:
"T6. In the following rules, an embedding or override code and its matching PDF act as if they were strong characters of the appropriate type. All unmatched PDFs are ignored. If two embeddings with the same level are adjacent, then the PDF terminating the first embedding and the code initiating the next embedding are ignored.
LRO ... PDF -> L ... L
LRE ... PDF -> L ... L
RLO ... PDF -> R ... R
RLE ... PDF -> R ... R
RLE ... PDF, RLO ... PDF -> RLE ..., ... PDF"
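The first two sentences of the new T6 can be sketched as below. This is a simplified illustration under stated assumptions, not the full rule: the function name apply_t6 is hypothetical, "ignored" is modeled by dropping the unmatched PDF from the output, an embedding or override code with no closing PDF is still treated as strong, and the merging of adjacent embeddings at the same level (the last example above) is not modeled.

```python
# Strong type that each embedding or override code stands in for.
OPENERS = {"LRE": "L", "LRO": "L", "RLE": "R", "RLO": "R"}

def apply_t6(types):
    """Replace explicit codes and their matching PDFs with strong types."""
    out = []
    stack = []                        # strong types of the currently open codes
    for t in types:
        if t in OPENERS:
            strong = OPENERS[t]
            stack.append(strong)
            out.append(strong)
        elif t == "PDF":
            if stack:
                out.append(stack.pop())   # matching PDF takes the opener's type
            # an unmatched PDF is dropped (assumed meaning of "ignored")
        else:
            out.append(t)
    return out
```

For instance, RLE, N, PDF becomes R, N, R under this sketch.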
I1 and I2 have inaccurate wording, and may lead an implementor to mistakenly use the base direction instead of the embedding direction. (This also occurs in N2.)
Also, although Table 3-7 refers to Sequence Type, the wording does not make it clear that the rules apply to sequences. This is only important in the case of EN, since the others are independent of intermediate characters.
Replace "global" by "embedding"
Replace "Numeric text (EN) goes up two levels unless preceded by left-to-right text." by
Change the example from "(L) EN" to "(L) EN...EN"
L1 may be misleading, since there can be only a single block separator. Also, the previous sentence about "per-line" is rather brief.
Add to the end of the paragraph before L1:
"The process of breaking a paragraph into one or more lines that fit within particular bounds is outside the scope of the bidirectional algorithm. Where character shaping is involved, it can be somewhat more complicated (see pages 6-22 through 6-32). Logically there are the following steps:
Change in L1, "trailing white space (including block separators)" to "any trailing white space characters (including those of type B, S, and WS)".
Add after L1, "(Note: since a block separator breaks lines, there will be at most one per line.)"
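The run of characters that the reworded L1 refers to can be located as sketched below. The function name trailing_whitespace_start is hypothetical, and the sketch only finds the trailing run of types B, S, and WS on a line; what L1 then does with that run is as stated in the standard and is not reproduced here.

```python
TRAILING_TYPES = {"B", "S", "WS"}

def trailing_whitespace_start(line_types):
    """Return the index at which the trailing B/S/WS run of a line begins.

    If the line has no trailing white space, the result equals
    len(line_types).
    """
    i = len(line_types)
    while i > 0 and line_types[i - 1] in TRAILING_TYPES:
        i -= 1
    return i
```

White space in the middle of a line is not part of the run: for L, WS, R, WS, S, B the run starts at index 3.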
Before "Bidirectional Conformance", add:
"Combining marks applied to a right-to-left base character will at this point precede their base character. See Section 5.12 Rendering Non-Spacing Marks for an illustration of this. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character will need to be reversed."
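A rendering engine that expects marks to follow their base could restore that order along the lines of the sketch below. It is an illustration only: the function name marks_after_base is hypothetical, the input is assumed to be a single right-to-left run already in display order, and marks are detected by general category (Mn or Me), which is an assumption rather than something the text above specifies.

```python
import unicodedata

def marks_after_base(visual_run):
    """Move marks that precede a base character to follow it.

    The marks of each base are also put back into their original
    logical order relative to one another.
    """
    out = []
    pending = []                         # marks seen before their base
    for ch in visual_run:
        if unicodedata.category(ch) in ("Mn", "Me"):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(reversed(pending))
            pending.clear()
    out.extend(pending)                  # marks with no base are kept as-is
    return "".join(out)
```

For example, the logical sequence alef, qamats, bet is displayed after reversal as bet, qamats, alef; the sketch turns that into bet, alef, qamats.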