Re: Bidi Conformance
From: Mark Davis
Date: 2003-02-03

This is a draft revision that takes into account the discussion in the ad hoc bidi meeting during the UTC. It is being circulated among the bidi group for comments before presentation to the UTC.

In the wake of the last UTC meeting, we had come to the conclusion that it was more important to have uniformity in the application of the Bidi algorithm than to allow the overriding of specific characters. We had considered having a letter ballot, but the delay of 4.0.1 rendered that unnecessary.

The following is a suggested revision of 4.3 Higher-Level Protocols, that removes the general ability to override characters. It makes a few other changes as well:

  1. Is more explicit about the meaning of setting the paragraph direction and overriding number handling, and clarifies some of the text. (See 4.3 Higher-Level Protocols for comparison).
  2. Adds category #4 (based on suggested requirements from Martin Duerst and discussion in the meeting) allowing for reasonable display of XML.

4.3. Higher-Level Protocols

The following clauses are the only permissible ways for systems to apply higher-level protocols to the ordering of bidirectional text. Some of the clauses apply to segments of structured text. This refers to the situation where text is interpreted as being structured, whether with explicit markup such as XML or HTML, or internally structured such as in a word processor or spreadsheet. In such a case, a segment is a span of text that is distinguished in some way by the structure.

  1. Override P3, and set the paragraph embedding level explicitly
    A higher-level protocol may set the paragraph level explicitly, and ignore P3. This can be done on the basis of the context, such as on a table cell, paragraph, document, or system level.
  2. Override W2, and set EN or AN explicitly
    A higher-level process may reset characters of type EN to AN or vice versa, and ignore W2. For example, style sheet or markup information can be used within a span of text to override the setting of EN text to always be AN, or vice versa.
  3. Emulate directional overrides or embedding codes
    A higher-level protocol can impose a directional override or embedding on a segment of structured text. The behavior must always be defined by reference to what would happen if the equivalent explicit codes as defined in the algorithm were inserted into the text. For example, a style sheet or markup can set the embedding level on a span of text.
  4. Apply the bidi algorithm to segments
    The bidi algorithm can be applied independently to one or more segments of structured text. For example, when displaying a document consisting of textual data and visible markup in an editor, a higher-level process can handle syntactic elements in the markup separately from the textual data.

  5. Provide artificial context
    Text can be processed by the bidi algorithm as if it were preceded by a character of a given type, and/or followed by a character of a given type. This allows a piece of text that is extracted from a longer sequence of text to behave as it did in the larger context.

Clauses #1 and #3 are not logically necessary; they are covered by applications of clauses #4 and #5. However, they are included for clarity because they are more common operations.
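As an illustration of clause #1, the following Python sketch shows a simplified reading of rules P2 and P3 with an optional explicit override. The function name `paragraph_level` and its `override` parameter are hypothetical, introduced only for this example; the sketch looks at nothing but the first strong character and is not a full implementation of the algorithm.

```python
import unicodedata


def paragraph_level(text, override=None):
    """Return the paragraph embedding level: 0 (left-to-right) or 1 (right-to-left).

    `override` is a hypothetical stand-in for a level supplied by a
    higher-level protocol (clause #1). When it is given, P3 is ignored.
    """
    if override is not None:
        return override
    # Simplified P2: find the first character of type L, R, or AL.
    for ch in text:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type == "L":
            return 0
        if bidi_type in ("R", "AL"):
            # Simplified P3: a first strong R or AL gives level 1.
            return 1
    # No strong character found: default to level 0.
    return 0
```

For instance, a table cell whose structure marks it as right-to-left could call `paragraph_level(cell_text, override=1)` regardless of what the cell's first strong character is.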

As an example of the application of #4, suppose an XML document contains the following fragment. (Note: this is a simplified example for illustration: element names, attribute names, and attribute values could all be involved.)

  ARABICenglishARABIC<e1 type='ab'>ARABICenglish<e2 type='cd'>english

This can be analyzed as being 5 different segments:

  1. ARABICenglishARABIC

  2. <e1 type='ab'>

  3. ARABICenglish

  4. <e2 type='cd'>

  5. english

To make the XML file readable as source text, the display in an editor could order these segments all in a uniform direction (e.g. all left-to-right), and apply the bidi algorithm to each segment separately. It could also choose to order the element names, attribute names, and attribute values uniformly in the same direction (e.g. all left-to-right). For final display, the markup could be ignored, allowing all of the text (segments 1, 3, and 5) to be reordered together.
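The segmentation described above can be sketched in Python as follows. The fragment string is an assumption, reconstructed by concatenating the five listed segments, and the regular expression is a deliberately naive tag matcher rather than an XML parser. The names `MARKUP`, `split_segments`, and `text_for_final_display` are hypothetical.

```python
import re

# Naive pattern for a single tag; a real editor would use an XML tokenizer.
MARKUP = re.compile(r"(<[^>]*>)")


def split_segments(fragment):
    """Split a fragment into (kind, text) pairs, kind being 'text' or 'markup'."""
    parts = MARKUP.split(fragment)
    return [
        ("markup" if MARKUP.fullmatch(part) else "text", part)
        for part in parts
        if part  # drop empty strings produced by adjacent tags
    ]


def text_for_final_display(fragment):
    """Join only the text segments, as when the markup is ignored for display."""
    return "".join(part for kind, part in split_segments(fragment) if kind == "text")


# Assumed fragment: the five segments from the example, concatenated.
fragment = "ARABICenglishARABIC<e1 type='ab'>ARABICenglish<e2 type='cd'>english"
segments = split_segments(fragment)
```

A source-text editor would then lay out the entries of `segments` in one uniform direction and run the bidi algorithm on each entry on its own, while a final-display renderer would run it once over the result of `text_for_final_display(fragment)`.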

An IRI (international URI) can be analyzed as being structured text, and thus a higher-level protocol could apply clause #4 and order the segments in a uniform direction. However, the existence of this capability does not imply that in normal display this either should or should not be done.

When text using a higher-level protocol is to be converted to Unicode plain text, formatting codes should be inserted, for consistent appearance, to ensure that the order matches that of the higher-level protocol.
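A minimal sketch of such a conversion for a single span, assuming the higher-level protocol has recorded a direction and an embed-or-override choice for it. The function name `with_explicit_codes` and its parameters are hypothetical; the code points are the standard explicit embedding and override characters (LRE, RLE, LRO, RLO) and PDF.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE


def with_explicit_codes(text, direction, override=False):
    """Wrap `text` in the explicit codes equivalent to a markup-level setting.

    `direction` is 'ltr' or 'rtl' (hypothetical values for this sketch);
    `override=True` selects LRO/RLO instead of LRE/RLE.
    """
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    if override:
        opener = LRO if direction == "ltr" else RLO
    else:
        opener = LRE if direction == "ltr" else RLE
    # Every opening code is closed by PDF so the effect ends with the span.
    return opener + text + PDF
```

The same mapping is what clause #3 relies on: the effect of a markup-level embedding or override is defined as whatever the algorithm would do with these characters inserted around the span.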