L2/04-049

Re: Bidi Conformance
From: Mark Davis
Date: 2003-01-29

In the wake of the last UTC meeting, we had come to the conclusion that uniformity in the application of the Bidi algorithm is more important than allowing specific characters to be overridden. We had considered holding a letter ballot, but the delay of 4.0.1 rendered that unnecessary.

The following is a suggested revision of 4.3 Higher-Level Protocols, which removes the general ability to override characters. It makes a few other changes as well:

  1. Is more explicit about the meaning of setting the paragraph direction and overriding number handling, and clarifies some of the text. (See the current 4.3 Higher-Level Protocols for comparison.)
  2. Adds category #4 (based on suggested requirements from Martin Duerst) allowing for reasonable display of XML.

An open issue is whether we want IRIs (URLs) to fall within the scope of #4 or not.

Also, we have always had the little snippet "When text using a higher-level protocol is to be converted to Unicode plain text, formatting codes should be inserted to ensure that the order matches that of the higher-level protocol." There is one remaining circumstance where one cannot do that: if one wants to emulate #2 below, we don't have the formatting codes to do it.


4.3. Higher-Level Protocols

The following are permissible ways for systems to apply higher-level protocols to the ordering of bidirectional text.

  1. Override P3, and set the paragraph embedding level explicitly
    A higher-level protocol may set the paragraph level explicitly, and ignore P3. This can be done on the basis of the context, such as on a table cell, paragraph, document, or system level.
  2. Override W2, and set EN or AN explicitly
    A higher-level process may reset characters of type EN to AN or vice versa, and ignore W2. For example, style sheet or markup information can be used within a span of text to override the setting of EN text so that it is always AN, or vice versa.
  3. Emulate directional overrides or embedding codes
    Within a span of text, a higher-level protocol can impose a directional override or embedding. The behavior must always be defined by reference to what would happen if the equivalent explicit codes as defined in the algorithm were inserted into the text. For example, a style sheet or markup can set the embedding level on a span of text.
  4. Interpret spans of text separately in the presence of markup
    When displaying a document consisting of textual data and visible markup, a higher-level process can handle syntactic elements in the markup separately from the textual data. For example, suppose an XML document contains the following fragment:

    ARABICenglishARABIC<e1 type='ab'>ARABIC2<e2 type='cd'>english

    One can consider this as consisting of 5 separate pieces:

    1. ARABICenglishARABIC

    2. <e1 type='ab'>

    3. ARABIC2

    4. <e2 type='cd'>

    5. english

    To make the XML file readable, the display can order these pieces all in a uniform direction (e.g. all left-to-right), and apply the bidi algorithm to each piece separately. Alternatively, the process could treat each of the pieces of syntax as if it were a single inline object, e.g. as U+FFFC OBJECT REPLACEMENT CHARACTER, and apply the bidi algorithm to the whole.
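
    The two strategies in #4 can be sketched in Python roughly as follows. This is an illustration only: the names split_markup and replace_markup_with_objects and the simplified tag pattern are assumptions made for this example and are not part of the proposed text. The pattern does not handle comments, CDATA sections, or '>' inside attribute values. The bidi algorithm itself is not implemented here; the sketch only shows how the input to it would be prepared.

```python
import re

# Hypothetical, simplified pattern for one piece of markup syntax: '<' ... '>'.
TAG = re.compile(r"<[^<>]*>")

OBJECT_REPLACEMENT = "\uFFFC"  # U+FFFC OBJECT REPLACEMENT CHARACTER


def split_markup(source):
    """Split source into ('text', piece) and ('markup', piece) pairs, in order.

    Each 'text' piece would then be passed through the bidi algorithm
    separately, and the pieces laid out in one uniform direction.
    """
    pieces = []
    pos = 0
    for match in TAG.finditer(source):
        if match.start() > pos:
            pieces.append(("text", source[pos:match.start()]))
        pieces.append(("markup", match.group()))
        pos = match.end()
    if pos < len(source):
        pieces.append(("text", source[pos:]))
    return pieces


def replace_markup_with_objects(source):
    """Replace each piece of markup by a single U+FFFC.

    The result would be passed through the bidi algorithm as a whole, with
    each U+FFFC standing in for one inline object.
    """
    return TAG.sub(OBJECT_REPLACEMENT, source)


# The fragment formed by the five pieces listed above.
fragment = "ARABICenglishARABIC<e1 type='ab'>ARABIC2<e2 type='cd'>english"
pieces = split_markup(fragment)
flattened = replace_markup_with_objects(fragment)
```

    With the first strategy the display needs to keep track of which pieces are markup so that they can be shown in a fixed direction; with the second, the display has to map each U+FFFC back to the markup it stands for after reordering.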

When text using a higher-level protocol is to be converted to Unicode plain text, formatting codes should be inserted to ensure that the order matches that of the higher-level protocol.
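
As a rough illustration of that conversion for category #3 only, the sketch below wraps each styled span in the explicit embedding or override code that corresponds to it, followed by U+202C POP DIRECTIONAL FORMATTING. The span representation (text, kind, direction) and the name to_plain_text are assumptions made for this example. Category #2 is deliberately not covered, since there are no formatting codes that emulate it.

```python
# Explicit directional formatting codes defined by the bidi algorithm.
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE

# Hypothetical mapping from a (kind, direction) style to its opening code.
OPENERS = {
    ("embed", "ltr"): LRE,
    ("embed", "rtl"): RLE,
    ("override", "ltr"): LRO,
    ("override", "rtl"): RLO,
}


def to_plain_text(spans):
    """Convert (text, kind, direction) spans to plain text.

    kind is 'embed', 'override', or None for unstyled text. Styled spans are
    wrapped in the equivalent explicit code and terminated with PDF.
    """
    out = []
    for text, kind, direction in spans:
        if kind is None:
            out.append(text)
        else:
            out.append(OPENERS[(kind, direction)] + text + PDF)
    return "".join(out)


plain = to_plain_text([
    ("abc ", None, None),
    ("DEF", "embed", "rtl"),
    (" ghi", None, None),
])
```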