Proposed addition of AL MARK and LEVEL DIRECTION MARK

The UTC is considering proposals for two characters to help address various difficult issues in bidirectional text layout. These two characters are similar to the already-encoded LRM and RLM.

1. Proposed AL MARK (ALM)

The Unicode Bidirectional Algorithm (UBA), specified in Unicode Standard Annex #9, supports three strong Bidi_Class property values (also referred to as direction categories) for general text: L (Left-to-Right), R (Right-to-Left), and AL (Right-to-Left Arabic). R and AL differ in their effect on the resolution of the direction of subsequent characters in numeric expressions with Bidi_Class values EN, ES, CS, ET.

The UBA also currently provides two implicit directional marks: U+200E LEFT-TO-RIGHT MARK (LRM) and U+200F RIGHT-TO-LEFT MARK (RLM). These are invisible, zero-width characters that behave exactly like characters with Bidi_Class L and R, respectively. These characters are used to customize bidi text layout. They have no other semantic effect. As noted in UAX #9, “Their use is more convenient than using explicit embeddings or overrides because their scope is much more local.”
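The Bidi_Class values of the two existing marks can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard unicodedata module; the helper name bidi_class and the sample letters are chosen here for illustration only and are not part of the proposal.

```python
import unicodedata

LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK
RLM = "\u200f"  # U+200F RIGHT-TO-LEFT MARK


def bidi_class(ch: str) -> str:
    """Return the Bidi_Class short name of a single character."""
    return unicodedata.bidirectional(ch)


# Map each mark's character name to its Bidi_Class value.
marks = {unicodedata.name(m): bidi_class(m) for m in (LRM, RLM)}

# One sample letter for each of the three strong classes:
# Latin "a", Hebrew alef (U+05D0), Arabic alef (U+0627).
strong_samples = [bidi_class(c) for c in ("a", "\u05d0", "\u0627")]

print(marks)
print(strong_samples)
```

The lookup shows a mark for L and a mark for R, while the sample letters cover L, R, and AL.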

However, this set of implicit directional marks is missing an AL MARK (ALM), which is like the other two except that it has Bidi_Class value AL. This is needed in order to address some problems in the layout of numeric expressions. For example, consider an isolated field that should display a numeric expression in a way that would match what its layout would be if it were in the middle of Arabic text. To produce this layout, the field could use an ALM at the beginning of the numeric expression.
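The effect of a leading ALM on a numeric field can be illustrated with a simplified sketch of UBA rule W2, under which a European number (EN) whose nearest preceding strong type is AL becomes an Arabic number (AN). The sketch works on a list of Bidi_Class values rather than on characters, so the proposed ALM appears only as its class, AL. The function name apply_w2 and the sos parameter are illustrative assumptions, and the other rules of the algorithm are omitted.

```python
from typing import List

STRONG = ("L", "R", "AL")


def apply_w2(types: List[str], sos: str = "L") -> List[str]:
    """Simplified sketch of rule W2 over one run of Bidi_Class values.

    `sos` is the start-of-run direction, used when no strong type
    precedes a number. Other UBA rules are not modelled here.
    """
    result = list(types)
    last_strong = sos
    for i, t in enumerate(types):
        if t in STRONG:
            last_strong = t
        elif t == "EN" and last_strong == "AL":
            result[i] = "AN"  # EN after AL is treated as an Arabic number
    return result


# A numeric expression such as "1/2" in an isolated right-to-left field:
plain = apply_w2(["EN", "CS", "EN"], sos="R")
# The same expression with a leading character of class AL (the proposed ALM):
with_alm = apply_w2(["AL", "EN", "CS", "EN"], sos="R")
# A leading character of class R (such as RLM) does not have this effect:
with_rlm = apply_w2(["R", "EN", "CS", "EN"], sos="R")
```

In this sketch only the run that starts with AL has its digits reclassified, which is the difference between R and AL that an ALM would make available to an isolated field.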

If necessary, an ALM could be inserted right after an RLO or RLE to ensure that the override or embedding begins with an AL direction context.

Adding ALM does not require any new Bidi_Class values or any changes to the definitions or steps of the UBA.

The UTC would appreciate any feedback regarding this proposed addition and its possible impact on implementations.

2. Proposed LEVEL DIRECTION MARK (LDM), previously referred to as EMBEDDING LEVEL MARK (ELM)

There are many instances in which semi-structured text is composed of two or more fields separated by neutral or weak-directional characters, and the fields should be laid out in order of the paragraph direction (or more precisely, the current embedding direction). For example, numeric dates in Arabic often have a logical order of d/M/y:

Because ‘/’ has Bidi_Class value CS and the digits (whether EN or AN) are weakly left-to-right, such a sequence will always be laid out left-to-right. Adding RLM before each ‘/’ will force the date to always be laid out right-to-left, regardless of direction context. If the direction context is known in advance, then it is possible to insert RLM or not in order to generate appropriate behavior. However, it is impossible to create the correct behavior in all contexts. For example:
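The RLM workaround described in this paragraph can be sketched in Python as follows. The date string and the helper name force_rtl_fields are hypothetical and chosen only for illustration; the Bidi_Class lookups use the standard unicodedata module.

```python
import unicodedata

RLM = "\u200f"  # U+200F RIGHT-TO-LEFT MARK


def force_rtl_fields(text: str, sep: str = "/") -> str:
    """Insert an RLM before every separator in the string."""
    return text.replace(sep, RLM + sep)


date = "25/12/2009"  # illustrative d/M/y value, not taken from the proposal

# A European digit and the separator, as classified by the UCD.
classes = [unicodedata.bidirectional(c) for c in "5/"]

# An Arabic-Indic digit (U+0665) has a different numeric class.
arabic_indic_class = unicodedata.bidirectional("\u0665")

marked = force_rtl_fields(date)
print(classes, arabic_indic_class, marked.count(RLM))
```

The marked string carries a fixed right-to-left ordering of its fields, so the choice has to be made when the string is built rather than when it is displayed.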

To handle situations of this sort, it is proposed to have a character which behaves like LRM or RLM, but whose Bidi_Class value is dynamically re-assigned based on the direction associated with the current embedding level. If the embedding direction is L, it would behave like LRM, and if the embedding direction is R, it would behave like RLM.
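The intended behavior can be illustrated with a minimal sketch. No code point has been assigned to LDM, so the sketch uses a private-use character as a hypothetical stand-in; the function name resolve_ldm and the representation of a level run as a string plus a list of Bidi_Class values are likewise assumptions made for illustration.

```python
from typing import List

# Hypothetical stand-in for LDM: a private-use code point, since the
# proposal does not assign one.
LDM = "\ue000"


def resolve_ldm(text: str, types: List[str], level: int) -> List[str]:
    """Give each LDM in a level run the direction of its embedding level.

    Even levels are left-to-right (L); odd levels are right-to-left (R).
    All other characters keep the type they already have.
    """
    if len(text) != len(types):
        raise ValueError("text and types must have the same length")
    direction = "L" if level % 2 == 0 else "R"
    return [direction if ch == LDM else t for ch, t in zip(text, types)]


# "1", LDM, "/", "2" with the LDM initially treated as a neutral (ON).
run = "1" + LDM + "/" + "2"
types = ["EN", "ON", "CS", "EN"]

ltr = resolve_ldm(run, types, level=0)  # LDM acts like LRM
rtl = resolve_ldm(run, types, level=1)  # LDM acts like RLM
```

The same stored string thus resolves differently depending on the level at which it is displayed, which a fixed LRM or RLM cannot do.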

Implementation issues

To handle LDM, the optimum solution would normally be to define a corresponding new Bidi_Class value, and then update the UBA to handle this new category. It could then be used to override the Bidi_Class value of selected characters, which—in situations that permitted such overrides—could achieve the LDM behavior without insertion of extra mark characters.

However, per the Unicode Character Encoding Stability Policy, “The Bidi_Class property values will not be further subdivided”. There is no such restriction on changes to the bidi algorithm itself, though for implementation stability, changes that impact backwards compatibility should be avoided. This leaves several alternatives:

  1. Define no new Bidi_Class value for LDM; instead, give LDM the Bidi_Class value ON (Other Neutral). Then define a new rule for UBA:

    W0. Examine each level direction mark character (LDM) in the level run, and set the bidi type to L if the level is even, and R if the level is odd.

    This has some problems:

    • This would reintroduce the case of a UBA rule dealing with a specific character, rather than a Bidi_Class value. This was formerly the case with each of the embedding and override controls, but then (before the Bidi_Class values were frozen) new classes were introduced to avoid having the UBA deal with specific characters.
    • As noted above, if there were a separate Bidi_Class value for the LDM behavior, then this class could be used to override the Bidi_Class value of separators when applying the UBA to various special strings, and thus avoid adding LDMs in the middle of the data; not having a separate Bidi_Class value for the LDM prevents this usage.

    Not defining a separate Bidi_Class value for LDM will probably result in implementations effectively defining their own additional classes.

  2. Provide the LDM character, but make no provision in the UBA for its use. Instead, it would be available only as an invisible marker that could be tailored using higher-level protocols such as HL3.
  3. Leave the current Bidi_Class property values and UBA untouched, but instead define a new alternative set of Bidi_Class_V2 property values (including a value corresponding to LDM), and a corresponding UBA_V2 algorithm that handles them. This new algorithm and set of classes could then be extended to handle not only LDM, but also other bidi issues currently under discussion (such as URLs). The original UBA and Bidi_Class values would remain frozen for stability. This might lead to some interchange issues between platforms/applications using the old and new versions of the UBA; however, if the new version is designed carefully, these issues could be limited to areas in which interchange is already a problem (i.e., the areas for which the new behavior is intended).

The UTC would like feedback on which (if any) of these approaches is preferred.