L2/11-398

Accumulated Feedback on PRI #205

This presents accumulated feedback as of Oct 24. It is divided into three sections: Feedback on proposed AL MARK, feedback on proposed LEVEL DIRECTION MARK, and feedback on other issues with the UBA received in the course of these discussions.



I. Feedback on proposed AL MARK

1. Feedback on the name:

Ernest van den Boogaard: AL MARK still uses a mnemonic. A better name would be “RIGHT-TO-LEFT ARABIC MARK” (from the UAX#9 AL description). [“ARABIC LETTER MARK” was also suggested, but this is misleading as already described].
(Also supported by Asmus Freytag.)

2. Correction to background document:

Kent Karlsson: The background document says: “If necessary, an ALM could be inserted right after a RLO or RLE to ensure that the override or embedding begins with an AL direction context.” Actually, an ALM inserted after RLO would make no difference.
(Also agreed by Peter Edberg.)

3. General feedback:

Kent Karlsson questions the need to distinguish R and AL in the first place, see issue Karlsson1 in section III, other UBA issues.

Asmus Freytag: Because the proposed ALM does not alter the bidi algorithm itself, nor introduce any new bidi class, I see no principled objection to its introduction.

The cost to existing implementations is primarily the introduction of another character that must be ignored in display (i.e. should never be rendered with a missing glyph box). That problem is addressed by encoding it in one of the ranges reserved for the addition of such characters. Well designed implementations would then handle this new character without modification.

Note, unlike RLM and LRM, which can be found on some keyboard layouts, the ALM would not be accessible immediately for user input. That would limit its role, initially, to software-generated plain text.



II. Feedback on proposed LEVEL DIRECTION MARK

Note, before this public review issue was posted, there was earlier feedback on ELM from Matitiahu Allouche via document L2/11-306; and Kent Karlsson has recently submitted an updated UBA v2 proposal via document L2/11-377.

A. Introduction

1. Use cases:

Example from PRI: An Arabic numeric date of the form dd/MM/yyyy in which the fields should flow left-to-right (e.g. ٠٩/١٦/٢٠١١) in a left-right context (i.e. the date and perhaps some other Arabic text are in a mainly Latin-script paragraph), but should flow right-to-left (e.g ٢٠١١/١٦/٠٩) in a right-left context (e.g. a primarily Arabic-script paragraph). The date may or may not be preceded or followed by Arabic letters. If the direction context is known when the date format is created, then RLMs can be used if necessary to force the desired flow. However, if the date format is being created from standard data (as from CLDR) and inserted, or if it is copied from some other context, it may not end up laid out as desired. To address this, an LDM could be used before and after the date, and before each of the ‘/’ in the date. This would produce the desired result in all cases.

Example from UAX #9 section 5.6: See http://www.unicode.org/reports/tr9/#Separators; in this case an LDM could be used before each ‘-’ to achieve the correct layout regardless of overall page direction.

Note: The behavior of ‘/’ between digits depends on language & locale as well as usage. For example, numeric fractions represented as numerator/denominator always flow left-to-right in Hebrew regardless of direction context (e.g. 1/3), whereas in a right-to-left Arabic context they flow right-to-left (e.g. ٣/١ or preferably ١\٣). This affects the degree to which heuristics can be used to determine LDM-like behavior.

Note: The advantage of LDM is that in many cases it can be used without much awareness of or tailoring for the specific content with which it will be used. The disadvantage, of course, is the difficulty of integrating with the existing UBA. So it may turn out to be something that is added only if we go to a UBA v2 for other reasons as well.

Verdy example: Another example where there are ambiguities on how to resolve the direction of characters other than CS. Check this page on Wikisource (has text in French containing a comma-separated list of Hebrew words; commas are incorrectly placed): http://fr.wikisource.org/wiki/Page:Diderot_-_Encyclopedie_1ere_edition_tome_1.djvu/96

Verdy was not able to find a working solution for correct display in Chrome (see discussion below in section C, Opinions).

Edberg: Note that for UBA (not Chrome), LRM (or LDM) after each comma solves the problem.
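
As a minimal sketch of the workaround Edberg describes, the following Python inserts an LRM after each comma in a comma-separated list of Hebrew words. The function name and the sample words are hypothetical; whether a given renderer then displays the commas correctly is not something this code can check.

```python
LRM = "\u200E"  # LEFT-TO-RIGHT MARK

def add_lrm_after_commas(text: str) -> str:
    """Insert an LRM after every COMMA (hypothetical helper)."""
    return text.replace(",", "," + LRM)

# Three made-up two-letter Hebrew "words", written with escapes
hebrew_list = "\u05D0\u05D1, \u05D2\u05D3, \u05D4\u05D5"
marked = add_lrm_after_commas(hebrew_list)
```

The marked string is intended to be embedded in left-to-right (e.g. French) text, as in the Wikisource example.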

2. Overview of the following:

The remainder of this post first covers (in subsection B) alternatives to the proposal presented in the PRI.

It next covers (subsection C) the opinions expressed on UBA stability, and on the various proposals.

B. Alternate proposals

1. Proposal Verdy1, encode duplicate chars:

For the date example, encode another ‘/’ with bidi class R. More generally, encode new characters as necessary with another (existing) Bidi class. The problem: you might need to duplicate lots of characters - most whitespace, punctuation, and symbol characters - that are neither letters nor digits nor combining characters, and that don’t have a strong RTL or LTR directionality: much more than just CS characters. WG2 will most probably strongly oppose this UTC proposal.

Edberg: Not sure how this solves the problem. For example, a numeric date format using ‘/’ with bidi class R would always be laid out right-to-left, instead of adapting to the paragraph or embedding level direction as desired.

Verdy: [Encoding duplicates of existing characters with different direction classes] is infeasible, because it would require cloning an indefinitely large number of characters that would be visually identical to others; they would invariably be mixed up in text, resulting in unpredictable rendering. [I think this is a retraction of this proposal. -mod.]

2. Proposal Verdy2, change UBA for embeddings:

A two-part proposal:

Using RLE..PDF and LRE..PDF solves all ambiguities, and requires no additional Bidi classes to be encoded (or even “implied”), and requires no new Bidi control. It does require a change to UBA rules for embeddings. It provides the safest solution.

In absence of markup, the existing CS class between two numbers should ALWAYS resolve using the bidi class of these numbers (this means that “1.2” would always be considered as a single number, and “31/12” would always be a fraction).

To change this meaning (and the expected rendering order if the embedding paragraph is RTL), there’s only one way: isolate the numbers in LRE..PDF, so that it prohibits the propagation of their strong directionality to the separating character of class CS.

The whole sequence “LRE(number)PDF” then externally has a weak direction, just like the surrounding CS characters, whitespace, and other punctuations, and all these runs will need to take their direction from the embedding paragraph.

In my opinion, encoding the proposed
  ONE, TWO, LDM, SLASH, THREE, ONE
should render exactly the same thing as with existing controls in
  LRE, ONE, TWO, PDF, SLASH, LRE, THREE, ONE, PDF
for representing a contextually commutative date “12/31”; and for the non-commutative fraction “12/31” you can protect the expected direction of fields with
  LRE, ONE, TWO, SLASH, THREE, ONE, PDF

Including the slash within the embedded span achieves this, if the existing UBA is correctly implemented. It does not require any new character (LDM, or a duplicate SLASH), any new Bidi class, or any change to the UBA to treat CS characters specially.
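
The two sequences in this proposal can be built mechanically. The following is a sketch only, with hypothetical helper names; it constructs the strings and says nothing about how a particular UBA implementation lays them out (the "commutative date" behavior in particular depends on the proposed change to embeddings).

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def embed_ltr(s: str) -> str:
    """Wrap s in LRE..PDF."""
    return LRE + s + PDF

def commutative_date(fields, sep="/"):
    """Each field in its own LRE..PDF, separators left outside the embeddings."""
    return sep.join(embed_ltr(f) for f in fields)

def non_commutative_fraction(numerator: str, denominator: str, sep="/"):
    """Whole expression, separator included, in a single LRE..PDF."""
    return embed_ltr(numerator + sep + denominator)

date = commutative_date(["12", "31"])          # LRE 1 2 PDF / LRE 3 1 PDF
frac = non_commutative_fraction("12", "31")    # LRE 1 2 / 3 1 PDF
```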

The example in UAX #9 section 5.6 is handled if the field separator is <PDF, HYPHEN MINUS, RLE> or <PDF, SLASH, RLE> or <PDF, FULL STOP, RLE> or <PDF, COLON, RLE>, and the leader is RLE and the trailer is PDF.

(Substitute LRE for RLE everywhere if you want: this is equivalent if the numeric fields use European or Arabic digits; it may only change if a field is an abbreviation starting with a whitespace or variable character, or if it mixes LTR letters and Arabic digits.)

This does not only concern date values; it may also apply to times, phone numbers, numeric identifiers that use separators such as social security numbers, indexes of TOC entries, various sub-classification schemes used in book libraries or even technical protocols (including DNS or LDAP names)…

LRE..PDF and RLE..PDF also have a bijective mapping with the well-known HTML "dir=" attribute of inline elements, when it gets mapped into the equivalent CSS style property that can map this dir= attribute with bidi embedding values, so that these Bidi controls (strongly discouraged in HTML) can be avoided completely. This means that a date visually rendered as “12/31/2011” in an LTR-only document can be formatted in HTML as:
  <span dir="ltr">12</span>/<span dir="ltr">31</span>/<span dir="ltr">2011</span>
so that it will be reordered contextually as “2011/31/12” depending on the contextual direction of the inline text before it (or after it if there’s no strong direction set by previous inline content within the same containment block or embedded span, and remaining in weak direction if there’s no strong context at all in that block or span, so that the block or span will itself inherit the direction from a lower contextual embedding level, or from the default direction set by the document language if there is no further context).
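
A short sketch of generating that markup from a list of fields, using only the Python standard library. The function name is hypothetical, and the contextual reordering described above depends on the proposed interpretation of embeddings, not on this code.

```python
from html import escape

def date_as_spans(fields, sep="/"):
    """Wrap each field in <span dir="ltr">, leaving separators outside the spans."""
    return escape(sep).join(
        f'<span dir="ltr">{escape(field)}</span>' for field in fields
    )

markup = date_as_spans(["12", "31", "2011"])
```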

For this reason, I am convinced that, in absence of such embedding, the CS characters should always limit their context to only their immediate neighboring characters, so that “12/31/2011” will always keep the same direction of fields in all contexts, in absence of such markup by an external stylesheet, or Bidi controls, and that “12/31+2” will NEVER be reordered contextually (preserving the mathematical meanings of operators by which operands cannot freely commute).

Essentially I think LRE...PDF (and similarly for the other start bidi bracketings) should behave as if they had an inherent LDM (LEVEL DIRECTION MARK).

Edberg: With Verdy’s proposed change to the directionality associated with an entire embedding (that it has weak or neutral directionality, instead of strong directionality associated with the specified embedding direction), this would appear to work.

Karlsson: If one changes LRE etc. to have an inherent LDM *functionality*, an actual character for LDM is not needed, *nor* is a new bidi category needed. The function of an LDM character can then be achieved by <LRE, PDF> (or <RLE, PDF>, <LRO, PDF>, or <RLO, PDF>); note: empty string between the start and end bidi control codes.

I still think there are plenty of other reasons to go for a UBA v.2; also the change suggested here is probably best done in a UBA v.2 rather than in the current UBA.

3. Proposal Wordingham1, LRM and RLM instead of LDM:

LDM is superfluous; one can always mimic it with combinations of LRM and RLM. The disadvantage of not having LDM is that the alternative rules are complex. European digits (EN) are extremely complicated, as one has to consider the preceding strong character - L, R, AL or LDM.

For the examples given: Embed the common separators such as ‘/’ in RLM...LRM. This would ensure that they took on the directionality of the embedding.

I have demonstrated to my satisfaction that text with LDM can be converted to text without LDM that should display the same, under the following debatable assumptions:

Karlsson: I’m not at all sure the suggested workaround works in general, and not just in a few examples.
Edberg: I am not convinced that LDM can always be mimicked by a combination of existing controls; see discussion below. Also, Wordingham’s substitution rules & tables (sent to me separately) state “W0 can actually be deferred to between W6 and W7 without changing anything.” I do not agree; resolution of LDM to LRM or RLM in W0 could affect, for example, the change of EN to AN in W0 (this effect might be an improvement, but it would be a change nonetheless). The proposed ALM will help address these issues.

a) For date example:

Edberg: For an isolated instance of such a date, RLM...LRM around ‘/’ does work; having opposite strong directions on either side of the neutral forces it to take on the direction of the embedding level, by rule N2, and though the extra RLMs and LRMs will get moved around in layout depending on the embedding level, it does not matter because they are invisible.
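
The effect of rules N1 and N2 on a neutral can be sketched as follows. This is a deliberately simplified illustration, not an implementation of the UBA: it assumes weak types have already been resolved and that the direction of the nearest strong text on each side is already known as "L" or "R". The function name is hypothetical.

```python
def resolve_neutral(before: str, after: str, embedding: str) -> str:
    """Direction taken by a run of neutrals.

    before/after: direction ("L" or "R") of the strong text on each side.
    embedding:    direction ("L" or "R") of the current embedding level.
    """
    if before == after:
        return before      # N1: same direction on both sides wins
    return embedding       # N2: otherwise take the embedding direction

# '/' stored as RLM '/' LRM: strong R before, strong L after
in_ltr = resolve_neutral("R", "L", embedding="L")
in_rtl = resolve_neutral("R", "L", embedding="R")
```

With opposite strong directions on either side, the result follows the embedding level in both contexts, which is the behavior relied on above.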

However, it does not handle the situation in which the date is part of other text, possibly preceded or followed by Arabic letters (with an intervening space); there are layout interactions between the Arabic letters and adjacent Arabic digits, since the digits are not treated as being part of a longer sequence due to the direction marks associated with the ‘/’. This can be solved by placing an LDM before and after the date, as well as before each ‘/’. However, using an RLM LRM sequence before and after the date causes the spaces around the date to reorder.

Wordingham: The interaction between Arabic letters and Arabic digits that are part of the date occurs in a left-to-right embedding. The non-LDM solution in each case is to insert an LRM after the separating space.

Incidentally, if Arabic digits (AN, not EN) are used, the separators should be terminated by LRM on one side, not by both RLM and LRM.

The outer marks are to protect against adjacent R and AN in a *left-to-right* context:
Storage: ' ' LRM AN1 '/' LRM AN2 '/' LRM AN3 ' ' LRM
Display in left-to-right context: ' ' AN1 '/' AN2 '/' AN3 ' '
Display in right-to-left context (e.g. after AL in memory): ' ' AN3 '/' AN2 '/' AN1 ' ' (AL)

Actually, though, <LRM ' ' AN1 '/' LRM AN2 '/' LRM AN3 ' ' LRM> would be still better, as that also works if there is a preceding L in an RTL context.
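
The last storage sequence above can be assembled as in this sketch. The function name is hypothetical and the digit fields are sample values written with escapes; the code only builds the string and does not verify its display.

```python
LRM = "\u200E"  # LEFT-TO-RIGHT MARK

def guarded_date(an1: str, an2: str, an3: str) -> str:
    """Build <LRM ' ' AN1 '/' LRM AN2 '/' LRM AN3 ' ' LRM>."""
    return LRM + " " + an1 + "/" + LRM + an2 + "/" + LRM + an3 + " " + LRM

# Arabic-Indic digits for 09, 16, 2011 (bidi class AN)
stored = guarded_date("\u0660\u0669", "\u0661\u0666", "\u0662\u0660\u0661\u0661")
```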

However, if the text receiving calculated text is directional, I don’t think it unreasonable to require that the receiving text separate calculated text from elements in the receiving text with the contrary direction. (In this context, the directionality of AN is contrary to left-to-right text.) After all, a comma-separated list of Hebrew words in left-to-right text would naturally be written with a separator sequence such as <U+200E LEFT-TO-RIGHT MARK, U+002C COMMA, U+0020 SPACE>. The commas are to the left of the spaces.

Therefore, I think it reasonable to require the guarding LRMs above to be provided as part of the receiving left-to-right text. Because the date uses Arabic digits, no protection is needed in right-to-left text. Protection would be needed if the date used “European” digits, as it presumably would in Persian or Urdu text. (“EXTENDED” ARABIC-INDIC DIGITs are EN!)
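
The bidi classes of the digit sets mentioned here can be checked with Python's standard unicodedata module. A small sketch; the values reported come from whatever version of the Unicode Character Database the Python build ships with.

```python
import unicodedata

def digit_class(ch: str) -> str:
    """Return the Bidi_Class of a single character."""
    return unicodedata.bidirectional(ch)

for ch in ("0", "\u0660", "\u06F0"):
    # DIGIT ZERO, ARABIC-INDIC DIGIT ZERO, EXTENDED ARABIC-INDIC DIGIT ZERO
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {digit_class(ch)}")
```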

Edberg: While the last formulation above does appear to work in cases where spaces are needed, it is still the case that different formulations are needed depending on whether spaces are required. A date format using LDM as follows would work for all contexts (and with or without spaces):
  LDM AN1 ‘/’ LDM AN2 ‘/’ LDM AN3 LDM

b) For UAX #9 section 5.6 example:

Edberg: For the example in UAX #9 section 5.6, using RLM and LRM around the ‘-’ causes reordering of the adjacent spaces, while using LDM before each ‘-’ solves the layout problem.

Wordingham: Of course, the problem of spaces is cured if one uses <RLM, SP, HYPHEN-MINUS, SP, LRM> as the bounding delimiter.

Verdy: And no problem at all of ordering if the field separator is <PDF, HYPHEN MINUS, RLE> or <PDF, SLASH, RLE> or <PDF, FULL STOP, RLE> or <PDF, COLON, RLE>, and the leader is RLE and the trailer is PDF. [Per proposal Verdy2, see discussion above -mod.].

Edberg: The Wordingham formulation above does appear to work for either embedding/paragraph direction (i.e. this text can be inserted without knowledge of the context into which it is being inserted).

4. Proposal Karlsson1, UBA V2 with heuristics:

UBA V2 proposal: Make begin/end paired (i.e. bracketing) punctuation work (better) out of the box:

g.c. Ps, default for Pi: Have an implicit LRE or RLE just after the character and an implicit LDM just before the character. The choice between LRE and RLE is made in a similar way to determining the default paragraph direction, but only looking at an appropriate substring, and with the surrounding direction as default.

g.c. Pe, default for Pf: Have an implicit PDF just before the character and an implicit LDM just after the character.

Whether a Pi or Pf character is regarded as begin or as end character should be possible to override by a higher level protocol (f.i. a locale setting, including language tagging for parts of texts, specifying begin/end quote marks; though there is a difficulty when the begin and end quote marks are the same character).

UBA V2 proposal: make segmenting punctuation work (better) out of the box: g.c. Pd (and more, see below): If both adjacent characters to the Pd character do not have strong directionality (perhaps limit to SPACE), have an implicit LDM on both sides of the Pd. IIUC, this should make the Pd behave as a TAB (bidi category S), but at the embedding level, rather than the paragraph level.
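
The Pd condition above can be sketched as a simple scan. This is an illustration under stated assumptions, not part of the proposal: the function name is hypothetical, "strong" is taken to mean bidi class L, R or AL, and a missing neighbor at the start or end of the text is treated as not strong.

```python
import unicodedata

STRONG = {"L", "R", "AL"}

def implicit_separator_positions(text: str) -> list:
    """Indices of Pd characters whose adjacent characters are not strong."""
    positions = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) != "Pd":
            continue
        neighbors = []
        if i > 0:
            neighbors.append(text[i - 1])
        if i + 1 < len(text):
            neighbors.append(text[i + 1])
        if all(unicodedata.bidirectional(n) not in STRONG for n in neighbors):
            positions.append(i)
    return positions

spaced = implicit_separator_positions("\u05D0 - \u05D1")   # hyphen between spaces
tight = implicit_separator_positions("\u05D0-\u05D1")      # hyphen between letters
```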

Rename S, “Segment Separator” (currently the characters TAB/HT, VT, and INFORMATION SEPARATOR ONE/US) to “Paragraph Level Segment Separator”; and refer to a character with implicit LDMs, but no implicit LRE/RLE/PDF, as an “Embedding Level Segment Separator”.

Higher level protocols should be able to set other characters (than Pd with no strong bidi character directly adjacent) as Embedding Level Segment Separators. For instance, for domain names and URLs, FULL STOP, SOLIDUS and a few more URL punctuation characters should be handled as Embedding Level Segment Separators (for this to work well, there needs to be an implicit LRE or RLE at the beginning of the URL and an implicit PDF at the end of the URL). One should also consider making COMMA, SEMICOLON, FULL STOP, QUESTION MARK, and similar characters (when followed by SPACE or an ending paired punctuation) into Embedding Level Segment Separators in general, now that most LRE/RLE/PDF are implicitly inserted by bracketing punctuation.

It is important that this works in the vast majority of cases *without* the *explicit* insertion of invisible (bidi control) characters. The latter is highly user unfriendly, and is unlikely to work well with cut-n-paste.

I realize that going for a “bidi V2” is a major step, but I think it is called for for several reasons, particularly the ones mentioned above.

Note1: It is also possible to use a different analysis for special cases, e.g. domain names or URLs (if detectable somehow, e.g. via markup).

Note2: It is not the case that all bidi control characters can be avoided in all cases using my suggestion. But a great many cases, many that surprise users, would with the implicit bidi control approach work with much less surprise, and no need to insert explicit bidi controls (something which is not so easy).

Verdy: And how will you define what is an “implicit” LDM? For example, “1.2” has two interpretations (a single number with a decimal separator, where the fields of digits have a fixed relative order and the dot is part of that number; or a notation of two distinct numbers in fields separated by the dot, the fields being assumed to be displayed in the same direction as the embedding paragraph). The same goes for “31/12” (is it a date made of two fields to render in the embedding direction, or a fraction whose operands must NOT commute?).

As this is impossible to determine, I really think that in absence of markup, the existing CS class between two numbers should ALWAYS resolve using the bidi class of these numbers (this means that “1.2” would always be considered as a single number, and “31/12” would always be a fraction).

To change this meaning (and the expected rendering order if the embedding paragraph is RTL), there’s only one way: isolate the numbers in LRE..PDF per proposal Verdy2.

Edberg: Considering for the moment just the portion above dealing with “segmenting punctuation” (Pd etc.) and segment separators, since that is related to the intent of LDM: Karlsson suggests the addition of a new bidi class “Embedding Level Segment Separator”, which basically has the directional behavior of LDM, and could be applied to any character (for example ‘/’ in the date example) as an override. This aligns with option 3 in the background document (and once there is an LDM-like bidi class, I think it is a small step to actually encode a character like LRM or RLM that has this new class). Mr. Karlsson further suggests that characters of general category Pd, at least if they are surrounded by space, should behave as if they have this “Embedding Level Segment Separator” class. This would mean that the example in UAX #9 section 5.6 would be handled without explicit LDM. This is an interesting idea, but I have not examined all of the implications.

Note: Kent Karlsson has submitted an updated UBA v2 proposal as UTC/L2 document L2/11-377.

5. Proposal Karlsson2, reinterpret S:

Another possibility, as long as we are just “brain-storming” a bit here, is to use the bidi category S (Segment Separator) for the LEVEL DIRECTION MARK (which would be a normally invisible (bidi) format control character). I.e. it would work just like TAB (as specified in the UBA), except that it wouldn’t do tabbing. But then it would work only for the paragraph bidi direction. However, the idea that TAB (and the other bidi S characters) magically cuts through *all* nested bidi levels seems a bit strange to me... Going just to the closest explicit embedding/(override) level seems less drastic. Without formally subdividing “S”, one could treat different “bidi S” (old and new) to reset to different levels (to the embedding bidi level for the new one, and to the paragraph bidi level for the three old ones). (I know, this would be a form of “option 1” in the PRI.)

C. Opinions on UBA stability and PRI options

1. Opinion Freytag1a:

Adding controls would imply the creation of new bidi classes for them, and giving up stability to that degree would be a serious issue.

Stability is paramount for predictability. You need to be able to predict what your reader will see, and you will only be able to do that, when you can rely on all implementations agreeing on the details of how to lay out bidi.

Introducing any new feature now, will result in decades of implementations having different levels of support for it. This makes the use of such a new feature unpredictable - and is a problem whether there was a formal stability guarantee or not.

The LDM strikes me as a not insignificant departure from the basic bidi algorithm and essentially in contradiction to the spirit if not the letter of the stability guarantees. It brings with it, therefore, the risks of instability and incompatibility. Because UBA is such a basic and strongly required algorithm, stability guarantees are especially important. This includes the implicit guarantee that the bidi classes are the complete description of a character under the UBA.

2. Opinion Freytag1b:

Leaving aside the question whether the LDM as proposed is the right approach to address the supposed problem, it should definitely be put off until such time as the bidi algorithm itself (including the set of classes) can be versioned (e.g. UBA 2.0) or extended by formal extension (privileged higher level protocol, or Super-Bidi).

The idea here would be that “generic” implementations would continue to be required to support the Bidi algorithm as it exists today, while opening the option for certain contexts to require UBA 2.0 or Super-Bidi - usually this would be in the context of some higher level protocol. I’ve used the term “privileged” here because UBA 2.0 or Super-Bidi would offer different building blocks than the regular bidi algorithm (UBA 1.0), therefore allowing behaviors that consenting implementations would be hard pressed to achieve by the currently allowed overrides.

This could include additional bidi classes, as well as additional (or modified) rules.

If a large enough set of “thorny” problems were addressed by such versioning or extension, the benefit would be that “advanced” implementations would again have to implement only a common algorithm and not dozens of different, context-specific enhancements.

3. Opinion Verdy0 and discussion:

The stability policy was published too soon before solving evident problems.

Freytag: I disagree. True plaintext bidi will always be a compromise, because there’s a lack of information on the intent of the writer. (In rich text, you can supply that with styles). There’s a limited workaround with bidi controls, but that’s beginning to be a form of minimal rich text in itself.

4. Opinion Davis1 and discussion:

The bidi stability clause, in retrospect, was badly written. It doesn’t prevent breaking changes to the BIDI algorithm, but does complicate extensions. I think it was a fallout from the bad experience we had with the GC, where we decided not to add new property values because people had switch statements based on the old ones. Because the GC logically had a hierarchy (Symbol, Punctuation, Letter,...) but didn’t actually incorporate that structure, moving characters from one punctuation subtype to another would cause them to be not recognized as punctuation by old implementations.

It is quite a different matter when introducing a new type of character, and only applying it to a new character. Old BIDI implementations wouldn’t recognize the new character, but if they were updated to the new version of Unicode — with an attendant, minor, code change — they could work with the new character. Of course, like other cases with the introduction of a new character, it would be some time before the majority of major implementations supported it, and it could be generally used.

We can, however, respect the stability clause in 2 alternative ways. One is to have the BIDI algorithm depend on the character code, not the BidiClass. The other is to define a new BidiClassExtension that has the new code. There are pluses and minuses to each approach.
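
The first alternative (keying on the character code rather than the Bidi_Class) might look like the following sketch. Everything specific here is an assumption for illustration: LDM has no assigned code point, so a private-use code point stands in for it, and the "LDM" label and function name are hypothetical.

```python
import unicodedata

# Hypothetical stand-in: LDM is not encoded, so a private-use code point is used.
HYPOTHETICAL_LDM = "\uE000"

# Behaviors keyed on code point, consulted before the Bidi_Class property.
CODE_POINT_BEHAVIOR = {HYPOTHETICAL_LDM: "LDM"}

def effective_bidi_type(ch: str) -> str:
    """Code-point override if one exists, otherwise the ordinary Bidi_Class."""
    return CODE_POINT_BEHAVIOR.get(ch, unicodedata.bidirectional(ch))

types = [effective_bidi_type(c) for c in HYPOTHETICAL_LDM + "12/31"]
```

Under this shape the set of Bidi_Class values stays unchanged, but an implementation that only updates its property tables would not pick up the new behavior.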

Freytag: The way I have always parsed the “spirit” of the stability guarantees for the bidi algorithm is that it was stable — except as to the additions of new letters. The policies effectively guaranteed that implementations could be written in a way that only required updating the property tables to account for new characters (leaving aside the occasional “bug fix”).

My argument is that, given the universal requirement to support this particular bidi algorithm for the sake of predictability and interoperability, this was a beneficial state of affairs. From the outset, the tradeoff was made that, for example the character ‘/’ could be supported either as the date separator or as the math operator, but not both. One or the other usage always would need overrides.

In addition, the default character properties were designed such that the addition of characters would cause minimal disruptions. Any strong character would be assigned in an area matching its directional property with the earlier default property value. While the same was not true for punctuation and numeric characters, the implied hope was that at least the edge cases that would show up their different behaviors were infrequent.

As a result, you could expect any existing implementation to show the same ordering for the vast majority of texts containing characters beyond the ones that it was explicitly updated for.

Giving a new character a totally novel bidi class (or behavior) destroys this interoperability. The good majority of texts containing this new character would be ordered differently by a downversion implementation. That’s especially of concern, because the new character would have been added, deliberately, to achieve a specific effect.

This approach (as well as its proposed alternate) would do away with the implicit guarantees of interoperability that are inherent in not only the particular stability policies, but the larger attempts to make the UBA cross-version interoperable as much as possible.

Further, no matter which route was chosen, this change would destroy reliance on a particular maintenance strategy that had been implicitly blessed by the Consortium (change property tables only). While the changes to each implementation simply to account for the LDM might be small, the problem is that there exist too many implementations, and there is often no good way to know which implementation a text is viewed by.

For these reasons, I argue that any such disruptive change, where necessary, needs an explicit version of the bidi algorithm (as opposed to just a new version of the Standard). It would be a “new” UBA, UBA-2.0 or whatever you’d like to call it. You’d probably be best off collecting additional changes, such as the ones proposed by Kent Karlsson. In addition, the use of this “Super UBA” needs to be embedded in certain Higher Level Protocols (such as HTML5.x) so that users have a chance of predicting which environment supports the new features.

In the particular case of the LDM, I’m not convinced that its design is final enough to spend time on it. There are too many alternative suggestions that merit investigation before putting this up for a decision. I would expect the UTC not to decide on any of these at this round but direct someone to arrive at a consolidated proposal for more focused public review.

5. Opinion Verdy1:

Adding ELM or any control is effectively creating a new Bidi class, because it requires all existing UBA implementations to be updated to accept the new behavior. The ELM proposal breaks the stability promised for the bidi classes and UBA. None of the proposed 3 solutions are in fact acceptable if stability is required.

6. Opinion Verdy2 and discussion:

The stability of Bidi classes should only concern existing characters, it should not limit new characters (or new scripts) that may need new Bidi classes, as it would not break existing texts rendered in existing implementations. The stability is just meant to NOT break any bidi rendering of existing fonts that use assigned characters. For existing unassigned code points, there’s simply never been any stability warrantied for any property, so you can assign the properties much more freely.

I am convinced that if you need new characters, the only good question is which ones?

The second option is certainly the least disturbing (and the most economical in terms of encoding, and the most likely to be accepted without much trouble by voting NBs in WG2).

Edberg: This seems to contradict the statements in Verdy1 above, and suggests that you would have no objection to a new character LDM with a new bidi class.

It does not break the policy on ANY existing encoded texts. It gives NO surprise to users, or at least they know that something is missing, and their decision for what to do will be exactly like when they are presented newly encoded texts containing newly assigned characters for which they still don’t have a supporting font or any support in their existing renderer for the complex shaping/layout features required by a newly encoded script.

In other words, the UTC policy about the stability of Bidi classes should be minimally relaxed, by rewording into something like the following, which is similar to the rule of immutability of other normative character properties of assigned code points (code point value, character name, decomposition mapping ...):

“The bidi class property value of any assigned code point is IMMUTABLE (and will never change for the same assigned code point in any subsequent versions of the UCS).”

I can accept that the full set of possible values for the general category is restricted and inextensible, because these categories are frequently used in algorithms where the GC is supposed to be fully partitioned with a constant number of elements (a fixed enumeration) for implementing lots of other algorithms or derived properties. But the Bidi class for characters is just meant for the rendering, and has no other use than implementing the UBA itself; it should never be used for any exclusive yes/no decision.

With respect to proposal Karlsson2: If you want to encode new characters, why would you restrict yourself to reusing an existing bidi class just to break it? Instead of speaking about the poorly defined concept of “splitting the bidi classes” - if you add a new bidi class for new characters, you effectively never split any existing bidi class, and you don’t break the IMMUTABILITY rule I gave above.

Karlsson: The stability guarantee says “The Bidi_Class property values will not be further subdivided.” I’m not too keen on the word “subdivided” here, but it (here) means there will be *no additions* to the set of values for the Bidi_Class property. Not even for new characters.

As far as I can tell, there is no restriction saying that the bidi algorithm cannot look at code points as well as bidi category values.
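One way to picture an algorithm that looks at code points as well as class values is a lookup that returns both the standard class and a flag for a specially handled character. This is only a sketch: `HYPOTHETICAL_LDM` and `bidi_lookup` are invented names, and the private-use code point U+E000 is a stand-in, not a proposed assignment.

```python
import unicodedata

# Hypothetical stand-in: a private-use code point plays the role of a
# proposed control character. No real assignment is implied.
HYPOTHETICAL_LDM = 0xE000


def bidi_lookup(ch):
    """Return (bidi_class, is_special) for a single character.

    The class comes from the standard property data. The flag marks a
    character that an algorithm could treat specially by code point alone,
    without any new value being added to the set of bidi classes.
    """
    return unicodedata.bidirectional(ch), ord(ch) == HYPOTHETICAL_LDM
```

The set of class values stays closed in this arrangement; the special handling is keyed on the code point itself.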

Verdy: That’s absolutely not the way I understand it, notably if you consider the term “further”, which references what was done before, where subsets of characters that were listed in the same class have been later partitioned into separate classes, before the policy was adopted. I am not advocating changing the bidi classes already assigned to characters, just that currently unassigned characters are already outside of any one of these classes.

Wordingham: Re “The bidi class property value of any assigned code point is IMMUTABLE”: At most it should become immutable after being unchanged for, say, 20 years. It is unwise to prohibit correction.

Verdy: Not needed. Even if there’s an error, it will be much better to re-encode the character with a new Bidi class, and not break the many texts already containing the character (note we’re discussing widely used characters such as generic punctuation). Corrections should be made in the early stages of beta releases, or based on the documents that led to the initial approval for encoding. They should come only as errata issued extremely fast (one or two months?), and caused only by discovered editorial problems which contradict the prior decision. After 20 years, as you suggest, the cost of correcting the encoded documents would be excessive, and the correction would not be applied consistently for about the same time or more. This would mean at least about 30 years of instability, and a lot of data loss in that period (which will be extremely hard to estimate in time and in total cost borne by all users of the UCS).

7. Opinion Verdy3:

[Per proposal Verdy2, Verdy feels it is OK to change the current UBA so that the embeddings have a different effect on the direction resolution of characters outside the embeddings - i.e. the embeddings as a whole would not have strong direction. He prefers this to any of the other proposals, and feels that this would have less of an effect on stability and existing implementations. Not sure how this fits with the opinions expressed in Verdy1 and Verdy2 above. -mod.]

Proposal Karlsson1 breaks the UBA in a non-conforming and incompatible way. I’m now sure that LDM is not even needed if the UBA is implemented correctly [presumably per proposal Verdy2 - mod.]

If browsers already have problems in correctly implementing the UBA [as with the Verdy example in section A.1 -mod.], it will be even more difficult to convince browser authors to make new adjustments if you change the behavior of CS characters in the UBA… [presumably referring to proposal Karlsson1 -mod.]

But it will be far easier to convince them to accept a new character that uses one of the existing Bidi classes, even if the character is superficially the same (but you’ll get strong opposition from WG2, which will be hard to convince if you want to disunify some characters with a duplicate encoding only for a distinct Bidi property).

Even better is to correct the UBA to get the expected full restoration of context by PDF, rather than adding a new LDM (and an associated new class) which will still require a change in the UBA to be effective, and that will also break the Unicode stability rule.

Edberg: Hmm, per opinion Verdy2 you were suggesting that adding a *new* character with a new Bidi class would *not* break the stability rule.

8. Opinion Verdy4:

[Encoding duplicates of existing characters with different direction classes] is infeasible, because it would require cloning an indefinitely large number of characters that would be visually identical to others; they would invariably be mixed up in text, resulting in unpredictable rendering. [I think this is a retraction of proposal Verdy1. -mod.]

9. Opinion Karlsson1:

Option 1 [no new class, change UBA to handle LDM specially] is likely to result in implementations effectively defining their own additional classes, as noted.

Option 2 [define LDM for higher-level protocols only] is a no-go. I do see other uses for higher level protocols affecting bidi processing, though. But not this way, just for doing a one-off circumvention.

Regarding option 3 [new UBA v2], it would also be an opportunity to do away with the difference between R and AL (and hence make ALM moot), and maybe also remove the EN and ES bidi classes (for “V2”).

Indeed, if going by option 3, we could take the opportunity to improve the bidi algorithm (V2). Some things don’t work as they should do “out of the box”. It is often said “just leave it to the bidi algorithm, it will do the right thing”. But it much too often does NOT do the right thing. Two major defects are detailed in Unicode Standard Annex #9, in sections 5.5 and 5.6. The bidi algorithm has glaring deficiencies that I think would be best handled by going for option 3 UBA v2, where these glaring deficiencies can be addressed; to a large extent by the use of *implicit* LDMs (and *implicit* LRE/RLEs and PDFs).

10. Opinion Karlsson2:

All the workarounds w.r.t. LDM depend on the directionality of neighboring characters, not directly on the embedding level direction. Therefore I think none of them will work properly in all cases (even though they may give the seemingly correct result in many cases). And they all require an inordinate amount of insertion of bidi control characters. (Much better to have *fewer* bidi control characters and still get a desirable display.)

Verdy: Marking the slash in “12/31” with LDM will not solve any ambiguity. The only safe way is to use embedded levels.
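For readers who want to inspect the “12/31” case, the sketch below lists the bidi class of each character and wraps the string in an explicit left-to-right embedding, which is one way to read “use embedded levels”. The helper names `bidi_classes` and `embed_ltr` are invented for this illustration, and the choice of LRE rather than RLE is an assumption.

```python
import unicodedata

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def bidi_classes(text):
    """Return the Bidi_Class of each character in `text`."""
    return [unicodedata.bidirectional(ch) for ch in text]


def embed_ltr(text):
    """Wrap `text` in an explicit left-to-right embedding (LRE ... PDF)."""
    return LRE + text + PDF


print(bidi_classes("12/31"))  # ['EN', 'EN', 'CS', 'EN', 'EN']
```

The slash is a common separator (CS) between European numbers (EN), so its resolved direction depends on its neighbors; the embedding fixes the direction of the whole run instead.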


III. Other issues with UBA

1. Issue Verdy1, weak direction embedding

I’d like to add a remark: we can embed strong LTR or strong RTL sequences (with LRE/RLE...PDF), but there’s currently no way to embed sequences that should start with characters of weak direction (in fact all characters except letters). The RLE or LRE start is too strong; we would also need a WDE control, for Weak Direction Embedding, where the internal substring would adopt the direction of the first strong character in it and, if none is found, would inherit the direction of the outer text, recursively. Maybe we could emulate it using RLE,B..PDF or LRE,B..PDF (but with which B character?). I know this can be tricky, because the UBA currently starts by splitting paragraphs independently, dropping all contexts existing before them, before resolving direction levels.
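The direction rule suggested for WDE can be sketched as follows, reading it as “the first character with strong direction decides, otherwise the outer direction is inherited”. This is a minimal illustration of that reading only: the function name is invented, nested embeddings inside the run are ignored, and the outer direction is simply passed in by the caller.

```python
import unicodedata


def weak_embedding_direction(text, outer):
    """Direction of a hypothetical WDE run.

    Returns "L" or "R" from the first strongly directional character in
    `text`; if there is none, returns `outer`, the direction of the
    enclosing text.
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "L"
        if cls in ("R", "AL"):
            return "R"
    return outer
```

For example, a run such as “(12) abc” resolves to left-to-right whatever the outer direction is, while “12/31” contains no strong character and takes the outer direction.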

I wonder how the WDE..PDF feature would be best supported (or emulated) if using existing Bidi controls (including by implicit insertions, for example between paired bracket punctuation marks Pi and Pf).

Edberg: Deborah Goldsmith has also suggested a “native direction embedding” which is like LRE/RLE but uses the inferred primary direction of the embedded text. I will try to put together a proposal about this for the next UTC.

Karlsson: And [native direction embedding] is indeed what I suggested for ‘(’, ‘[’ and other beginning punctuation (general category Ps and default for Pi) in my response to the PRI, *without* necessarily actually having a new *control character* for it. Ending punctuation (Pe and default for Pf) would in my suggestion act like bidi PDF. Note that the beginning and ending punctuation also must take on the current (surrounding) embedding directionality, which unfortunately LRE/RLE/PDF characters by themselves don’t do in current UBA; and one must of course not do rule X9 for characters that aren’t pure bidi controls.

2. Issue Verdy2, interlinear annotations:

I have another problem with the Bidi algorithm: it does not work correctly with interlinear annotations, for exactly the same reason. The text between IAS and IAT should be embedded internally with a weak direction, and the direction of the text after IAT should not depend at all on the text between IAS and IAT, but only on the text between IAA and IAS, or before the annotation anchor. So IAA can be ignored by the Bidi algorithm and IAT should behave like PDF, but which of the LRE or RLE classes should IAS use? IAS would also need the same class as the missing WDE described in the previous paragraph!

The only very poor way to create a sort of WDE would be to encode RLE or LRE immediately followed by a paragraph separator (CR, LF, NL, PS) whose vertical advance should be canceled in the rendering, because paragraphs always start with a weak direction for Bidi processing.

So to handle interlinear annotations, you would insert two Bidi classes for this character, instead of just one:

And I’d like to propose that the WDE sequence be encoded as <LRE, NL>, and with the intent that the NL (or CR or LF, or CR+LF or PS) would not be rendered with its implied vertical advance in the context of a previous LRE or RLE.
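The emulation relies on the listed separators all having bidi class B (paragraph separator). The sketch below checks this against the property data and builds the proposed sequence. It assumes that “NL” means U+0085 NEXT LINE; the names `CANDIDATES`, `separator_classes` and `emulated_wde` are invented for the illustration.

```python
import unicodedata

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING

# Assumption: "NL" in the text above is taken to be U+0085 NEXT LINE.
CANDIDATES = {"CR": "\u000D", "LF": "\u000A", "NL": "\u0085", "PS": "\u2029"}


def separator_classes():
    """Map each candidate separator name to its Bidi_Class."""
    return {name: unicodedata.bidirectional(ch) for name, ch in CANDIDATES.items()}


def emulated_wde(separator=CANDIDATES["NL"]):
    """Build the proposed <LRE, separator> sequence standing in for WDE."""
    return LRE + separator


print(separator_classes())  # {'CR': 'B', 'LF': 'B', 'NL': 'B', 'PS': 'B'}
```

Whether a renderer would suppress the vertical advance after such a sequence is exactly the part that would need a change to the UBA or to implementations; the sketch only shows the character data involved.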

Both proposals do not require changes in existing Bidi classes, but still changes in the UBA to handle these special (forgotten) sequences:

Note that the equivalent of interlinear annotations in HTML is the ruby notation, and in CSS the absolutely positioned blocks with “display:inline-block;position:absolute”. These suffer from the same problem in the UBA (note that ruby notations are frequently used to insert interlinear transliterations into a different script that may have a different direction than the script used in the annotated text).

Karlsson: I would agree that INTERLINEAR ANNOTATION SEPARATOR should act (for bidi) as <weak/native directional embedding> (i.e. implicitly have LDM before, and <LDM, WDE> after), and INTERLINEAR ANNOTATION TERMINATOR should act (for bidi) like PDF (i.e. implicitly have <PDF, LDM> before and LDM just after the interlinear annotation terminator).

3. Issue Verdy3, supporting embeddings in higher-level protocols:

The UBA specification should be more specific about how to support the various Bidi embedding options in “higher-level protocols”, instead of using Bidi controls in plain-text, notably:

This would require investigation with the W3C, notably the HTML5, SVG and CSS working groups.

4. Issue Karlsson1, distinguishing R and AL:

Distinguishing R and AL is questionable. It has some (very subtle) effect on “European” numbers. I’m not sure what it is. I haven’t been able to find any comprehensible account of the advantage/difference, or even examples (I don’t trust ready-made bidi handling on any system). If doing the “extra step” done for AL is advantageous (in some way), why would it not also be advantageous for R letters? On the other hand, if this extra step is not needed in the context of R letters, why should it be done when the context is AL letters? And why handle “European” digits (digits in ASCII) specially, and different from other digits?
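For reference, the step in question is rule W2 of UAX #9: a European number (EN) becomes an Arabic number (AN) when the nearest preceding strong type is AL, but not when it is R or L. The sketch below is a simplified illustration of that one rule, assuming the input has already been reduced to a list of bidi class names for a single run; `apply_w2` and its `sos` parameter (the start-of-run direction) are names chosen for this example, and the other weak-type rules are not modeled.

```python
def apply_w2(classes, sos="L"):
    """Simplified UAX #9 rule W2 over a list of bidi class names.

    Scans left to right, remembering the most recent strong type
    (L, R, AL, or the start-of-run type `sos`). Each EN whose most recent
    strong type is AL is changed to AN; everything else is left as-is.
    """
    result = list(classes)
    last_strong = sos
    for i, cls in enumerate(classes):
        if cls in ("L", "R", "AL"):
            last_strong = cls
        elif cls == "EN" and last_strong == "AL":
            result[i] = "AN"
    return result


print(apply_w2(["AL", "EN"]))  # ['AL', 'AN']
print(apply_w2(["R", "EN"]))   # ['R', 'EN']
```

After this step, the later rules for separators and terminators treat EN and AN differently, which is where ASCII digits following Arabic letters can end up handled differently from the same digits following Hebrew letters.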