Note, before this public review issue was posted, there was earlier feedback on ELM from Matitiahu Allouche via document L2/11-306.
A. Introduction
1. Use cases:
Example from PRI:
An Arabic numeric date of the form dd/MM/yyyy in which the fields should flow left-to-right (e.g. ٠٩/١٦/٢٠١١) in a left-to-right context (i.e. the date and perhaps some other Arabic text are in a mainly Latin-script paragraph), but should flow right-to-left (e.g. ٢٠١١/١٦/٠٩) in a right-to-left context (e.g. a primarily Arabic-script paragraph). The date may or may not be preceded or followed by Arabic letters. If the direction context is known when the date format is created, then RLMs can be used if necessary to force the desired flow. However, if the date format is being created from standard data (as from CLDR) and inserted, or if it is copied from some other context, it may not end up laid out as desired. To address this, an LDM could be used before and after the date, and before each of the '/' in the date. This would produce the desired result in all cases.
Example from UAX #9 section 5.6:
In this case, an LDM could be used before each '-' to achieve the correct layout regardless of overall page direction.
Note:
The behavior of '/' between digits depends on language & locale as well as usage. For example, numeric fractions represented as numerator/denominator always flow left-to-right in Hebrew regardless of direction context (e.g. 1/3), whereas in a right-to-left Arabic context they flow right-to-left (e.g. ٣/١ or preferably ١\٣). This affects the degree to which heuristics can be used to determine LDM-like behavior.
Note:
The advantage of LDM is that in many cases it can be used without much awareness of or tailoring for the specific content with which it will be used. The disadvantage, of course, is the difficulty of integrating with the existing UBA. So it may turn out to be something that is added only if we go to a UBA v2 for other reasons as well.
Verdy example:
Another example where there are ambiguities on how to resolve the direction of characters other than CS. Check this page on Wikisource (has text in French containing a comma-separated list of Hebrew words; commas are incorrectly placed): http://fr.wikisource.org/wiki/Page:Diderot_-_Encyclopedie_1ere_edition_tome_1.djvu/96
Verdy was not able to find a working solution for correct display in Chrome (see discussion below in section C, Opinions).
Edberg: Note that for UBA (not Chrome), LRM (or LDM) after each comma solves the problem.
2. Overview of the following:
The remainder of this post first covers (in section B) alternatives presented to the proposal in the PRI:
- Verdy suggests either encoding duplicates of existing characters with different bidi classes, or preferably changing how the UBA treats embeddings.
- Wordingham suggests that LDM is unnecessary, and can always be replaced by suitable combinations of LRM and RLM (though sometimes this is complex).
- Karlsson suggests a UBA v2 that uses heuristics to provide better implicit handling for the cases at issue.
It next covers (section C) the opinions expressed on UBA stability, and on the various proposals. Finally, it covers (section D) three other bidi issues that were raised during the course of the discussions.
B. Alternate proposals
1. Proposal Verdy1, encode duplicate chars:
For the date example, encode another '/' with bidi class R. More generally, encode new characters as necessary with another (existing) Bidi class. The problem: you might need to duplicate lots of characters - most whitespace, punctuation, and symbol characters - that are neither letters nor digits nor combining characters, and that don't have a strong RTL or LTR directionality: much more than just CS characters. WG2 will most probably strongly oppose this UTC proposal.
Edberg:
Not sure how this solves the problem. For example, a numeric date format using ‘/’ with bidi class R would always be laid out right-to-left, instead of adapting to the paragraph or embedding level direction as desired.
2. Proposal Verdy2, change UBA for embeddings:
A two-part proposal:
- Change the UBA so that entire embeddings created by RLE..PDF or LRE..PDF are treated as directionally neutral [or weakly directional?] for the purpose of resolution of levels outside the embedding. Thus CS separators (and in fact any other separators, including ‘-’ or ‘.’ or ‘:’ commonly found as date/time field separators), would not be influenced by the resolved direction of the internal content of the RLE..PDF or LRE..PDF fields. Embeddings should just set the direction to be used internally, hiding this detail to the outside. Both sequences should behave externally as if they were a single character with weak Bidi class. Otherwise they are not really “embedding”. Even the name “pop directional format” is misleading in this case because it actually does not restore the state that was before the state pushed by LRE/RLE.
- Then (e.g. for the date example) embed those numeric fields in RLE..PDF or LRE..PDF (you can choose them arbitrarily, independently of the numeric characters used in those fields; or even if those fields contain letters such as an abbreviated month name, or a CJK telegraphic abbreviation for month numbers or year numbers; but if the content of the field contains itself some whitespace or variable characters, the choice of LRE..PDF or RLE..PDF would not be without importance, for the inner presentation of the content of this field, but would still have no influence outside of the field).
Using RLE..PDF and LRE..PDF solves all ambiguities, and requires no additional Bidi classes to be encoded (or even “implied”), and requires no new Bidi control. It does require a change to UBA rules for embeddings. It provides the safest solution.
In absence of markup, the existing CS class between two numbers should ALWAYS resolve using the bidi class of these numbers (this means that “1.2” would always be considered as a single number, and “31/12” would always be a fraction).
To change this meaning (and the expected rendering order if the embedding paragraph is RTL), there's only one way: isolate the numbers in LRE..PDF, so that it prohibits the propagation of their strong directionality to the separating character of class CS.
The whole sequence “LRE(number)PDF” then externally has a weak direction, just like the surrounding CS characters, whitespace, and other punctuations, and all these runs will need to take their direction from the embedding paragraph.
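As an illustration of the existing behavior Verdy refers to, the sketch below applies a simplified form of UBA rule W4 (a single separator between two numbers of the same type takes on that type) to a list of bidi classes. The function names are made up for this summary, and rules W1-W3 and everything after W4 are left out, so this is only a rough model and not a UBA implementation.

```python
import unicodedata


def bidi_classes(text):
    """Return the Bidi_Class short name of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def apply_w4(classes):
    """Simplified W4: a single ES between two EN becomes EN; a single CS
    between two numbers of the same type (EN or AN) becomes that type."""
    out = list(classes)
    for i in range(1, len(classes) - 1):
        prev, cur, nxt = classes[i - 1], classes[i], classes[i + 1]
        if cur == "ES" and prev == "EN" and nxt == "EN":
            out[i] = "EN"
        elif cur == "CS" and prev == nxt and prev in ("EN", "AN"):
            out[i] = prev
    return out


# '/' is CS, so between European digits it is absorbed into the number.
print(apply_w4(bidi_classes("12/31")))
# Same with Arabic-Indic digits (AN), written here as escapes.
print(apply_w4(bidi_classes("\u0661\u0662/\u0663\u0661")))
```

In this model the separator never sees the embedding direction once it sits directly between two numbers of the same type, which is why Verdy wants the numbers isolated before the separator can follow the paragraph.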
In my opinion, encoding the proposed
ONE, TWO, LDM, SLASH, THREE, ONE
should render exactly the same thing as with existing controls in
LRE, ONE, TWO, PDF, SLASH, LRE, THREE, ONE, PDF
for representing a contextually commutative date “12/31”; and for the non-commutative fraction “12/31” you can protect the expected direction of fields with
LRE, ONE, TWO, SLASH, THREE, ONE, PDF
by including the slash within the embedded span, if the existing UBA is correctly implemented. This does not require any new character (LDM, or a duplicate SLASH), or any new Bidi class, or changing the UBA algorithm to treat CS characters specially.
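The two encodings above can be assembled mechanically. The following sketch only builds the stored character sequences; the helper names are invented for illustration, and the contextual reordering Verdy expects for the first form depends on his proposed change to how embeddings are treated, which no current UBA implementation is assumed to provide.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed(text, opener=LRE):
    """Wrap text in an explicit embedding (LRE by default) closed by PDF."""
    return f"{opener}{text}{PDF}"


def commutative_fields(fields, sep="/"):
    """Each field is embedded on its own; the separator stays outside,
    as in <LRE, ONE, TWO, PDF, SLASH, LRE, THREE, ONE, PDF>."""
    return sep.join(embed(f) for f in fields)


def fixed_order(fields, sep="/"):
    """Fields and separators share one embedding, as in
    <LRE, ONE, TWO, SLASH, THREE, ONE, PDF>."""
    return embed(sep.join(fields))


date = commutative_fields(["12", "31"])
fraction = fixed_order(["12", "31"])
print([f"U+{ord(c):04X}" for c in date])
print([f"U+{ord(c):04X}" for c in fraction])
```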
The example in UAX #9 section 5.6 is handled if the field separator is <PDF,HYPHEN MINUS,RLE> or <PDF,SLASH,RLE> or <PDF,FULL STOP,RLE> or <PDF,COLON,RLE>, and the leader is RLE and the trailer is PDF.
(Substitute RLE by LRE everywhere as you want: this is equivalent if the numeric fields are using European or Arabic digits, this may only change if a field is an abbreviation starting by a whitespace or variable character, or if it mixes LTR letters and Arabic digits).
This does not only concern date values; it may also apply to times, phone numbers, numeric identifiers that use separators such as social security numbers, indexes of TOC entries, various sub-classification schemes used in book libraries or even technical protocols (including DNS or LDAP names)…
LRE..PDF and RLE..PDF also have a bijective mapping with the well-known HTML "dir=" attribute of inline elements, when it gets mapped into the equivalent CSS style property that can map this dir= attribute with bidi embedding values, so that these Bidi controls (strongly not recommended in HTML) can be avoided completely. This means that a date visually rendered as “12/31/2011” in an LTR-only document can be formatted in HTML as:
<span dir="ltr">12</span>/<span dir="ltr">31</span>/<span dir="ltr">2011</span>
so that it will be reordered contextually as “2011/31/12” depending on the contextual direction of the inline text before it (or after it if there's no strong direction set by previous inline content within the same containment block or embedded span, and remaining in weak direction if there's no strong context at all in that block or span, so that the block or span will itself inherit the direction from a lower contextual embedding level, or from the default direction set by the document language if there is no more context).
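For generated content, the markup form above could be produced by a small helper such as the one sketched here. The function name and its parameters are hypothetical; it only emits the markup shown in the example and says nothing about how a given browser will reorder it.

```python
from html import escape


def date_spans(fields, sep="/", direction="ltr"):
    """Wrap each field in its own <span dir=...>, leaving separators outside."""
    return sep.join(
        f'<span dir="{direction}">{escape(field)}</span>' for field in fields
    )


print(date_spans(["12", "31", "2011"]))
```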
For this reason, I am convinced that, in absence of such embedding, the CS characters should always limit their context to only their immediate neighboring characters, so that “12/31/2011” will always keep the same direction of fields in all contexts, in absence of such markup by an external stylesheet, or Bidi controls, and that “12/31+2” will NEVER be reordered contextually (preserving the mathematical meanings of operators by which operands cannot freely commute)
Essentially I think LRE...PDF (and similarly for the other start bidi bracketings) should behave as if they had an inherent LDM (LEVEL DIRECTION MARK).
Edberg:
With Verdy’s proposed change to the directionality associated with an entire embedding (that it has weak or neutral directionality, instead of strong directionality associated with the specified embedding direction), this would appear to work.
Karlsson:
If one changes LRE etc. to have an inherent LDM *functionality*, an actual character for LDM is not needed, *nor* is a new bidi category needed. The function of an LDM character can then be achieved by <LRE, PDF> (or <RLE, PDF>, <LRO, PDF>, or <RLO, PDF>); note: empty string between the start and end bidi control codes.
I still think there are plenty of other reasons to go for a UBA v.2; also the change suggested here is probably best done in a UBA v.2 rather than in the current UBA.
3. Proposal Wordingham1, LRM and RLM instead of LDM:
LDM is superfluous; it can always be mimicked with combinations of LRM and RLM. The disadvantage of not having LDM is that the alternative rules are complex. European digits (EN) are extremely complicated, as one has to consider the preceding strong character - L, R, AL or LDM.
For the examples given: Embed the common separators such as ‘/’ in RLM...LRM. This would ensure that they took on the directionality of the embedding.
I have demonstrated to my satisfaction that text with LDM can be converted to text without LDM that should display the same, under the following debatable assumptions:
- The remaining neutrals prior to the application of Rule W7 in the UBA do not ligate or kern with non-neutrals.
- Non-displaying runs embedded within other runs have no effect on the display.
I can make the conversion tables available on request.
Karlsson: I’m not at all sure the suggested workaround works in general, and not just in a few examples.
Edberg: I am not convinced that LDM can always be mimicked by a combination of existing controls; see discussion below. Also, Wordingham’s substitution rules & tables (sent to me separately) state “W0 can actually be deferred to between W6 and W7 without changing anything.” I do not agree; resolution of LDM to LRM or RLM in W0 could affect, for example, the change of EN to AN in W0 (this effect might be an improvement, but it would be a change nonetheless). The proposed ALM will help address these issues.
a) For date example:
Edberg:
For an isolated instance of such a date, RLM...LRM around ‘/’ does work; having opposite strong directions on either side of the neutral forces it to take on the direction of the embedding level, by rule N2, and though the extra RLMs and LRMs will get moved around in layout depending on the embedding level, it does not matter because they are invisible.
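The role of rule N2 in this argument can be shown with a toy resolver for a single neutral. It is a sketch under simplifying assumptions: the function name is invented, only rules N1 and N2 are modeled, and the neighbors are assumed to be already resolved to L, R, EN or AN (AL is accepted and treated as R for convenience).

```python
import unicodedata

RLM = "\u200F"  # RIGHT-TO-LEFT MARK, bidi class R
LRM = "\u200E"  # LEFT-TO-RIGHT MARK, bidi class L


def resolve_neutral(before, after, embedding_dir):
    """Direction ('L' or 'R') taken by a neutral between two resolved classes.

    N1: same direction on both sides wins; EN and AN count as R here.
    N2: otherwise the neutral takes the embedding direction.
    """
    def side(bidi_class):
        if bidi_class == "L":
            return "L"
        if bidi_class in ("R", "AL", "EN", "AN"):
            return "R"
        raise ValueError(f"unexpected class: {bidi_class}")

    b, a = side(before), side(after)
    return b if b == a else embedding_dir


# '/' stored as RLM '/' LRM: opposite strong types on either side,
# so the slash follows whatever the embedding direction is.
before = unicodedata.bidirectional(RLM)
after = unicodedata.bidirectional(LRM)
print(resolve_neutral(before, after, "L"))
print(resolve_neutral(before, after, "R"))
```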
However, it does not handle the situation in which the date is part of other text, possibly preceded or followed by Arabic letters (with an intervening space); there are layout interactions between the Arabic letters and adjacent Arabic digits, since the digits are not treated as being part of a longer sequence due to the direction marks associated with the '/'. This can be solved by placing an LDM before and after the date, as well as before each '/'. However, using an RLM LRM sequence before and after the date causes the spaces around the date to reorder.
Wordingham:
The interaction between Arabic letters and Arabic digits that are part of the date occurs in a left-to-right embedding. The non-LDM solution in each case is to insert an LRM after the separating space.
Incidentally, if Arabic digits (AN, not EN) are used, the separators should be terminated by LRM on one side, not by both RLM and LRM.
The outer marks are to protect against adjacent R and AN in a *left-to-right* context:
Storage: ' ' LRM AN1 '/' LRM AN2 '/' LRM AN3 ' ' LRM
Display in left-to-right context: ' ' AN1 '/' AN2 '/' AN3 ' '
Display in right-to-left context (e.g. after AL in memory): ' ' AN3 '/' AN2 '/' AN1 ' ' (AL)
Actually, though, <LRM ' ' AN1 '/' LRM AN2 '/' LRM AN3 ' ' LRM> would be still better, as that also works if there is a preceding L in an RTL context.
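A minimal sketch of how the last storage form could be generated for an Arabic-digit date follows. The helper name is hypothetical, and the code only builds the stored sequence <LRM ' ' AN1 '/' LRM AN2 '/' LRM AN3 ' ' LRM>; it does not attempt to verify the resulting display.

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK


def guarded_date(fields, sep="/"):
    """Build LRM SP field1 sep LRM field2 sep LRM field3 SP LRM."""
    body = (sep + LRM).join(fields)
    return f"{LRM} {body} {LRM}"


# 09/16/2011 in Arabic-Indic digits, written as escapes.
fields = ["\u0660\u0669", "\u0661\u0666", "\u0662\u0660\u0661\u0661"]
stored = guarded_date(fields)
print([f"U+{ord(c):04X}" for c in stored])

# The formulation is meant for AN digits; this checks that assumption.
print(all(unicodedata.bidirectional(d) == "AN" for f in fields for d in f))
```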
However, if the text receiving calculated text is directional, I don't think it unreasonable to require that the receiving text separate calculated text from elements in the receiving text with the contrary direction. (In this context, the directionality of AN is contrary to left-to-right text.) After all, a comma-separated list of Hebrew words in left-to-right text would naturally be written with a separator sequence such as <U+200E LEFT-TO-RIGHT MARK, U+002C COMMA, U+0020 SPACE>. The commas are to the left of the spaces.
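The separator sequence for such a list (LEFT-TO-RIGHT MARK U+200E, then COMMA, then SPACE) could be applied as in the sketch below. The function name and the sample letters are illustrative only.

```python
LRM = "\u200E"  # LEFT-TO-RIGHT MARK
SEPARATOR = LRM + ", "  # <LRM, COMMA, SPACE>


def join_rtl_words(words):
    """Join right-to-left words for insertion into left-to-right text."""
    return SEPARATOR.join(words)


# Two placeholder Hebrew letter pairs (alef-bet, gimel-dalet).
stored = join_rtl_words(["\u05D0\u05D1", "\u05D2\u05D3"])
print([f"U+{ord(c):04X}" for c in stored])
```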
Therefore, I think it reasonable to require the guarding LRMs above to be provided as part of the receiving left-to-right text. Because the date uses Arabic digits, no protection is needed in right-to-left text. Protection would be needed if the date used 'European' digits, as it presumably would in Persian or Urdu text. ('EXTENDED' ARABIC-INDIC DIGITs are EN!)
Edberg:
While the last formulation above does appear to work in cases where spaces are needed, it is still the case that different formulations are needed depending on whether spaces are required. A date format using LDM as follows would work for all contexts (and with or without spaces):
LDM AN1 ‘/’ LDM AN2 ‘/’ LDM AN3 LDM
b) For UAX #9 section 5.6 example:
Edberg:
For the example in UAX #9 section 5.6, using RLM and LRM around the '-' causes reordering of the adjacent spaces, while using LDM before each '-' solves the layout problem.
Wordingham:
Of course, the problem of spaces is cured if one uses <RLM, SP, HYPHEN-MINUS, SP, LRM> as the bounding delimiter.
Verdy:
And no problem at all of ordering if the field separator is <PDF,HYPHEN MINUS,RLE> or <PDF,SLASH,RLE> or <PDF,FULL STOP,RLE> or <PDF,COLON,RLE>, and the leader is RLE and the trailer is PDF. (Per proposal Verdy2, see discussion above).
Edberg:
The Wordingham formulation above does appear to work for either embedding/paragraph direction (i.e. this text can be inserted without knowledge of the context into which it is being inserted).
4. Proposal Karlsson1, UBA V2 with heuristics:
UBA V2 proposal: Make begin/end paired (i.e. bracketing) punctuation work (better) out of the box:
g.c. Ps, default for Pi: Have an implicit LRE or RLE just after the character and an implicit LDM just before the character. The choice between LRE and RLE is made in a similar way to determining the default paragraph direction, but only looking at an appropriate substring, and with the surrounding direction as default.
g.c. Pe, default for Pf: Have an implicit PDF just before the character and an implicit LDM just after the character.
Whether a Pi or Pf character is regarded as begin or as end character should be possible to override by a higher level protocol (f.i. a locale setting, including language tagging for parts of texts, specifying begin/end quote marks; though there is a difficulty when the begin and end quote marks are the same character).
UBA V2 proposal: make segmenting punctuation work (better) out of the box: g.c. Pd (and more, see below): If both adjacent characters to the Pd character do not have strong directionality (perhaps limit to SPACE), have an implicit LDM on both sides of the Pd. IIUC, this should make the Pd behave as a TAB (bidi category S), but at the embedding level, rather than the paragraph level.
Rename S, “Segment Separator” (currently the characters TAB/HT, VT, and INFORMATION SEPARATOR ONE/US) to “Paragraph Level Segment Separator”; and refer to a character with implicit LDMs, but no implicit LRE/RLE/PDF, as an “Embedding Level Segment Separator”.
Higher level protocols should be able to set other characters (than Pd with no strong bidi character directly adjacent) as Embedding Level Segment Separators. For instance, for domain names and URLs, FULL STOP, SOLIDUS and a few more URL punctuation characters should be handled as Embedding Level Segment Separators (for this to work well, there needs to be an implicit LRE or RLE at the beginning of the URL and an implicit PDF at the end of the URL). One should also consider making COMMA, SEMICOLON, FULL STOP, QUESTION MARK, and similar characters (when followed by SPACE or an ending paired punctuation) into Embedding Level Segment Separators in general, now that most LRE/RLE/PDF are implicitly inserted by bracketing punctuation.
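To make the segmenting-punctuation heuristic concrete, here is a rough sketch of the detection step only: it finds characters of general category Pd whose immediate neighbors are not strongly directional. The function name is invented, a missing neighbor at the start or end of the text is assumed to count as non-strong, and nothing here models what the implicit LDMs would then do to the layout.

```python
import unicodedata

STRONG = {"L", "R", "AL"}


def implicit_segment_separators(text):
    """Indexes of Pd characters with no strong character directly adjacent."""
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) != "Pd":
            continue
        neighbors = []
        if i > 0:
            neighbors.append(text[i - 1])
        if i + 1 < len(text):
            neighbors.append(text[i + 1])
        if all(unicodedata.bidirectional(n) not in STRONG for n in neighbors):
            hits.append(i)
    return hits


print(implicit_segment_separators("abc - def"))  # spaced hyphen qualifies
print(implicit_segment_separators("abc-def"))    # hyphen inside a word does not
```

A higher-level protocol that nominates further characters (such as FULL STOP or SOLIDUS in URLs) could extend the category test with an explicit set of code points.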
It is important that this works in the vast majority of cases *without* the *explicit* insertion of invisible (bidi control) characters. The latter is highly user unfriendly, and is unlikely to work well with cut-n-paste.
I realize that going for a “bidi V2” is a major step, but I think it is called for for several reasons, particularly the ones mentioned above.
Note1: It is also possible to use a different analysis for special cases, e.g. domain names or URLs (if detectable somehow, e.g. via markup).
Note2: It is not the case that all bidi control characters can be avoided in all cases using my suggestion. But a great many cases, many that surprise users, would with the implicit bidi control approach work with much less surprise, and no need to insert explicit bidi controls (something which is not so easy).
Verdy:
And how will you define what is an “implicit” LDM? For example “1.2” has two interpretations (a single number with a decimal separator, fields of digits have a fixed relative order and the dot is part of that number; or a notation of two distinct numbers in fields separated by the dot, the fields being assumed to be displayed in the same direction as the embedding paragraph). Same thing about “31/12” (is it a date made of two fields to render in the embedding direction, or a fraction whose operands must NOT commute?)
As this is impossible to determine, I really think that in absence of markup, the existing CS class between two numbers should ALWAYS resolve using the bidi class of these numbers (this means that “1.2” would always be considered as a single number, and “31/12” would always be a fraction).
To change this meaning (and the expected rendering order if the embedding paragraph is RTL), there's only one way: isolate the numbers in LRE..PDF per proposal Verdy2.
Edberg:
Considering for the moment just the portion above dealing with “segmenting punctuation” (Pd etc.) and segment separators, since that is related to the intent of LDM: Karlsson suggests the addition of a new bidi class “Embedding Level Segment Separator”, which basically has the directional behavior of LDM, and could be applied to any character (for example '/' in the date example) as an override. This aligns with option 3 in the background document (and once there is an LDM-like bidi class, I think it is a small step to actually encode a character like LRM or RLM that has this new class). Mr. Karlsson further suggests that characters of general category Pd, at least if they are surrounded by space, should behave as if they have this “Embedding Level Segment Separator” class. This would mean that the example in UAX #9 section 5.6 would be handled without explicit LDM. This is an interesting idea, but I have not examined all of the implications.
5. Proposal Karlsson2, reinterpret S:
Another possibility, as long as we are just “brain-storming” a bit here, is to use the bidi category S (Segment Separator) for the LEVEL DIRECTION MARK (which would be a normally invisible (bidi) format control character). I.e. it would work just like TAB (as specified in the UBA), except that it wouldn't do tabbing. But then it would work only for the paragraph bidi direction. However, the idea that TAB (and the other bidi S characters) magically cuts through *all* nested bidi levels seems a bit strange to me... Going just to the closest explicit embedding/(override) level seems less drastic. Without formally subdividing “S”, one could treat different “bidi S” (old and new) to reset to different levels (to the embedding bidi level for the new one, and to the paragraph bidi level for the three old ones). (I know, this would be a form of “option 1” in the PRI.)
C. Opinions on UBA stability and PRI options
1. Opinion Freytag:
Adding controls would imply the creation of new bidi classes for them, and giving up stability to that degree would be a serious issue.
Stability is paramount for predictability. You need to be able to predict what your reader will see, and you will only be able to do that, when you can rely on all implementations agreeing on the details of how to lay out bidi.
Introducing any new feature now will result in decades of implementations having different levels of support for it. This makes the use of such a new feature unpredictable - and is a problem whether there was a formal stability guarantee or not.
2. Opinion Verdy0 and discussion:
The stability policy was published too soon, before evident problems were solved.
Freytag:
I disagree. True plaintext bidi will always be a compromise, because there's a lack of information on the intent of the writer. (In rich text, you can supply that with styles). There's a limited workaround with bidi controls, but that's beginning to be a form of minimal rich text in itself.
3. Opinion Verdy1:
Adding ELM or any control is effectively creating a new Bidi class, because it requires all existing UBA implementations to be updated to accept the new behavior. The ELM proposal breaks the stability promised for the bidi classes and UBA. None of the proposed 3 solutions are in fact acceptable if stability is required.
4. Opinion Verdy2 and discussion:
The stability of Bidi classes should only concern existing characters, it should not limit new characters (or new scripts) that may need new Bidi classes, as it would not break existing texts rendered in existing implementations. The stability is just meant to NOT break any bidi rendering of existing fonts that use assigned characters. For existing unassigned code points, there's simply never been any stability warrantied for any property, so you can assign the properties much more freely.
I am convinced that if you need new characters, the only good question is which ones?
- Duplicate the encoding of existing whitespace, punctuation, symbols to give them a different bidi class (using one of the existing classes). This is alternate proposal Verdy1 above.
- Encode new bidi controls, to which you assign new bidi classes. This does not break ANY existing text rendered with any existing renderers. Of course you'll need an updated renderer (but not new fonts), otherwise existing implementation will display a .notdef glyph and the user will know visibly that there's something in the encoded text which may be important to render the text correctly.
The second option is certainly the least disturbing (and the most economical in terms of encoding, and the most likely to be accepted without much trouble by voting NBs in WG2).
Edberg: This seems to contradict the statements in Verdy1 above, and suggests that you would have no objection to a new character LDM with a new bidi class.
It does not break the policy on ANY existing encoded texts. It gives NO surprise to users, or at least they know that something is missing, and their decision for what to do will be exactly like when they are presented newly encoded texts containing newly assigned characters for which they still don't have a supporting font or any support in their existing renderer for the complex shaping/layout features required by a newly encoded script.
In other words, the UTC policy about the stability of Bidi classes should be minimally relaxed, by rewording into something like the following, which is similar to the rule of immutability of other normative character properties of assigned code points (code point value, character name, decomposition mapping ...):
“The bidi class property value of any assigned code point is IMMUTABLE (and will never change for the same assigned code point in any subsequent versions of the UCS).”
I can accept that the full set of possible values for the general category is restricted and inextensible, because these categories are frequently used in algorithms where the GC is supposed to be fully partitioned with a constant number of elements (a fixed enumeration) for implementing lots of other algorithms or derived properties. But the Bidi class for characters is just meant for the rendering, and has no other use than implementing the UBA itself; it should never be used for any exclusive yes/no decision.
With respect to proposal Karlsson2: If you want to encode new characters, why would you restrict yourself to reusing an existing bidi class just to break it? Instead of speaking about the poorly defined concept of “splitting the bidi classes” - if you add a new bidi class for new characters, you effectively never split any existing bidi class, and you don't break the IMMUTABILITY rule I gave above.
Karlsson:
The stability guarantee says “The Bidi_Class property values will not be further subdivided.” I’m not too keen on the word “subdivided” here, but it (here) means there will be *no additions* to the set of values for the Bidi_class property. Not even for new characters.
As far as I can tell, there is no restriction saying that the bidi algorithm cannot look at code points as well as bidi category values.
Verdy:
That's absolutely not the way I understand it, notably if you consider the term “further”, which references what was done before, where subsets of characters that were listed in the same class have been later partitioned into separate classes, before the policy was adopted. I am not advocating changing the bidi classes already assigned to characters, just that currently unassigned characters are already outside of any one of these classes.
Wordingham: Re “The bidi class property value of any assigned code point is IMMUTABLE”: At most it should become immutable after being unchanged for, say, 20 years. It is unwise to prohibit correction.
Verdy: Not needed. Even if there's an error, it will be much better to re-encode the character with a new Bidi class, and not break the many texts already containing the character (note we're discussing widely used characters such as generic punctuation).
5. Opinion Verdy3:
Corrections should be made in the early stages of beta releases, or based on documents that lead to the initial encoding approval for encoding. This should come only as errata coming extremely fast (one or two months?), and caused only by discovered editorial problems which contradict the prior decision.
After 20 years, as you suggest, the cost for correcting the encoded documents would be excessive, and the correction will not be applied consistently before about the same time or more. This would mean at least about 30 years of instability, and lots of data losses in that period (which will be extremely hard to estimate in time and total cost supported by all users of the UCS).
[Per proposal Verdy2 , Verdy feels it is OK to change the current UBA so that the embeddings have a different effect on the direction resolution of characters outside the embeddings - i.e. the embeddings as a whole would not have strong direction. He prefers this to any of the other proposals, and feels that this would have less of an effect on stability and existing implementations. Not sure how this fits with the opinions expressed in Verdy1 and Verdy2 above. -mod.]
Proposal Karlsson1 breaks the UBA in a non-conforming and incompatible way. I'm now sure that LDM is not even needed if the UBA is implemented correctly [presumably per proposal Verdy2 - mod.]
If browsers already have problems in correctly implementing the UBA [as with the Verdy example in section A.1 -mod.], it will be even more difficult to convince browser authors to make new adjustments if you change the behavior of CS characters in the UBA… [presumably referring to proposal Karlsson1 -mod.]
But it will be far easier to convince them to accept a new character that uses one of the existing Bidi classes, even if the character is superficially the same (but you'll get strong opposition from WG2, which will be hard to convince if you want to disunify some characters with a duplicate encoding only for a distinct Bidi property).
Even better is to correct the UBA to get the expected full restoration of context by PDF, rather than adding a new LDM (and an associated new class) which will still require a change in the UBA to be effective, and that will also break the Unicode stability rule.
Edberg: Hmm, per opinion Verdy2 you were suggesting that adding a *new* character with a new Bidi class would *not* break the stability rule.
6. Opinion Karlsson1:
Option 1 [no new class, change UBA to handle LDM specially] is likely to result in implementations effectively defining their own additional classes, as noted.
Option 2 [define LDM for higher-level protocols only] is a no-go. I do see other uses for higher level protocols affecting bidi processing, though. But not this way, just for doing a one off circumvention.
Regarding option 3 [new UBA v2], it would also be an opportunity to do away with the difference between R and AL (and hence make ALM moot), and maybe also to remove the EN and ES bidi classes (for “V2”).
Indeed, if going by option 3, we could take the opportunity to improve the bidi algorithm (V2). Some things don't work as they should do “out of the box”. It is often said “just leave it to the bidi algorithm, it will do the right thing”. But it much too often does NOT do the right thing. Two major defects are detailed in Unicode Standard Annex #9, in sections 5.5 and 5.6. The bidi algorithm has glaring deficiencies that I think would be best handled by going for option 3 UBA v2, where these glaring deficiencies can be addressed; to a large extent by the use of *implicit* LDMs (and *implicit* LRE/RLEs and PDFs).
7. Opinion Karlsson2:
All the workarounds w.r.t. LDM depend on the directionality of neighboring characters, not directly on the embedding level direction. Therefore I think none of them will work properly in all cases (even though they may give the seemingly correct result in many cases). And they all require an inordinate amount of insertion of bidi control characters. (Much better to have *fewer* bidi control characters and still get a desirable display.)
Verdy: Marking the slash in “12/31” with LDM will not resolve any ambiguity. The only safe way is to use embedded levels.
D. Other issues with UBA
1. Issue Verdy1, weak direction embedding
I’d like to add a remark: we can embed strong LTR or strong RTL sequences (with LRE/RLE...PDF), but there's currently no way to embed sequences that should start with characters with weak direction (in fact all characters except letters: the RLE or LRE start is too strong, we would also need a WDE control, for Weak Direction Embedding, where the start of the internal substring would have its content adopt the direction of the first strong character in it, and if not found, would then inherit the direction of the outer text, recursively). Maybe we could emulate it using RLE,B..PDF or LRE,B..PDF (but with which B character?). I know this can be tricky, because the UBA currently starts by splitting paragraphs independently, dropping all contexts existing before them, before resolving direction levels.
I wonder how the WDE..PDF feature would be best supported (or emulated) if using existing Bidi controls (including by implicit insertions, for example between paired bracket punctuations Pi and Pf).
Edberg:
Deborah Goldsmith has also suggested a “native direction embedding” which is like LRE/RLE but uses the inferred primary direction of the embedded text. I will try to put together a proposal about this for the next UTC.
Karlsson:
And [native direction embedding] is indeed what I suggested for ‘(’, ‘[’ and other beginning punctuation (general category Ps and default for Pi) in my response to the PRI, *without* necessarily actually having a new *control character* for it. Ending punctuation (Pe and default for Pf) would in my suggestion act like bidi PDF. Note that the beginning and ending punctuation also must take on the current (surrounding) embedding directionality, which unfortunately LRE/RLE/PDF characters by themselves don't do in current UBA; and one must of course not do rule X9 for characters that aren't pure bidi controls.
2. Issue Verdy2, interlinear annotations:
I have another problem with the Bidi algorithm: it does not work correctly with interlinear annotations, for exactly the same reason. The text between IAS and IAT should be embedded internally with a weak direction, and the direction of the text after IAT should not depend at all on the text between IAS and IAT, but only on the text between IAA and IAS, or before the annotation anchor. So IAA can be ignored by the Bidi algorithm and IAT should behave like PDF, but which of the LRE or RLE classes should IAS use? IAS would also need the same class as the missing WDE described in the previous paragraph!
The only very poor way to create a sort of WDE would be to encode RLE or LRE immediately followed by a paragraph separator (CR, LF, NL, PS) whose vertical advance should be canceled in the rendering, because paragraphs always start with a weak direction for Bidi processing.
So, to handle interlinear annotations, IAS would be assigned two Bidi classes instead of just one:
- map IAA to the N class
- map IAS to the LRE or RLE class, immediately followed by the B class
- map IAT to the PDF class
And I'd like to propose that the WDE sequence be encoded as <LRE,NL>, with the intent that the NL (or CR or LF, or CR+LF or PS) would not be rendered with its implied vertical advance in the context of a previous LRE or RLE.
Neither proposal requires changes to the existing Bidi classes, but both still require changes in the UBA to handle these special (forgotten) sequences:
* WDE ::= (RLE | LRE) (LF | CR LF? | NL | PS) ; // Bidi class = (RLE|LRE), B // (to be used with PDF)
* IAA ; // Bidi class = N
* IAS ; // Bidi class = LRE,B
* IAT ; // Bidi class = PDF
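The grammar and class assignments listed above can be written down as a small Python sketch. It only restates the proposal as data plus a pattern matcher; it does not implement the UBA. The names `WDE_PATTERN`, `PROPOSED_CLASSES`, `proposed_classes` and `find_wde` are hypothetical, and "N" is kept exactly as written in the proposal.

```python
import re

# Interlinear annotation characters and the explicit bidi controls.
IAA, IAS, IAT = "\uFFF9", "\uFFFA", "\uFFFB"
LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"

# WDE ::= (RLE | LRE) (LF | CR LF? | NL | PS)
WDE_PATTERN = re.compile("[\u202A\u202B](?:\r\n?|\n|\u0085|\u2029)")

# Class sequences from the proposal; today these characters have a
# single neutral class, so this table is the suggested change only.
PROPOSED_CLASSES = {
    IAA: ["N"],
    IAS: ["LRE", "B"],
    IAT: ["PDF"],
}


def proposed_classes(ch: str):
    """Return the proposed class sequence for ch, or None if the
    proposal leaves the character's class unchanged."""
    return PROPOSED_CLASSES.get(ch)


def find_wde(text: str):
    """Return (start, end) index pairs of every proposed WDE sequence."""
    return [match.span() for match in WDE_PATTERN.finditer(text)]
```

For example, `find_wde("a" + LRE + "\r\n" + "b" + PDF)` reports the single span covering LRE, CR and LF, while an LRE followed directly by a letter is not treated as a WDE.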
Note that the equivalent of interlinear annotations in HTML is the ruby notation, and in CSS the absolutely positioned blocks with “display:inline-block;position:absolute”. These suffer the same problem in the UBA (note that ruby notations are frequently used to insert interlinear transliterations into a different script that may have a different direction than the script used in the annotated text).
Karlsson:
I would agree that INTERLINEAR ANNOTATION SEPARATOR should act (for bidi) as <weak/native directional embedding> (i.e. implicitly have LDM before, and <LDM, WDE> after), and INTERLINEAR ANNOTATION TERMINATOR should act (for bidi) like PDF (i.e. implicitly have <PDF, LDM> before and LDM just after the interlinear annotation terminator).
3. Issue Verdy3, supporting embeddings in higher-level protocols:
The UBA specification should be more specific about how to support the various Bidi embedding options in “higher-level protocols”, instead of using Bidi controls in plain-text, notably:
- in HTML with the dir= attribute, which may need to add an additional value dir="neutral";
- in CSS (adding to draft v3?) with an equivalent CSS property for embedding control, in combination with the default HTML stylesheet that usually maps the dir= attribute to such CSS properties;
- also for the SVG working group, where it uses “tspan” subelements in “text” elements, either with attributes, and/or with CSS styles also mapped from these attributes using appropriate selectors.
This would require investigation with the W3C, notably the HTML5, SVG and CSS working groups.