L2/13-118

To:        UTC

Re:        6.3 BIDI feedback review

From:        Mark Davis

Live:        http://goo.gl/5zNc8 

I reviewed the accumulated feedback on PRI #232. The relevant editorial changes have been made, but I think the following issues still need consideration by the committee.

Mati

Behdad

Karl

Marcin

Note: I did not review P. Verdy’s or C. E. Whitehead’s comments; I leave that to committee members.

Mati

b. Implementers may want to include more items in each entry, thus maybe replace "consists of" by "includes".

MD: We don’t specify the implementation; we just supply a logical algorithm. People are free to modify as long as they get the same results, so we don’t have to (and shouldn’t) specify every case where it would be possible to extend the algorithm.

SUGGESTION: N0 will act on level runs (and not isolating run sequences as currently specified).

Rationale: - cases b and c will behave similarly (no automatic pairing).

- if the author is smart enough to use embeddings (which justifies not pairing automatically in case b), he/she is no less smart in case c, so that he/she will take care of the level of parentheses if needed.

- simplified implementation of pairing

AG: Disagree, not changed. Isolates will be used frequently in dynamic text. The parenthesis algorithm needs to apply in such contexts. Embeddings could not be given parallel treatment since they do not form isolating run sequences. Solution is to use the isolates.

24) [...] Missing are the less-than and greater-than signs.
One important use case for pairing them is presenting XML or
HTML source code where tags and attributes are English, attribute
values may be anything, and the text between tags may also be of
any direction.

AL: I agree. Another important use case is email addresses like
"John Doe <john@doe.com>", which in RTL comes out with
the angle brackets mismatched.

While it is true that when used as less than and greater than
signs in math expressions, pairing these characters is inappropriate,
I think that it would be hard to come up with examples where not
only is a less than sign (used as such) followed by a greater
than sign, but applying the BPA to them would actually change the
display order.

AG: <> are not in the current version of BidiBrackets-6.3.0.txt, but we should discuss at UTC. I'm in favour of adding them.

MD: I am too, for reasons cited.

29) About the review note at the end of this section: I think that this is not the place to add more examples. In a normative document like this one, the role of the examples is to clarify the intent, not to justify it.

MD: I think examples always help, including examples that provide motivation. (I agree that we don’t want to be too “proposally” in the language, however.)

Behdad

Since we are making such drastic changes to bidi, I suggest we also
bump up the 61 limit.  I suggest either not specify a limit, or
something like "at least 253" kind of wording.

One of my concerns is that if, for example, a web browser ends up
using isolates or embedding characters when converting a div to text
copied to clipboard, then the deeply nested div structures of today's
web sites will make it feasible to reach the current 61 limit in a
realistic use case.

Not a huge deal, but given the computing resources of this decade,
it's just free to bump it up at least.

MD: The committee should consider.

Karl

It would be easier for implementers of TUS if a single uniform format were
adopted, and all new data files conformed to it.  And, that format should
require a minimum of effort to add to implementations.

The format of BidiBrackets.txt, for example, requires one to teach the
implementation that column 2 is one property and column 3 is another.  That is
extra work that could be avoided if the new files came in a format that didn't
require it.  An existing file with such a syntax is DerivedCoreProperties.txt.
That format could easily be adapted for non-binary properties, and many other
formats are possible.  But my point is that you should publish the files in
some such format to make it easier on implementers.  We are stuck with the
format of already-published files, but we can do better for future files.

Similarly, the now machine-readable @missings lines are inconsistent.
In BidiBrackets.txt it is

# @missing: 0000..10FFFF; <none>; n

Compare that to an @missings line in PropertyValueAliases.txt

# @missing: 0000..10FFFF; Bidi_Mirroring_Glyph; <none>

There are three columns in each, but the meanings of column 2 are
inconsistent.  One is a property name, and one is a property value.  And the
third column in one gives the default value for the property in column 2.  The
other gives the default for an unnamed property that has to be taught to the
implementation.  If new @missings lines followed the syntax from
PropertyValueAliases.txt, no teaching would be necessary.  In
BidiBrackets.txt, there would be two such lines, one for each property.  My
implementation already deals with the possibility of multiple @missings lines
per file, as several existing files have them.

[...]

The bottom line of what I was trying to say is that going forward, each new
data file should be in a form that doesn't require manual intervention to
specify to an implementor.  This could be because the format of the file has
each line contain only values for a single property, and includes that
property name; or there could be machine-readable comments that describe the
format of each entry, so that the file becomes self-describing.

Currently, one has to know the file's format in order to interpret the
@missings (supposedly) machine-readable line in this file.  In the past, I've
coped with this by using the @missings lines in PropertyValueAliases.txt, but
there is no @missings entry there for Bidi_Paired_Bracket_Type.  I presume
that is an oversight that will be fixed before final publication.  But, I
believe all @missings lines should look like those in
PropertyValueAliases.txt, with each containing full information, and not
depending on the format of the file they are contained in.
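
The difference between the two @missing styles quoted above can be illustrated with a minimal Python sketch. The function name parse_missing and its columns parameter are hypothetical and belong to no existing tool; the sketch assumes only the "range; field; field" layout of the two lines quoted in the comment.

```python
def parse_missing(line, columns=None):
    """Parse one '# @missing:' line into (code_range, {property: default}).

    columns is a hypothetical per-file hint: the property names that the
    value columns stand for. It is only needed for the positional style,
    where the line itself does not name the property.
    """
    prefix = "# @missing:"
    if not line.startswith(prefix):
        raise ValueError("not an @missing line: %r" % line)
    fields = [field.strip() for field in line[len(prefix):].split(";")]
    code_range, values = fields[0], fields[1:]
    if columns is None:
        # Self-describing style: range; property name; default value.
        if len(values) != 2:
            raise ValueError("expected 'range; property; default'")
        prop, default = values
        return code_range, {prop: default}
    # Positional style: the caller has to say what each column means.
    if len(columns) != len(values):
        raise ValueError("column hint does not match the line")
    return code_range, dict(zip(columns, values))


# The PropertyValueAliases.txt style needs no outside knowledge.
aliases_style = parse_missing(
    "# @missing: 0000..10FFFF; Bidi_Mirroring_Glyph; <none>")

# The BidiBrackets.txt style only makes sense with a per-file hint.
brackets_style = parse_missing(
    "# @missing: 0000..10FFFF; <none>; n",
    columns=["Bidi_Paired_Bracket", "Bidi_Paired_Bracket_Type"])
```

Without the columns hint, the BidiBrackets.txt line would be read as a property named "<none>" with default "n", which is the ambiguity the comment describes.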

MD: As I recall, the committee made a conscious decision to present the values in multiple fields (as we do in some other files), so that they are exactly parallel. Our formats do have to balance multiple, sometimes competing, goals:

  1. Maintain compatibility for parsers
  2. Be easy to read — for implementers
  3. Be easy to maintain — for us.

What we have ended up doing in such cases is to generate Derived files that have a simple format; we may want to consider that here as well.

I agree that the <none> is a problem. It is a pain to have to deal with data values that are either code points or sequences of code points, and have a <none> value. They don’t correspond nicely to APIs, especially for single code points, where you always want to return a primitive type.

As to the @missing, I think it should exactly mirror whatever data lines are in the file in terms of content. I have an action to look at that.

Marcin

2. I agree with Aharon Lanin that it should be made clear that all characters
with Bidi_Paired_Bracket_Type values Open or Close have bidi class ON (the
note at the end of rule N0, bullet d implies that, but it should be mentioned
explicitly); in fact, I think it ought to be a Unicode Stability Policy.

3. It might be worth mentioning (in the Implementation Notes section, perhaps)
that Rule N0 and the associated definition BD16 can be implemented without
actually creating a stack or list that BD16 calls for; such an implementation
would be slower, but could require less memory, which can be important for
embedded systems with limited RAM.

One way to implement BD16 with minimum memory requirement might be as follows:

* For each character with Bidi_Paired_Bracket_Type other than None, assign a
  status, one of: unresolved (initial value), resolved as paired, resolved as
  unpaired.  Note that if such characters are guaranteed to have bc=ON, the
  Bidi_Paired_Bracket_Type property and the status can be encoded by creating
  additional, ‘virtual’ bidi classes (which would behave as ON for all the
  other purposes).

* For each unresolved closing bracket, search backward until either sos or an
  unresolved opening bracket that forms a bracket pair with the closing
  bracket is found.  In the latter case, resolve both brackets as paired, and
  if there are any unresolved opening brackets enclosed within the pair,
  resolve them all as unpaired.  [Note: This corresponds to the 5 steps listed
  in BD16.]

* Once the previous step is complete, for each opening bracket resolved as
  paired, the matching closing bracket can be found by the following
  algorithm:  Initialize a counter to 1.  Scan forward the isolating run
  sequence, incrementing the counter for each opening bracket resolved as
  paired, and decrementing it for each closing bracket resolved as paired; the
  matching closing bracket is the first one that causes the counter to be
  decremented to 0.  (This would work because bracket pairs, as defined by
  BD16, may be nested, but cannot otherwise overlap.)

* Note that closing brackets do not have to be resolved as unpaired; as long
  as each is checked only once, those that are not resolved as paired can be
  left in the unresolved state.
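
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the three-entry PAIRS table stands in for the data in BidiBrackets.txt, the input string stands in for a single isolating run sequence, canonical equivalents are ignored, and all function names are hypothetical. The status list plays the role of the per-character status (or ‘virtual’ bidi classes) described in the first bullet.

```python
# Hypothetical stand-in for BidiBrackets.txt: opening bracket -> closing bracket.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}

UNRESOLVED, PAIRED, UNPAIRED = 0, 1, 2


def resolve_brackets(text):
    """Assign a status to every character without building a stack."""
    status = [UNRESOLVED] * len(text)
    for i, ch in enumerate(text):
        if ch not in CLOSERS:
            continue
        # Search backward for an unresolved opening bracket that pairs
        # with this closing bracket; running off the start plays sos.
        j = i - 1
        while j >= 0 and not (text[j] == CLOSERS[ch]
                              and status[j] == UNRESOLVED):
            j -= 1
        if j < 0:
            continue  # no partner: the closing bracket stays unresolved
        status[i] = status[j] = PAIRED
        # Unresolved opening brackets enclosed in the new pair can never
        # be paired any more.
        for k in range(j + 1, i):
            if text[k] in PAIRS and status[k] == UNRESOLVED:
                status[k] = UNPAIRED
    return status


def matching_closer(text, status, open_index):
    """Find the partner of a paired opening bracket with a counter."""
    counter = 1
    for k in range(open_index + 1, len(text)):
        if status[k] != PAIRED:
            continue  # only brackets resolved as paired are counted
        if text[k] in PAIRS:
            counter += 1
        else:
            counter -= 1
            if counter == 0:
                return k
    raise ValueError("open_index is not a paired opening bracket")


def bracket_pairs(text):
    """Return (opening index, closing index) pairs ordered by opening index."""
    status = resolve_brackets(text)
    return [(i, matching_closer(text, status, i))
            for i, ch in enumerate(text)
            if ch in PAIRS and status[i] == PAIRED]
```

For example, bracket_pairs("a(b[c)d]") returns [(1, 5)]: the "[" is enclosed by the pair formed at ")" and is resolved as unpaired, so the later "]" finds no partner. The backward search makes the first step quadratic in the worst case, which is the speed-for-memory trade-off mentioned in the comment.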

MD: The committee should consider whether this is worth adding to the Implementation Notes section.

Mark

Reference Implementation. Given the complexity of the new algorithm, I think it is incumbent upon us to have two independent reference implementations before we can release U6.3. Moreover, these must be tested against one another in a thorough “monkey test”, and we should recommend that any production-level implementation do the same. Merely extending the BidiTest file will not be sufficient.
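
As a rough illustration of what such a “monkey test” might look like, the Python sketch below feeds random strings to two implementations and reports the first input on which they disagree. Everything in it is an assumption made for illustration: the names monkey_test, pairs_stack and pairs_scan are hypothetical, and the two bracket-pairing functions are small stand-ins. A real test would compare the resolved levels and reordering produced by two complete, independently written implementations of the algorithm.

```python
import random

# Hypothetical stand-in bracket table (real data: BidiBrackets.txt).
PAIRS = {"(": ")", "[": "]"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}


def pairs_stack(text):
    """Stand-in implementation A: stack-based bracket pairing."""
    stack, found = [], []
    for i, ch in enumerate(text):
        if ch in PAIRS:
            stack.append(i)
        elif ch in CLOSERS:
            # Look down the stack for the nearest matching opener.
            for s in range(len(stack) - 1, -1, -1):
                if text[stack[s]] == CLOSERS[ch]:
                    found.append((stack[s], i))
                    del stack[s:]  # discard the opener and all above it
                    break
    return sorted(found)


def pairs_scan(text):
    """Stand-in implementation B: backward scan, no stack."""
    resolved, found = set(), []
    for i, ch in enumerate(text):
        if ch in CLOSERS:
            for j in range(i - 1, -1, -1):
                if text[j] == CLOSERS[ch] and j not in resolved:
                    found.append((j, i))
                    resolved.update(range(j, i + 1))
                    break
    return sorted(found)


def monkey_test(impl_a, impl_b, alphabet, iterations=10000, max_len=12,
                seed=0):
    """Return the first random input on which the results differ, else None."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(iterations):
        length = rng.randint(0, max_len)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        if impl_a(text) != impl_b(text):
            return text
    return None


disagreement = monkey_test(pairs_stack, pairs_scan, "()[]ab")
```

A return value of None means no disagreement was found in the sampled inputs, which is evidence of agreement rather than proof; a returned string is a minimal starting point for debugging one of the two implementations.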