L2/09-315

Request for Corrigendum to UAX#14, Unicode Line Breaking Algorithm

Submitted for UTC consideration by: Asmus Freytag
September 22, 2009

Issue a corrigendum to UAX#14, applicable from version 3.0.0, changing rule LB8 from

LB8.    Break after zero-width space.

ZW ÷

to

LB8.   Break before character following a zero-width space, even if one or more spaces intervene.

ZW SP*  ÷

Note: the line break classes in this rule are ZW, which contains only the ZWSP character, and SP, which contains only the SPACE character; both are single-character line break classes. As usual, ÷ means that a break is allowed at that location, and * means "zero or more".

For the effect of the corrigendum and why it is needed see the following background information.

Background

Eric Muller has discovered a bug in UAX#14. When, 10 years ago, UTC created the rule that's now LB8, UTC didn't correctly resolve the interaction with rule LB7 (x SP).

This has three consequences:

1) As Eric reported:

Consider the input ZW CL, and let's determine if there is a break before the CL. Rule LB8 provides the answer: ZW ÷ CL.

Consider the input ZW SP CL, still for the position before the CL. This time, LB13 provides the answer: ZW SP x CL.

The same applies if one replaces CL by EX, IS or SY (the other things equivalent wrt LB13) or WJ (handled similarly in LB11).

Note that line break rules are always applied in ascending order. Therefore, rule LB7 (x SP) prevents a break before the SP, and it does so with higher priority than LB8 (ZW ÷), which would otherwise allow a break after the ZW. (÷ means break allowed; x means no break.)

As currently written, these rules would mean that adding a SP to a string would *remove* a line break opportunity in this example. That's completely counterintuitive, and clearly a bug in the specification.
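The two inputs above can be checked mechanically. The following is a minimal Python sketch, assuming only the rules named in this document (LB7, LB8, LB11, LB13 and LB18) in their current form and exactly as cited here. The function name `break_allowed_before` and the representation of text as a list of line break class names are illustrative choices, not part of UAX#14.

```python
# Sketch of the current rules (before the corrigendum), restricted to the
# rules discussed in this document. Rules are tried in ascending order and
# the first one that applies decides the outcome.

LB13_CLASSES = {"CL", "EX", "IS", "SY"}

def break_allowed_before(classes, i):
    """Return True if a break is allowed between classes[i-1] and classes[i]."""
    before, after = classes[i - 1], classes[i]
    if after == "SP":              # LB7:  x SP
        return False
    if before == "ZW":             # LB8 (current form): ZW ÷
        return True
    if after == "WJ":              # LB11: x WJ
        return False
    if after in LB13_CLASSES:      # LB13: x CL, x EX, x IS, x SY
        return False
    if before == "SP":             # LB18: SP ÷
        return True
    raise ValueError("pair not covered by this sketch")

print(break_allowed_before(["ZW", "CL"], 1))        # True:  ZW ÷ CL
print(break_allowed_before(["ZW", "SP", "CL"], 2))  # False: ZW SP x CL
```

At the position before the CL in ZW SP CL, the class immediately before the position is SP, so the current LB8 never gets a chance to apply and LB13 decides the outcome.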

2) As Eric further noted, "that mechanism [i.e. SP removing a line break opportunity] does not exist [in the pair table implementation], hence the example pair table does not implement the rules."

As far as I know, this is the only unreconciled difference between the pair table and the rules, and it's not by design. The intent is to have these two agree with each other, and also to have the ZW always result in a break opportunity.

3) Creating the corrigendum as proposed would allow any implementations that have followed the pair table to claim conformance to the proper version of UAX#14 with the corrigendum. This is especially useful as what they implement, which matches the proposed behavior, is more in line with user expectations.

What's the best way to reconcile this?

A simple addition of SP* to LB8 would reconcile the rules with the pair table and at the same time replace the strange, counterintuitive behavior by something more regular and in keeping with the rest of the design.

Proposed Fix:

Change LB8 from ZW ÷ to ZW SP* ÷

What will that do?

  1. The new rule stipulates that when you insert a ZW into a line of text, you will get your line break, but it takes effect at the end of any adjacent run of spaces. That's consistent with how all other line break opportunities work in the algorithm (except hard line breaks).

  2. Unlike the current formulation, where the SP* is missing, there will be no interference from lower priority rules. This is important, because ZW was promoted, in Unicode 3.0, from an ordinary break opportunity with lower priority to a high priority override for all characters covered in rules LB9 to the end.

  3. Making this change reconciles the rules with what the line break table has implemented all along. (It is noted in the pair table in section 7 of the UAX with a "_" at the intersection of the ZW row with the CL, EX, IS, SY and WJ columns.)
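To illustrate why a pair table already behaves as the proposed rule describes, here is a hedged Python sketch of a much simplified table driver. Only the "_" entries in the ZW row come from the description above; the AL row, the "^" (break prohibited) entries in it, the dictionary layout and the name `find_breaks` are assumptions made for contrast and illustration. A real driver also handles further entry types, which are omitted here.

```python
# Hypothetical excerpt of a pair table: PAIR[before][after].
#   "_"  break allowed
#   "^"  break prohibited, even across spaces (assumed notation)
PAIR = {
    "ZW": {"CL": "_", "EX": "_", "IS": "_", "SY": "_", "WJ": "_"},
    # Assumed contrast row, not taken from this document:
    "AL": {"CL": "^", "EX": "^", "IS": "^", "SY": "^", "WJ": "^"},
}

def find_breaks(classes):
    """Return every index i where a break is allowed before classes[i]."""
    breaks = []
    last = classes[0]              # most recent non-space class
    for i in range(1, len(classes)):
        cls = classes[i]
        if cls == "SP":
            continue               # spaces are skipped; 'last' is unchanged
        if PAIR[last][cls] == "_":
            breaks.append(i)
        last = cls
    return breaks

print(find_breaks(["ZW", "CL"]))              # [1]
print(find_breaks(["ZW", "SP", "SP", "CL"]))  # [3]
print(find_breaks(["AL", "SP", "CL"]))        # []
```

Because the driver skips spaces and looks up only the class before the run of spaces, the ZW row yields the same answer with or without intervening spaces, and the break lands after the run.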

Implications

There are conformance and practical implications.

The practical implications are limited, because the current rules, applied literally, exhibit counterintuitive behavior that, furthermore, occurs only in rather contrived contexts. Such behavior is rather unlikely to be an outcome deliberately desired by a document author. ZW is typically applied between letters of some sort, not in front of space characters that are followed by closing or terminal punctuation. Whenever it is applied, the intent is to cause a break, which the current rules don't allow in these contexts.

The formal conformance implication is that all existing implementations, from 3.0, that were based on the pair table, while doing the "right thing", are formally non-conformant. Further, such implementations can't be cheaply made conformant, because the pair table can't express the concept of "add a space to remove a line break opportunity" without redesign of the driver code and table architecture.

A corrigendum gives these existing implementations a formal conformance target.

For rule-based and regex-based implementations, implementing the proposed fix is a localized change.
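As a rough illustration of how localized the change can be, the sketch below encodes the rules cited in this document as an ordered list of regular expressions and swaps only the LB8 pattern. The string encoding (class names separated by spaces, with "|" marking the candidate break position), the names `OLD_LB8`, `NEW_LB8`, `make_rules` and `break_allowed`, and the restriction to five rules are all assumptions of this sketch, not a description of any existing implementation.

```python
import re

OLD_LB8 = r"\bZW \|"            # ZW ÷
NEW_LB8 = r"\bZW (?:SP )*\|"    # ZW SP* ÷

def make_rules(lb8_pattern):
    """Ordered (pattern, break_allowed) pairs; the first match wins."""
    return [
        (r"\|SP$", False),                # LB7:  x SP
        (lb8_pattern, True),              # LB8:  old or proposed form
        (r"\|WJ$", False),                # LB11: x WJ
        (r"\|(?:CL|EX|IS|SY)$", False),   # LB13: x CL, x EX, x IS, x SY
        (r"\bSP \|", True),               # LB18: SP ÷
    ]

def break_allowed(before, after, rules):
    """Decide the break between the class list `before` and the class `after`."""
    text = "".join(c + " " for c in before) + "|" + after
    for pattern, allowed in rules:
        if re.search(pattern, text):
            return allowed
    raise ValueError("context not covered by this sketch")

old, new = make_rules(OLD_LB8), make_rules(NEW_LB8)
for c in ["CL", "EX", "IS", "SY", "WJ"]:
    # Each line prints: <class> False True
    print(c, break_allowed(["ZW", "SP"], c, old), break_allowed(["ZW", "SP"], c, new))
```

Only the second entry of the rule list differs between `old` and `new`. In both variants LB7 still prevents a break directly before the SP, so the opportunity created by the ZW takes effect after the run of spaces.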

Interactions with other Rules

The proposed corrigendum changes LB8 from ZW ÷ to ZW SP* ÷, so it is necessary to investigate the interactions with all the other rules up to LB18, which is the one that handles all other breaks after SP. That rule (LB18) is SP ÷, so there's no interaction with either new or old LB8. (See the appendix below on "How to verify interaction between rules".)

The rules where there are interactions are the ones cited by Eric, LB11 and LB13. Those are the only ones with higher priority than LB18 where there is a leading "x" in the rule (for example x CL or x WJ).  Those two rules, LB13 and LB11, in interaction with LB7 describe the contexts that should be affected by this change, so that interaction is by design.

All other rules are unaffected. Either they occur below SP ÷, or they don't start with 'x'.

Other options investigated

Move LB8 before LB7 (that is, renumber it to LB6a).

This option is inferior on three important counts.

First: it would allow a break opportunity before a SP character. This would add a new design element, because spaces are otherwise elided when they occur at a line break opportunity. The only way to break a line before a space is by using a hard line break. However, hard line breaks are not break opportunities, but mandatory breaks, which break a line regardless of whether it would fit within the margins.

With ZW, you do not get a hard break (which always takes effect), but a break opportunity (which only manifests itself when you need to wrap the line there).

With the alternative option, there would be instances where lines wrap and where the second line inexplicably starts with a space, or run of spaces, just because there's a ZW. This sounds like a cool "feature" but it really goes against the whole tradition and rationale for line wrapping.

Lines are broken, because they don't fit the margins. When a line has spaces at the line break point, the spaces are elided (as if they were removed, or left hanging over the margin invisibly). The new line starts with the first non-space character. The alternative option would introduce fundamentally new behavior, not because it's needed, but merely to fix an arcane bug in the rules.

Second: It violates the bug fixing equivalent of Occam's razor. The bug is that "adding a space removes a line break opportunity" in a few, limited contexts. There's no need to suddenly support entirely new break opportunities, as in sequences like:


ZW ÷ SP ÷ ZW ÷ SP ÷ ZW

Third: As proposed at the top of this document, the new rule can be implemented by the pair table. In fact, it has been implemented by the pair table since 3.0. Changing LB8 so it becomes ZW SP* ÷ and issuing a corrigendum would bring both specifications (rules and table) into alignment. In contrast, the alternative cannot be realized with the pair table without making some substantial addition to the pair table architecture.

The reason for that limitation is that the pair table is based on the underlying design concept of always eliding spaces at line breaks. (It is more than likely that any other implementation architecture that handles SP explicitly as a special case would be adversely affected by any reordering of LB7 and LB8.)

Conclusion

After investigating the reported bug and the alternative options, including a detailed discussion with Eric Muller, Andy Heninger, and Mark Davis, I recommend the corrigendum proposed above.

Appendix: How to verify interaction between rules

A rule ending in ÷ overrides any later rule starting with x, but can't affect earlier rules, or later rules that are of the form B x A or B ÷ A, or of the form B ÷.

Rule LB18 (SP ÷) already allows a break after SPACE, meaning it overrides any later rule that might start with x. So we only need to look at rules LB9 - LB17.

Among rules LB9 - LB17 there are two that start with x. Those are the ones Eric gave in his bug report.

They are LB11 (x WJ), and rule LB13, which is

x CL

x EX

etc.

For any class C in either of these rules, we now have (in 5.2.0)  ZW ÷ C but also ZW x SP x C. The latter is the part that is counterintuitive and should be fixed.

All other rules and character classes are unaffected by the proposal.