Date: 10/11/05                            L2/05-314

Source: Asmus Freytag

This document provides comments on Mark's proposals re: linebreaking. It is based on the text of an advance copy he sent me; apologies if the L2 version has some last-minute changes. The original document is shown indented and with sidebars. My comments and recommendations are shown in blue text.

Re: Linebreak Issues
From: Mark Davis
Date: $Date: 2005/10/11 00:40:19 $

Andy came across the following page: http://www.cs.tut.fi/%7Ejkorpela/html/nobr.html

Mark proposed:

1. Don't allow "-case" to break after the hyphen, leaving a lone hyphen at the end of a line

Add a rule:

15b Break after a hyphen, but only if it is in a word.

(AL | NU) (HY | BA) ÷ (AL | NU)
(HY | BA) ×

However, BA is an odd class, containing spaces and tabs, but also hyphen characters. The spaces really shouldn't enter into this behavior, so we should split the class into:

BS spacy stuff

BA hypheny stuff

The only resulting change is that 15 line 1 changes from × BA to

× (BS | BA)

There are two problems with making this change:

  1. While it may be 'ideal' to not break words with leading hyphen, it contradicts widespread legacy practice. Word will cheerfully put an isolated hyphen at the end of a line. There is no realistic way that MS could change this behavior, even if they agreed that it would be desirable, so introducing this change would only serve to put Unicode at odds with actual practice.
  2. As proposed, the rule adds a three character context. A three character context cannot be modeled with a pair table, unless the middle part of the context is SPACE (0x20).

Therefore, I recommend that this proposal be rejected for the default algorithm - however, I have nothing against noting this as a possible tailoring.
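The second problem can be made concrete with a small Python sketch. The table entries and the names `PAIR_TABLE`, `pair_break_allowed` and `rule_15b_break_allowed` are invented for illustration only; they are not taken from UAX #14 or from any implementation. The sketch assumes a pair table keyed on the two classes around a position and shows that the proposed rule needs one more class of context than such a table can see.

```python
# Hypothetical, heavily reduced pair table:
# (class before, class after) -> is a break allowed between them?
PAIR_TABLE = {
    ("AL", "HY"): False,  # no break before a hyphen
    ("HY", "AL"): True,   # break opportunity after a hyphen
}

ALNUM = ("AL", "NU")

def pair_break_allowed(before, after):
    """A pair table only sees the two classes around the position."""
    return PAIR_TABLE[(before, after)]

def rule_15b_break_allowed(before_before, before, after):
    """Proposed 15b: break after HY/BA only when it sits between AL/NU."""
    if before not in ("HY", "BA"):
        raise ValueError("this sketch only covers positions after HY or BA")
    return before_before in ALNUM and after in ALNUM

# "well-known": AL HY | AL, the hyphen is inside a word.
in_word = rule_15b_break_allowed("AL", "HY", "AL")
# " -case": SP HY | AL, the hyphen starts the word.
leading = rule_15b_break_allowed("SP", "HY", "AL")
# The pair table is asked the same (HY, AL) question in both cases.
pair_view = pair_break_allowed("HY", "AL")
```

Under these assumptions `in_word` and `leading` differ although the pair around the break position is identical, so a single table cell cannot hold both answers.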

Mark proposes:

2. Opening and closing shouldn't break from alphanums

Example: person(s)
Example: »wie hier«

So add a rule

8a Don't break between alphanumerics and opening or closing punctuation

(AL | NU) × OP
CL × (AL | NU)

Two things should be noted for this proposal

  1. The legacy practice for this situation is as follows: Word keeps "person(s)" together, IE does not.
  2. The second example is incorrect, as the property for the guillemet is QU, which already behaves as requested.

Therefore, the second example should be removed before considering the proposal. As the legacy practice is split on this issue, I could support changing the default algorithm in this instance.

However, we could also implement this as a recommended tailoring. This would have the advantage that we wouldn't shift the ground from underneath those implementations that have implemented the algorithm (or nearly so).

Mark proposes:

3. The characters ° and % shouldn't break from following alphanumerics

Example: °C
Example: %E0 (used in URLs)

Note that IE doesn't allow the second break. Are there other customizations that IE makes that we should look at?

This is part of a broader problem that the UTC asked me to look at. The problem is that right now the PR and PO stuff really has to be overridden per language. These two classes are disjoint, and only one of them can have currency symbols and other numeric stuff in them. I have the choice of not breaking "$123" or not breaking "123$"; of not breaking "-12" or not breaking "12-" (with real minus sign). This is really unexpected behavior for users, who would expect none of these to ever break.

The purpose for separating PR and PO is only because of ideographs, which wouldn't use spaces. So I suggest having the main numeric "keep together" rules just use both in either position. That is, change:

 PR ? ( OP | HY ) ? NU (NU | SY | IS) * CL ? PO ?

to

(PO | PR) ? ( OP | HY ) ? NU (NU | SY | IS) * CL ?   (PO | PR) ?

and change

CL × PO
NU × PO
PR × OP
PR × NU
PR × HY
PR × AL

To

(CL | NU) × (PO | PR)
(PO | PR) × (OP | NU | HY | AL)

The legacy practice here is mixed. Again, Word treats both $ and % as glueing to adjacent alphanumerics, in fact, even to adjacent ideographs, while IE follows the existing linebreak algorithm.

Again, the proposed change is reasonable, and since there are differences in actual implementations, it would seem OK to make this change.

However, if there are implementations that are 'conformant' today, we might consider making this a recommended tailoring instead. The difference is that it does not withdraw conformant status from existing implementations.
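To see what the change to the numeric "keep together" expression does in practice, the two expressions can be compared with a short Python sketch. It assumes that text has already been converted to a list of line breaking classes, with $ as PR and % as PO. The helper names `compile_classes` and `keeps_together` are hypothetical, and the sketch covers only the regular expression, not the accompanying pair rules.

```python
import re

def compile_classes(pattern):
    """Turn an expression over two-letter class names into a regex over 'XX ' tokens."""
    compact = pattern.replace(" ", "")
    # Each class name becomes a group that matches the name followed by one space.
    return re.compile(re.sub(r"[A-Z][A-Z0-9]", lambda m: f"(?:{m.group(0)} )", compact))

CURRENT_RULE = compile_classes("PR ? ( OP | HY ) ? NU (NU | SY | IS) * CL ? PO ?")
PROPOSED_RULE = compile_classes(
    "(PO | PR) ? ( OP | HY ) ? NU (NU | SY | IS) * CL ? (PO | PR) ?"
)

def keeps_together(rule, classes):
    """True if the whole class sequence is covered by the expression."""
    return rule.fullmatch("".join(cls + " " for cls in classes)) is not None

dollar_first = ["PR", "NU", "NU", "NU"]   # "$123"
dollar_last = ["NU", "NU", "NU", "PR"]    # "123$"

print(keeps_together(CURRENT_RULE, dollar_first), keeps_together(CURRENT_RULE, dollar_last))
print(keeps_together(PROPOSED_RULE, dollar_first), keeps_together(PROPOSED_RULE, dollar_last))
```

With the current expression only the prefix form is covered; with the proposed one both orders are, which is the behavior the proposal argues users expect.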

Mark comments:

4. URLs                                   

Mr. Korpela complains about URLs not breaking well in browsers. Except for the %xx problem referenced above, I think we are in fairly good shape. Here are the breaks in a sample URL (| marks the break opportunities).

http://|www.cs.tut.fi/|%|7Ejkorpela/|html/|nobr.html?|abcd=high&hijk=low#anchor|

So the issue here is just that the browsers aren't following the UAX, so long URLs aren't wrapping.

The main difference is that they don't allow break after '/'.

Note, I would regard any improved treatment of %xx for URLs as a coincidental side effect. My position on URLs is that they should be recognized and tailored explicitly by the application. The reason for that is that fashions in addressing are likely to change much more rapidly than other features of common orthographies, so building URL awareness into the default algorithm should be a non-goal.

Mark observes:

5. Conformance

I looked at conformance in detail, and here are what I think the problems are:

A. The spec says in its conformance section:

All line breaking classes are informative, except for the line breaking classes marked with a * in Table 1 Line Breaking Properties. The interpretation of characters with normative line breaking classes by all conforming implementations must be consistent with the specification of the normative property.

Conformant implementations must not tailor characters with normative line breaking classes to any of the informative line breaking classes, but may tailor characters with informative line breaking classes to one of the normative line breaking classes.

Higher-level protocols may further restrict, override, or extend the line breaking classes of certain characters in some contexts.

But the conformance clauses themselves do not limit the overrideability of values.

B. The following is listed as normative, and thus not overrideable ( http://www.unicode.org/reports/tr14/#Definitions  ).

Table 1: Line Breaking Classes (* = normative) 

BK *    Mandatory Break    NL, PS

...

SG * Surrogates

...

CB * Contingent Break Opportunity    Inline Objects

But then LB1 specifically says:

LB 1  Assign a line breaking class to each code point of the input. Resolve AI, CB, SA, SG, and XX into other line breaking classes depending on criteria outside the scope of this algorithm.

SG -- this class is a holdover from when the algorithm was defined in terms of unaware UTF-16 support. SG is essentially an error condition, but we don't require implementations to flag this error. Rule 1 puts the behavior of unpaired surrogates outside the scope of the algorithm, while allowing implementations to define the fallback behavior for them. That is the intended normative behavior for SG.

CB -- is the object replacement character. It's a weird puppy. 'nuff said. Actually, its function is to behave in a way that's outside the scope of the algorithm, but it's not an error condition, so that's the reason it's not part of class SG.

And BK includes the following, which explicitly allows choice:

“NEW LINE FUNCTION (NLF)”

New line functions provide additional explicit breaks. They are not individual characters, but are expressed as sequences of the control characters NEL, LF, and CR. What particular sequence(s) form a NLF depends on the implementation and other circumstances as described in [Unicode] Section 5.8, Newline Guidelines.

BK -- Originally this contained the two special characters PS and LS, which are untailorable BK, together with CR, LF, NEL and FF. The original design left open to choice how to treat legacy use of control codes for line ending. Currently, the text quoted above makes no sense, as CR, LF and NL have been moved out of the BK class (and new rules have been written for CR, LF and NL that enshrine the newline *guidelines* as normative behavior that cannot be overridden).

Independent of the rest of the analysis the text above is a holdover and needs to be adjusted to the new membership of BK.

The other normative classes not yet mentioned are:

GL - non-breaking characters

WJ -- Word Joiner

ZW -- Zero width space (word breaker)

In all cases, these characters were designed to influence linebreaking behavior. To ensure that they are implemented as intended, these classes are considered normative.

C. Moreover, clearly I have to be able to override non-normative stuff by splitting classes or adding classes. What does it really mean to say that a property value is normative? Does that mean that its interactions with every other value, normative and informative, must be [as] described in the rules? Or only with other normative values? For example:

Rule 8 says: × CL

Rule 12 says: SP ÷

As currently intended, this means that the normative behavior of SP is that there is always a break after it, unless overridden by the "don't break before me" property of another character. The aspect that we are trying to model here is that it is SP and not some other character that serves as the generic "allow a break by default" character. Tailoring SP so that it would act like a letter, or like NBSP, is the kind of scenario that we feel should not be allowed.

What does SP being normative mean in these cases? What happens if I tailor CL to contain a character that it didn't before, or exclude a character it didn't have? That would change how SP breaks?

It does and it doesn't. As we are allowing characters to not allow a break before them, this situation is fine (even if not elegantly stated). In some sense, this is an artifact of re-expressing the algorithm (which was originally conceived and implemented via pair table) as a set of cascading rules. In the pair table, the special role of SP is built into the infrastructure, so to speak.

The problem is that any mixture of the "normative" and specific "informative" values is a real problem. The rules with only normative values are ok. But because the rules are done in order, any rule with an informative value followed by another rule with a normative value is a problem (subject to allowable rearrangement).

("Allowable rearrangement"? What the heck is he talking about? It is that if I have any list of rules that have the same break status (× or ÷), their order doesn't matter. So we could reorder all the normative rules to be ahead of informative rules if they have the same break status. But Rule 8 and 12 don't.)
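The rearrangement argument can be checked mechanically. The following Python sketch assumes a simplified cascade in which the first rule matching the pair of classes around a position decides the outcome; the tuple representation, the tiny class list and the names `decide` and `same_everywhere` are illustrative assumptions, not part of the specification.

```python
from itertools import product

NO_BREAK, BREAK = "no break", "break"
CLASSES = ["AL", "CL", "EX", "SP"]  # small illustrative subset

def decide(rules, before, after, default=BREAK):
    """Return the status of the first matching rule (None matches any class)."""
    for left, status, right in rules:
        if left in (None, before) and right in (None, after):
            return status
    return default  # assumed fallback when no rule matches

def same_everywhere(rules_a, rules_b):
    """True if both orderings agree for every pair of classes."""
    return all(
        decide(rules_a, before, after) == decide(rules_b, before, after)
        for before, after in product(CLASSES, repeat=2)
    )

rule_8_cl = (None, NO_BREAK, "CL")  # rule 8: no break before CL
rule_8_ex = (None, NO_BREAK, "EX")  # rule 8: no break before EX
rule_12 = ("SP", BREAK, None)       # rule 12: break after SP

# Two rules with the same break status: swapping them changes nothing.
same_status_swap = same_everywhere([rule_8_cl, rule_8_ex], [rule_8_ex, rule_8_cl])
# Rule 8 and rule 12 have different status: the order decides "SP CL".
mixed_status_swap = same_everywhere([rule_8_cl, rule_12], [rule_12, rule_8_cl])
print(same_status_swap, mixed_status_swap)
```

The pair (SP, CL) is where the two orderings of rule 8 and rule 12 disagree, which is why these two rules cannot be freely rearranged.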

Moreover, all word processors allow people to override breaks; not breaking within stuff that has spaces, breaking within what would otherwise be words. We have to reflect this.

Using normative/informative values just doesn't work. I only see two real possibilities (and favor the first):

  1. Allow arbitrary tailoring: dump the text that tries to make a distinction between normative and informative, and say that all the properties are overrideable normative properties.
  2. If we want to rescue some limitations, say that the rules 3a, 3b, 3c, 4, 5 are normative: any conformant linebreak implementation has to break or not break at the places specified by these rules with untailored property values. Otherwise implementations are free to break/not-break where they want.

The fact that Mark needs to refer to "untailored" here is indicative of a problem with his approach; see below.

LB 3a  Always break after hard line breaks (but never between CR and LF).

BK !

The "but never" clause is now obsolete as it is now accomplished by rule 3b.

LB 3b  Treat CR followed by LF, as well as CR, LF and NL as hard line breaks.

CR × LF

CR !

LF !

NL !

In the newline guidelines these were guidelines. Here they have become normative and non-tailorable. For the sake of an example, I can see nothing wrong with writing an application that does not support the IBM NEL line ending convention. However, I agree that by default, supporting all legacy conventions makes for a robust implementation.

LB 3c  Do not break before hard line breaks.

× ( BK | CR | LF | NL )

LB 4  Do not break before spaces or zero-width space.

× SP

× ZW

LB 5  Break after zero-width space.

ZW ÷

The one other step we would have to take would be to disallow tailoring of BK as listed in the text. Either that, or break out the BK rules above so that they can be non-normative.

Why pick these rules? Because they don't involve any informative values, and do represent some sort of 'hard' conditions.

This set of rules would not capture the normative characteristic of the nonbreaking characters (class GL), nor the normative behavior of WJ. The following three rules deal with the normative behavior of GL, WJ and SP. The order of these three rules ensures that SP cannot override the word joining action of WJ, but can override the non-breaking action of a GL.

LB 11b  Do not break before or after WORD JOINER and related characters.

× WJ

WJ ×

LB 12  Break after spaces.

SP ÷

 

LB 13  Do not break before or after NBSP and related characters.

× GL

GL ×
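The effect of the ordering of these three rules can be illustrated with a minimal Python sketch. It assumes a first-match cascade restricted to LB 11b, LB 12 and LB 13; the rule encoding and the name `status_between` are hypothetical, and all other rules of the algorithm are left out.

```python
NO_BREAK, BREAK = "no break", "break"

# (class before, status, class after); None stands for "any class".
RULES_IN_ORDER = [
    (None, NO_BREAK, "WJ"),  # LB 11b: no break before WJ
    ("WJ", NO_BREAK, None),  # LB 11b: no break after WJ
    ("SP", BREAK, None),     # LB 12: break after SP
    (None, NO_BREAK, "GL"),  # LB 13: no break before GL
    ("GL", NO_BREAK, None),  # LB 13: no break after GL
]

def status_between(before, after, rules=RULES_IN_ORDER):
    """Return the status chosen by the first rule that matches the pair."""
    for left, status, right in rules:
        if left in (None, before) and right in (None, after):
            return status
    return BREAK  # assumed fallback for pairs not covered by these rules

sp_then_wj = status_between("SP", "WJ")  # LB 11b fires before LB 12
sp_then_gl = status_between("SP", "GL")  # LB 12 fires before LB 13
al_then_gl = status_between("AL", "GL")  # only LB 13 applies
print(sp_then_wj, sp_then_gl, al_then_gl)
```

In this reduced model a space followed by WJ stays unbroken while a space followed by GL still allows a break, matching the intent described above.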

 

Analysis

I agree with Mark's analysis insofar as determination of which rules are normative really doesn't follow directly from the way we've divided classes into normative (non-overridable) and informative. However, the problem with not asserting any normative values is that it is impossible to talk about WJ, GL etc. unless the membership of these classes is fixed, and Mark explicitly recognizes this above.

After cleaning up some of the loose ends that were discovered in the discussion, it is possible to come to a coherent proposal as follows:

How to set limits on tailoring

As before (perhaps after some slight adjustment) certain classes are normative and have required membership. All other classes are informative (or also normative, doesn't matter), but can be overridden arbitrarily. In addition, as in Mark's proposal, a number of rules are normative, and may not be overridden. However, these rules must include LB 11b, LB 12 and LB 13.

Finally, there exists a new meta-rule that would limit tailoring as follows: New classes and new rules may be introduced by tailoring, and overridable classes may be overridden, except that tailoring may not introduce any rule between rule 5 and 11b that starts or ends in a ÷ or contains a term with one of the normative classes, SP, WJ, ZW, and GL.

The effect of this meta rule is that the only rules that can be introduced are those that prevent line breaking after a SP character (a legitimate tailoring) or those that could be rearranged to a different position.

Some additional comments:

Rule LB14, about 'unresolved CB' really doesn't belong, because LB1 requires that all CB's are resolved. However, it gives an effective default resolution for CB and one that may not match any existing LB class. I would agree that this default resolution has no business being normative, so rule LB14 should not be a normative rule.

This leaves one more normative class "CM" which is an amalgamation of control codes, default ignorables and combining marks. Disallowing the tailoring for combining marks would make some sense, but the default treatment of control sequences and default ignorables is merely some sort of 'best practice'. However, unless there's evidence that such tailoring is both needed (and in fact observable) we might keep this as a normative rule:

LB 7b  Do not break a combining character sequence; treat it as if it has the LB class of the base character in all of the following rules.

Treat X CM* as if it were X, where X is any line break class except SP, BK, CR, LF, NL or ZW.
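As an illustration of what "treat X CM* as if it were X" means for an implementation, here is a minimal Python sketch that works on a list of already assigned line breaking classes. The function name `absorb_combining_marks` is hypothetical. The sketch also assumes that a CM which cannot attach (at the start, or after one of the excluded classes) is simply left in place for a later rule to handle, which is a choice made for this example and not something stated above.

```python
# Classes that a following CM does not attach to, per the rule text.
NO_ATTACH = {"SP", "BK", "CR", "LF", "NL", "ZW"}

def absorb_combining_marks(classes):
    """Drop every CM that follows an attachable base class (or its CM run)."""
    resolved = []
    attachable = False  # True while the last kept class can absorb a CM
    for cls in classes:
        if cls == "CM" and attachable:
            continue  # X CM* behaves like X, so the CM disappears
        resolved.append(cls)
        # Assumption: an unattached CM does not itself become a base.
        attachable = cls != "CM" and cls not in NO_ATTACH
    return resolved

# AL CM CM SP CM NU: the first two CMs join AL, the third follows SP.
print(absorb_combining_marks(["AL", "CM", "CM", "SP", "CM", "NU"]))
```

All later rules would then run on the shortened list, so no break opportunity can arise inside a combining character sequence.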

Alternatively, we could allow partial tailoring of the CM class to allow those characters that are not combining characters to be tailored in other ways, while preventing the tailoring of combining marks.

Mark made the following editorial comment.

Side note: the phrase "Where X is any line break class except SP, BK, CR, LF, NL or ZW." should be moved up and made italic. It is really part of rule 7b, but the formatting makes it look like it is not.

The formatting is indeed incorrect; the sentence needs to be centered, or even joined with the 'treat X...' sentence. But the italic text is not the rule, only a mnemonic for the rule. (I've shown this above.)


For comparison, here is a generated version of the rules, with "normative" on each line where a normative property value occurs, and with those lines in bold so they stand out. (These are in CLDR format, so there are some differences in formatting and rules, although the results are the same as the UAX.)

<!-- LB 3a Always break after hard line breaks (but never between CR and LF). -->
<rule id="3.1" normative="true"> $BK ÷ </rule>
<!-- LB 3b Treat CR followed by LF, as well as CR, LF and NL as hard line breaks. -->
<rule id="3.21" normative="true"> $CR × $LF </rule>
<rule id="3.22" normative="true"> $CR ÷ </rule>
<rule id="3.23" normative="true"> $LF ÷ </rule>
<rule id="3.24" normative="true"> $NL ÷ </rule>
<!-- LB 3c Do not break before hard line breaks. -->
<rule id="3.3" normative="true"> × ( $BK | $CR | $LF | $NL ) </rule>
<!-- LB 4 Do not break before spaces or zero-width space. -->
<rule id="4.01" normative="true"> × $SP </rule>
<rule id="4.02" normative="true"> × $ZW </rule>
<!-- LB 5 Break after zero-width space. -->
<rule id="5" normative="true"> $ZW ÷ </rule>
<!-- LB 7b Do not break a combining character sequence; treat it as if it has the LB class of the base character in all of the following rules.  (Where X is any line break class except SP, BK, CR, LF, NL or ZW.)-->
<rule id="7.2" normative="true"> × $CM </rule>
<!-- WARNING: this is done by modifying the variable values for all but SP.... That is, $AL is really ($AI $CM*)! -->
<!-- LB 8 Do not break before ‘]’ or ‘!’ or ‘;’ or ‘/’, even after spaces. -->
<rule id="8.01"> × $CL </rule>
<rule id="8.02"> × $EX </rule>
<rule id="8.03"> × $IS </rule>
<rule id="8.04"> × $SY </rule>
<!-- LB 9 Do not break after ‘[’, even after spaces. -->
<rule id="9" normative="true"> $OP $SP* × </rule>
<!-- LB 10 Do not break within ‘"[’, even with intervening spaces. -->
<rule id="10" normative="true"> $QU $SP* × $OP </rule>
<!-- LB 11 Do not break within ‘]h’, even with intervening spaces. -->
<rule id="11" normative="true"> $CL $SP* × $NS </rule>
<!-- LB 11a Do not break within ‘——’, even with intervening spaces. -->
<rule id="11.1" normative="true"> $B2 $SP* × $B2 </rule>
<!-- LB 11b Do not break before or after WORD JOINER and related characters. -->
<rule id="11.21" normative="true"> × $WJ </rule>
<rule id="11.22" normative="true"> $WJ × </rule>
<!-- LB 12 Break after spaces. -->
<rule id="12" normative="true"> $SP ÷ </rule>
<!-- LB 13 Do not break before or after NBSP and related characters. -->
<rule id="13.01" normative="true"> × $GL </rule>
<rule id="13.02" normative="true"> $GL × </rule>
<!-- LB 14 Do not break before or after ‘"’. -->
<rule id="14.01"> × $QU </rule>
<rule id="14.02"> $QU × </rule>
<!-- LB 14a Break before and after unresolved CB. -->
<rule id="14.12" normative="true"> ÷ $CB </rule>
<rule id="14.13" normative="true"> $CB ÷ </rule>
<!-- LB 15 Do not break before hyphen-minus, other hyphens, fixed-width spaces, small kana and other non-starters, or after acute accents. -->
<rule id="15.01"> × $BA </rule>
<rule id="15.02"> × $HY </rule>
<rule id="15.03"> × $NS </rule>
<rule id="15.04"> $BB × </rule>
<!-- LB 16 Do not break between two ellipses, or between letters or numbers and ellipsis. -->
<rule id="16.01"> $AL × $IN </rule>
<rule id="16.02"> $ID × $IN </rule>
<rule id="16.03"> $IN × $IN </rule>
<rule id="16.04"> $NU × $IN </rule>
<!-- LB 17 Do not break within ‘a9’, ‘3a’, or ‘H%’. -->
<rule id="17.01"> $ID × $PO </rule>
<rule id="17.02"> $AL × $NU </rule>
<rule id="17.03"> $NU × $AL </rule>
<!-- LB 18 Do not break between the following pairs of classes. -->
<!-- Using customization 7!! -->
<!-- LB 18-alternative: $PR? ( $OP | $HY )? $NU ($NU | $SY | $IS)* $CL? $PO? -->
<!-- Insert × every place it could go. However, make sure that at least one thing is concrete, otherwise would cause $NU to not break before or after -->
<rule id="18.111"> $PR × ( $OP | $HY )? $NU </rule>
<rule id="18.112"> ( $OP | $HY ) × $NU </rule>
<rule id="18.113"> $NU × ($NU | $SY | $IS) </rule>
<rule id="18.114"> $NU ($NU | $SY | $IS)* × ($NU | $SY | $IS) </rule>
<rule id="18.115"> $NU ($NU | $SY | $IS)* $CL? × $PO </rule>
<!-- 18.11) $CL × $PO -->
<!-- 18.12) $HY × $NU -->
<!-- 18.13) $IS × $NU -->
<!-- 18.13) $NU × $NU -->
<!-- 18.14) $NU × $PO -->
<rule id="18.15"> $PR × $AL </rule>
<!-- 18.16) $PR × $HY -->
<rule id="18.17"> $PR × $ID </rule>
<!-- 18.18) $PR × $NU -->
<!-- 18.19) $PR × $OP -->
<!-- 18.195) $SY × $NU -->
<!-- LB 18b Do not break a Korean syllable. -->
<rule id="18.21"> $JL × $JL | $JV | $H2 | $H3 </rule>
<rule id="18.22"> $JV | $H2 × $JV | $JT </rule>
<rule id="18.23"> $JT | $H3 × $JT </rule>
<!-- LB 18c Treat a Korean Syllable Block the same as ID. -->
<rule id="18.31"> $JL | $JV | $JT | $H2 | $H3 × $IN </rule>
<rule id="18.32"> $JL | $JV | $JT | $H2 | $H3 × $PO </rule>
<rule id="18.33"> $PR × $JL | $JV | $JT | $H2 | $H3 </rule>
<!-- LB 19 Do not break between alphabetics ("at"). -->
<rule id="19"> $AL × $AL </rule>
<!-- LB 19b Do not break between numeric punctuation and alphabetics ("e.g."). -->
<rule id="19.1"> $IS × $AL </rule>