L2/05-292

Re: Linebreak Issues
From: Mark Davis
Date: $Date: 2005/10/11 00:40:19 $

Andy came across the following page: http://www.cs.tut.fi/%7Ejkorpela/html/nobr.html

It seems to me that the author has a point on a number of issues to do with line breaking. I looked at the issues in detail, and have the following proposals:

1. Don't allow "-case" to break after the hyphen, leaving a lone hyphen at the end of a line

Add a rule:

15b Break after a hyphen, but only if it is in a word.

(AL | NU) (HY | BA) ÷ (AL | NU)
(HY | BA) ×

However, BA is an odd class, containing spaces and tabs, but also hyphen characters. The spaces really shouldn't enter into this behavior, so we should split the class into:

BS spacy stuff

BA hypheny stuff

The only resulting change is that line 1 of rule 15 changes from × BA to

× (BS | BA)
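
To make the intent of the proposed rule 15b concrete, here is a minimal sketch in Python. It works on a list of line break class names rather than on real characters, and it assumes that BA has already been split so that it only holds hyphen-like characters. The function and variable names are made up for illustration; they are not part of the UAX or of any library.

```python
# Hypothetical class sets; names follow the class abbreviations used above.
WORD_CLASSES = {"AL", "NU"}
HYPHEN_CLASSES = {"HY", "BA"}  # assumes BA holds only hyphen-like characters after the split


def break_allowed_after_hyphen(classes, i):
    """Apply the proposed rule 15b at the boundary after classes[i].

    A break is allowed only when the hyphen sits inside a word, that is,
    when it has AL or NU on both sides. Otherwise (HY | BA) x applies.
    """
    if classes[i] not in HYPHEN_CLASSES:
        raise ValueError("classes[i] must be HY or BA")
    before = classes[i - 1] if i > 0 else None
    after = classes[i + 1] if i + 1 < len(classes) else None
    return before in WORD_CLASSES and after in WORD_CLASSES


print(break_allowed_after_hyphen(["AL", "AL", "HY", "AL"], 2))  # "up-case": True
print(break_allowed_after_hyphen(["SP", "HY", "AL"], 1))        # " -case": False
```

The sketch only looks at rule 15b in isolation; in the full algorithm the earlier rules (for spaces, hard breaks and so on) would be consulted first.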

2. Opening and closing shouldn't break from alphanums

Example: person(s)
Example: »wie hier«

So add a rule

8a Don't break between alphanumerics and opening or closing punctuation

(AL | NU) × OP
CL × (AL | NU)
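
As a rough illustration of rule 8a, the following Python sketch lists the boundaries in a string where the rule would forbid a break. The character classifier is a toy that only knows ASCII letters, digits and brackets; it is an assumption made for this example and is far cruder than the real line break property data.

```python
ALNUM = {"AL", "NU"}


def toy_class(ch):
    """Very rough stand-in for the line break property (illustration only)."""
    if ch.isdigit():
        return "NU"
    if ch.isalpha():
        return "AL"
    if ch in "([{":
        return "OP"
    if ch in ")]}":
        return "CL"
    return "XX"


def rule_8a_forbids_break(left, right):
    """(AL | NU) x OP  and  CL x (AL | NU)."""
    return (left in ALNUM and right == "OP") or (left == "CL" and right in ALNUM)


def boundaries_protected_by_8a(text):
    """Return each index i where rule 8a forbids a break before text[i]."""
    classes = [toy_class(ch) for ch in text]
    return [i for i in range(1, len(classes))
            if rule_8a_forbids_break(classes[i - 1], classes[i])]


print(boundaries_protected_by_8a("person(s)"))  # [6]: no break before "("
print(boundaries_protected_by_8a("f(x)y"))      # [1, 4]
```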

3. The characters ° and % shouldn't break from following alphanumerics

Example: °C
Example: %E0 (used in URLs)

Note that IE doesn't allow the second break. Are there other customizations that IE makes that we should look at?

This is part of a broader problem that the UTC asked me to look at. The problem is that right now the PR and PO stuff really has to be overridden per language. These two classes are disjoint, and only one of them can have currency symbols and other numeric stuff in them. I have the choice of not breaking "$123" or not breaking "123$"; of not breaking "-12" or not breaking "12-" (with real minus sign). This is really unexpected behavior for users, who would expect none of these to ever break.

The only reason for separating PR and PO is ideographs, which wouldn't use spaces. So I suggest having the main numeric "keep together" rules just use both in either position. That is, change:

 PR ? ( OP | HY ) ? NU (NU | SY | IS) * CL ? PO ?

to

(PO | PR) ? ( OP | HY ) ? NU (NU | SY | IS) * CL ?   (PO | PR) ?

and change

CL × PO
NU × PO
PR × OP
PR × NU
PR × HY
PR × AL

To

(CL | NU) × (PO | PR)
(PO | PR) × (OP | NU | HY | AL)
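
The effect of the pair rules can be checked with a small Python sketch. It encodes only the six current pair rules and the two proposed ones as sets of (left, right) class pairs, and reports the boundaries next to a PR or PO that none of those rules protects. The helper name and the class sequences are hypothetical, and the sketch ignores every other rule in the algorithm.

```python
# The six current pair rules, as (left, right) class pairs.
CURRENT = {("CL", "PO"), ("NU", "PO"),
           ("PR", "OP"), ("PR", "NU"), ("PR", "HY"), ("PR", "AL")}

# The proposed replacement: PR and PO are accepted in either position.
AFFIX = ("PO", "PR")
PROPOSED = ({(left, right) for left in ("CL", "NU") for right in AFFIX}
            | {(left, right) for left in AFFIX for right in ("OP", "NU", "HY", "AL")})


def unprotected_boundaries(classes, no_break_pairs):
    """Indices i where PR/PO touches the boundary before classes[i]
    and none of the given pair rules forbids a break there."""
    return [i for i in range(1, len(classes))
            if (classes[i - 1] in AFFIX or classes[i] in AFFIX)
            and (classes[i - 1], classes[i]) not in no_break_pairs]


prefix_form = ["PR", "NU", "NU", "NU"]  # "$123", with "$" classed as PR
suffix_form = ["NU", "NU", "NU", "PR"]  # "123$", same class for "$"

print(unprotected_boundaries(prefix_form, CURRENT))   # []
print(unprotected_boundaries(suffix_form, CURRENT))   # [3]
print(unprotected_boundaries(suffix_form, PROPOSED))  # []
```

Every current pair is also in the proposed set, so under these rules the change only removes break opportunities; it does not add any.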

4. URLs

Mr. Korpela complains about URLs not breaking well in browsers. Except for the %xx problem referenced above, I think we are in fairly good shape. Here are the breaks in a sample URL (| marks the break opportunities).

http://|www.cs.tut.fi/|%|7Ejkorpela/|html/|nobr.html?|abcd=high&hijk=low#anchor|

So the issue here is just that the browsers aren't following the UAX, so long URLs aren't wrapping.

5. Conformance

I looked at conformance in detail, and here are what I think the problems are:

A. The spec says in its conformance section:

All line breaking classes are informative, except for the line breaking classes marked with a * in Table 1 Line Breaking Properties. The interpretation of characters with normative line breaking classes by all conforming implementations must be consistent with the specification of the normative property.

Conformant implementations must not tailor characters with normative line breaking classes to any of the informative line breaking classes, but may tailor characters with informative line breaking classes to one of the normative line breaking classes.

Higher-level protocols may further restrict, override, or extend the line breaking classes of certain characters in some contexts.

But the conformance clauses themselves do not limit the overrideability of values.

B. The following is listed as normative, and thus not overrideable ( http://www.unicode.org/reports/tr14/#Definitions  ).

Table 1: Line Breaking Classes (* = normative) 

BK *    Mandatory Break    NL, PS

...

SG * Surrogates

...

CB * Contingent Break Opportunity    Inline Objects

But then LB1 specifically says:

LB 1  Assign a line breaking class to each code point of the input. Resolve AI, CB, SA, SG, and XX into other line breaking classes depending on criteria outside the scope of this algorithm.

And BK includes the following, which explicitly allows choice:

“NEW LINE FUNCTION (NLF)”

New line functions provide additional explicit breaks. They are not individual characters, but are expressed as sequences of the control characters NEL, LF, and CR. What particular sequence(s) form a NLF depends on the implementation and other circumstances as described in [Unicode] Section 5.8, Newline Guidelines.

C. Moreover, clearly I have to be able to override non-normative stuff by splitting classes or adding classes. What does it really mean to say that a property value is normative? Does that mean that its interactions with every other value, normative and informative, must be described in the rules? Or only with other normative values? For example:

Rule 8 says: × CL

Rule 12 says: SP ÷

What does SP being normative mean in these cases? What happens if I tailor CL to contain a character that it didn't before, or exclude a character it did have? That would change how SP breaks. Any rule that mixes "normative" values with specific "informative" values is a real problem. The rules with only normative values are ok. But because the rules are done in order, any rule with an informative value followed by another rule with a normative value is a problem (subject to allowable rearrangement).

("Allowable rearrangement"? What the heck is he talking about? It is that if I have any list of rules that have the same break status (× or ÷), their order doesn't matter. So we could reorder all the normative rules to be ahead of informative rules if they have the same break status. But Rule 8 and 12 don't.)
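
The rearrangement point can be demonstrated with a small Python sketch of a first-match-wins rule list. It is a toy model under stated assumptions: only pairs of adjacent classes are considered, only four classes exist, and the third rule (no break before GL) is included purely as a second "no break" rule to swap with rule 8. None of the names come from the UAX.

```python
from itertools import product

CLASSES = ["SP", "CL", "AL", "GL"]  # a small sample of classes, for illustration
ANY = None                          # wildcard: matches any class

# Each rule is (left, right, action); the first matching rule wins.
RULE_8 = (ANY, "CL", "no_break")        # x CL
RULE_12 = ("SP", ANY, "break")          # SP /
RULE_NO_GL = (ANY, "GL", "no_break")    # x GL, a second rule with the same status as rule 8


def decide(rules, left, right):
    """Return the action of the first rule that matches the pair."""
    for rule_left, rule_right, action in rules:
        if rule_left in (ANY, left) and rule_right in (ANY, right):
            return action
    return "break"  # default when no rule applies


def same_results(rules_a, rules_b):
    """True if both orderings agree on every pair of classes."""
    return all(decide(rules_a, left, right) == decide(rules_b, left, right)
               for left, right in product(CLASSES, repeat=2))


print(same_results([RULE_8, RULE_NO_GL], [RULE_NO_GL, RULE_8]))  # True: same status
print(same_results([RULE_8, RULE_12], [RULE_12, RULE_8]))        # False: SP CL differs
```

With rule 8 first, the pair SP CL gets "no_break"; with rule 12 first, it gets "break". That is the pair where tailoring CL changes how SP behaves.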

Moreover, all word processors allow people to override breaks; not breaking within stuff that has spaces, breaking within what would otherwise be words. We have to reflect this.

Using normative/informative values just doesn't work. I only see two real possibilities (and favor the first):

  1. Allow arbitrary tailoring: dump the text that tries to make a distinction between normative and informative, and say that all the properties are overrideable normative properties.
  2. If we want to rescue some limitations, say that the rules 3a, 3b, 3c, 4, 5 are normative: any conformant linebreak implementation has to break or not break at the places specified by these rules with untailored property values. Otherwise implementations are free to break/not-break where they want.

LB 3a  Always break after hard line breaks (but never between CR and LF).

BK !

LB 3b  Treat CR followed by LF, as well as CR, LF and NL as hard line breaks.

CR × LF

CR !

LF !

NL !

LB 3c  Do not break before hard line breaks.

× ( BK | CR | LF | NL )

LB 4  Do not break before spaces or zero-width space.

× SP

× ZW

LB 5  Break after zero-width space.

ZW ÷

The one other step we would have to take would be to disallow tailoring of BK as listed in the text. Either that, or break out the BK rules above so that they can be non-normative.

Why pick these rules? Because they don't involve any informative values, and do represent some sort of 'hard' conditions.
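
Under option 2, the normative core would be small enough to state as a single function. The following Python sketch is one possible reading of rules 3a through 5, applied in order to a pair of adjacent classes with untailored values. The function name and the result strings are invented for this example, and a result of None simply means that these rules leave the decision to the implementation.

```python
HARD_BREAKS = {"BK", "CR", "LF", "NL"}


def core_rule(left, right):
    """Decide the boundary between two adjacent classes using only
    rules 3a, 3b, 3c, 4 and 5, in that order.

    Returns "mandatory", "no_break", "break", or None when none of
    these rules applies.
    """
    if left == "CR" and right == "LF":
        return "no_break"    # 3b: CR x LF (checked first, per "never between CR and LF")
    if left in HARD_BREAKS:
        return "mandatory"   # 3a and 3b: BK !, CR !, LF !, NL !
    if right in HARD_BREAKS:
        return "no_break"    # 3c: x ( BK | CR | LF | NL )
    if right in ("SP", "ZW"):
        return "no_break"    # 4: x SP, x ZW
    if left == "ZW":
        return "break"       # 5: ZW /
    return None              # left to the implementation


print(core_rule("CR", "LF"))  # no_break
print(core_rule("LF", "AL"))  # mandatory
print(core_rule("ZW", "SP"))  # no_break (rule 4 comes before rule 5)
print(core_rule("AL", "AL"))  # None
```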

Side note: the phrase "Where X is any line break class except SP, BK, CR, LF, NL or ZW." should be moved up and made italic. It is really part of rule 7b, but the formatting makes it look like it is not.


For comparison, here is a generated version of the rules, with normative="true" on each line where a normative property value occurs, and with those lines in bold so they stand out. (These are in CLDR format, so there are some differences in formatting and rules, although the results are the same as the UAX.)

<!-- LB 3a Always break after hard line breaks (but never between CR and LF). -->
<rule id="3.1" normative="true"> $BK ÷ </rule>
<!-- LB 3b Treat CR followed by LF, as well as CR, LF and NL as hard line breaks. -->
<rule id="3.21" normative="true"> $CR × $LF </rule>
<rule id="3.22" normative="true"> $CR ÷ </rule>
<rule id="3.23" normative="true"> $LF ÷ </rule>
<rule id="3.24" normative="true"> $NL ÷ </rule>
<!-- LB 3c Do not break before hard line breaks. -->
<rule id="3.3" normative="true"> × ( $BK | $CR | $LF | $NL ) </rule>
<!-- LB 4 Do not break before spaces or zero-width space. -->
<rule id="4.01" normative="true"> × $SP </rule>
<rule id="4.02" normative="true"> × $ZW </rule>
<!-- LB 5 Break after zero-width space. -->
<rule id="5" normative="true"> $ZW ÷ </rule>
<!-- LB 7b Do not break a combining character sequence; treat it as if it has the LB class of the base character in all of the following rules.  (Where X is any line break class except SP, BK, CR, LF, NL or ZW.)-->
<rule id="7.2" normative="true"> × $CM </rule>
<!-- WARNING: this is done by modifying the variable values for all but SP.... That is, $AL is really ($AI $CM*)! -->
<!-- LB 8 Do not break before ‘]’ or ‘!’ or ‘;’ or ‘/’, even after spaces. -->
<rule id="8.01"> × $CL </rule>
<rule id="8.02"> × $EX </rule>
<rule id="8.03"> × $IS </rule>
<rule id="8.04"> × $SY </rule>
<!-- LB 9 Do not break after ‘[’, even after spaces. -->
<rule id="9" normative="true"> $OP $SP* × </rule>
<!-- LB 10 Do not break within ‘"[’, even with intervening spaces. -->
<rule id="10" normative="true"> $QU $SP* × $OP </rule>
<!-- LB 11 Do not break within ‘]h’, even with intervening spaces. -->
<rule id="11" normative="true"> $CL $SP* × $NS </rule>
<!-- LB 11a Do not break within ‘——’, even with intervening spaces. -->
<rule id="11.1" normative="true"> $B2 $SP* × $B2 </rule>
<!-- LB 11b Do not break before or after WORD JOINER and related characters. -->
<rule id="11.21" normative="true"> × $WJ </rule>
<rule id="11.22" normative="true"> $WJ × </rule>
<!-- LB 12 Break after spaces. -->
<rule id="12" normative="true"> $SP ÷ </rule>
<!-- LB 13 Do not break before or after NBSP and related characters. -->
<rule id="13.01" normative="true"> × $GL </rule>
<rule id="13.02" normative="true"> $GL × </rule>
<!-- LB 14 Do not break before or after ‘"’. -->
<rule id="14.01"> × $QU </rule>
<rule id="14.02"> $QU × </rule>
<!-- LB 14a Break before and after unresolved CB. -->
<rule id="14.12" normative="true"> ÷ $CB </rule>
<rule id="14.13" normative="true"> $CB ÷ </rule>
<!-- LB 15 Do not break before hyphen-minus, other hyphens, fixed-width spaces, small kana and other non-starters, or after acute accents. -->
<rule id="15.01"> × $BA </rule>
<rule id="15.02"> × $HY </rule>
<rule id="15.03"> × $NS </rule>
<rule id="15.04"> $BB × </rule>
<!-- LB 16 Do not break between two ellipses, or between letters or numbers and ellipsis. -->
<rule id="16.01"> $AL × $IN </rule>
<rule id="16.02"> $ID × $IN </rule>
<rule id="16.03"> $IN × $IN </rule>
<rule id="16.04"> $NU × $IN </rule>
<!-- LB 17 Do not break within ‘a9’, ‘3a’, or ‘H%’. -->
<rule id="17.01"> $ID × $PO </rule>
<rule id="17.02"> $AL × $NU </rule>
<rule id="17.03"> $NU × $AL </rule>
<!-- LB 18 Do not break between the following pairs of classes. -->
<!-- Using customization 7!! -->
<!-- LB 18-alternative: $PR? ( $OP | $HY )? $NU ($NU | $SY | $IS)* $CL? $PO? -->
<!-- Insert × every place it could go. However, make sure that at least one thing is concrete, otherwise would cause $NU to not break before or after -->
<rule id="18.111"> $PR × ( $OP | $HY )? $NU </rule>
<rule id="18.112"> ( $OP | $HY ) × $NU </rule>
<rule id="18.113"> $NU × ($NU | $SY | $IS) </rule>
<rule id="18.114"> $NU ($NU | $SY | $IS)* × ($NU | $SY | $IS) </rule>
<rule id="18.115"> $NU ($NU | $SY | $IS)* $CL? × $PO </rule>
<!-- 18.11) $CL × $PO -->
<!-- 18.12) $HY × $NU -->
<!-- 18.13) $IS × $NU -->
<!-- 18.13) $NU × $NU -->
<!-- 18.14) $NU × $PO -->
<rule id="18.15"> $PR × $AL </rule>
<!-- 18.16) $PR × $HY -->
<rule id="18.17"> $PR × $ID </rule>
<!-- 18.18) $PR × $NU -->
<!-- 18.19) $PR × $OP -->
<!-- 18.195) $SY × $NU -->
<!-- LB 18b Do not break a Korean syllable. -->
<rule id="18.21"> $JL × $JL | $JV | $H2 | $H3 </rule>
<rule id="18.22"> $JV | $H2 × $JV | $JT </rule>
<rule id="18.23"> $JT | $H3 × $JT </rule>
<!-- LB 18c Treat a Korean Syllable Block the same as ID. -->
<rule id="18.31"> $JL | $JV | $JT | $H2 | $H3 × $IN </rule>
<rule id="18.32"> $JL | $JV | $JT | $H2 | $H3 × $PO </rule>
<rule id="18.33"> $PR × $JL | $JV | $JT | $H2 | $H3 </rule>
<!-- LB 19 Do not break between alphabetics ("at"). -->
<rule id="19"> $AL × $AL </rule>
<!-- LB 19b Do not break between numeric punctuation and alphabetics ("e.g."). -->
<rule id="19.1"> $IS × $AL </rule>