Date 10/11/05

1. Don't allow "-case" to break after the hyphen, leaving a lone hyphen at the end of a line

Add a rule:

15b Break after a hyphen, but only if it is in a word.

(AL | NU) (HY | BA) ÷ (AL | NU)
(HY | BA) ×

However, BA is an odd class, containing spaces and tabs, but also hyphen characters. The spaces really shouldn't enter into this behavior, so we should split the class into:

BS spacy stuff

BA hypheny stuff

The only resulting change is that the first line of rule 15 changes from × BA to:

× (BS | BA)
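
As a rough illustration of the proposed rule 15b, the following sketch decides the position after a hyphen from the classes on either side. The function name and the list-of-classes input are assumptions made for this example; they are not part of the UAX.

```python
WORD = {"AL", "NU"}      # alphabetic and numeric classes
HYPHENS = {"HY", "BA"}   # hyphen-like classes named in the proposed rule

def break_allowed_after(classes, i):
    """Apply proposed rule 15b to the position after classes[i].

    Returns True (break allowed), False (no break), or None when the
    rule does not apply. Hypothetical helper, for illustration only.
    """
    if classes[i] not in HYPHENS:
        return None
    before = classes[i - 1] if i > 0 else None
    after = classes[i + 1] if i + 1 < len(classes) else None
    # (AL | NU) (HY | BA) ÷ (AL | NU)
    if before in WORD and after in WORD:
        return True
    # (HY | BA) ×
    return False

# "upper-case": the hyphen sits inside a word, so a break is allowed.
print(break_allowed_after(["AL", "HY", "AL"], 1))
# " -case": the hyphen follows a space, so no break after it.
print(break_allowed_after(["SP", "HY", "AL"], 1))
```

The sketch looks at a single rule in isolation; in the full algorithm, earlier rules may already have decided the position.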

3. The characters ° and % shouldn't break from following alphanumerics

Example: °C
Example: %E0 (used in URLs)

Note that IE doesn't allow the second break. Are there other customizations that IE makes that we should look at?

This is part of a broader problem that the UTC asked me to look at. The problem is that right now the PR and PO stuff really has to be overridden per language. These two classes are disjoint, and only one of them can have currency symbols and other numeric stuff in them. I have the choice of not breaking "$123" or not breaking "123$"; of not breaking "-12" or not breaking "12-" (with real minus sign). This is really unexpected behavior for users, who would expect none of these to ever break.

PR and PO are separated only because of ideographs, which don't use spaces. So I suggest having the main numeric "keep together" rules just use both in either position. That is, change:

PR ? ( OP | HY ) ? NU (NU | SY | IS) * CL ? PO ?

to

(PO | PR) ? ( OP | HY ) ? NU (NU | SY | IS) * CL ? (PO | PR) ?

and change

CL × PO
NU × PO
PR × OP
PR × NU
PR × HY
PR × AL

to

(CL | NU) × (PO | PR)
(PO | PR) × (OP | NU | HY | AL)
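
To make the difference between the two numeric expressions concrete, here is a minimal sketch that encodes each as a regular expression over space-separated class names. The encoding, the constant names, and the assumption that the currency symbol is classed PR are all choices made for this example.

```python
import re

# Current expression: PR? (OP | HY)? NU (NU | SY | IS)* CL? PO?
CURRENT = re.compile(
    r"(?:PR )?(?:(?:OP|HY) )?NU(?: (?:NU|SY|IS))*(?: CL)?(?: PO)?"
)
# Proposed expression: (PO | PR)? (OP | HY)? NU (NU | SY | IS)* CL? (PO | PR)?
PROPOSED = re.compile(
    r"(?:(?:PO|PR) )?(?:(?:OP|HY) )?NU(?: (?:NU|SY|IS))*(?: CL)?(?: (?:PO|PR))?"
)

def kept_together(pattern, classes):
    """True if the whole class sequence matches the numeric expression."""
    return pattern.fullmatch(" ".join(classes)) is not None

dollar_first = ["PR", "NU", "NU", "NU"]   # "$123", assuming "$" is PR
dollar_last = ["NU", "NU", "NU", "PR"]    # "123$"

print(kept_together(CURRENT, dollar_first), kept_together(CURRENT, dollar_last))
print(kept_together(PROPOSED, dollar_first), kept_together(PROPOSED, dollar_last))
```

Under the current expression only one of the two orderings is kept together; under the proposed one both are.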

4. URLs

Mr. Korpela complains about URLs not breaking well in browsers. Except for the %xx problem referenced above, I think we are in fairly good shape. Here are the breaks in a sample URL (| marks the break opportunities).
http://|www.cs.tut.fi/|%|7Ejkorpela/|html/|nobr.html?|abcd=high&hijk=low#anchor|
So the issue here is just that the browsers aren't following the UAX, so long URLs aren't wrapping.

Note, I would regard any improved treatment of %xx for URLs as a coincidental side effect. My position on URLs is that they should be recognized and tailored explicitly by the application. The reason for that is that fashions in addressing are likely to change much more rapidly than other features of common orthographies, so building URL awareness into the default algorithm should be a non-goal.

5. Conformance

I looked at conformance in detail, and here are what I think the problems are:

A. The spec says in its conformance section:

All line breaking classes are informative, except for the line breaking classes marked with a * in Table 1 Line Breaking Properties. The interpretation of characters with normative line breaking classes by all conforming implementations must be consistent with the specification of the normative property.

Conformant implementations must not tailor characters with normative line breaking classes to any of the informative line breaking classes, but may tailor characters with informative line breaking classes to one of the normative line breaking classes.

Higher-level protocols may further restrict, override, or extend the line breaking classes of certain characters in some contexts.

But the conformance clauses themselves do not limit the overrideability of values.

B. The following is listed as normative, and thus not overrideable ( http://www.unicode.org/reports/tr14/#Definitions ).

Table 1: Line Breaking Classes (* = normative)

BK * Mandatory Break NL, PS

...

SG * Surrogates

...

CB * Contingent Break Opportunity Inline Objects

But then LB1 specifically says:

LB 1 Assign a line breaking class to each code point of the input. Resolve AI, CB, SA, SG, and XX into other line breaking classes depending on criteria outside the scope of this algorithm.

SG -- this class is a holdover from when the algorithm was defined in terms of unaware UTF-16 support. SG is essentially an error condition, but we don't require implementations to flag this error. Rule 1 puts the behavior of unpaired surrogates outside the scope of the algorithm, while allowing implementations to define the fallback behavior for them. That is the intended normative behavior for SG.

CB -- is the object replacement character. It's a weird puppy. 'nuff said. Actually, its function is to behave in a way that's outside the scope of the algorithm, but it's not an error condition, so that's the reason it's not part of class SG.

And BK includes the following, which explicitly allows choice:

“NEW LINE FUNCTION (NLF)”

New line functions provide additional explicit breaks. They are not individual characters, but are expressed as sequences of the control characters NEL, LF, and CR. What particular sequence(s) form a NLF depends on the implementation and other circumstances as described in [Unicode] Section 5.8, Newline Guidelines.

C. Moreover, clearly I have to be able to override non-normative stuff by splitting classes or adding classes. What does it really mean to say that a property value is normative? Does that mean that its interactions with every other value, normative and informative, must be as described in the rules? Or only with other normative values? For example:

Rule 8 says: × CL

Rule 12 says: SP ÷

As currently intended, this means that the normative behavior of SP is that there is always a break after it, unless overridden by the "don't break before me" property of another character. The aspect that we are trying to model here is that it is SP and not some other character that serves as the generic "allow a break by default" character. Tailoring SP so that it would act like a letter, or like NBSP, is the kind of scenario that we feel should not be allowed.

It does and it doesn't. As we are allowing characters to not allow a break before them, this situation is fine (even if not elegantly stated). In some sense, this is an artifact of re-expressing the algorithm (which was originally conceived and implemented via pair table) as a set of cascading rules. In the pair table, the special role of SP is built into the infrastructure, so to speak.

The problem is that any mixture of "normative" values and specific "informative" values causes trouble. The rules with only normative values are OK. But because the rules are applied in order, any rule with an informative value followed by another rule with a normative value is a problem (subject to allowable rearrangement).

("Allowable rearrangement"? What the heck is he talking about? It is that if I have any list of rules that have the same break status (× or ÷), their order doesn't matter. So we could reorder all the normative rules to be ahead of informative rules if they have the same break status. But Rule 8 and 12 don't.)
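
The point about rearrangement can be checked with a small sketch, assuming a simplified first-match engine that looks only at one pair of classes (the real rules also use longer contexts such as SP*). The tuple encoding and the function names are assumptions made for this example.

```python
from itertools import permutations, product

CLASSES = ["AL", "CL", "EX", "SP"]   # small sample of classes for the check

def decide(rules, before, after):
    """First matching rule wins; None acts as a wildcard.

    Returns the break status of the matching rule, or None if no rule applies.
    """
    for left, right, status in rules:
        if left in (None, before) and right in (None, after):
            return status
    return None

def same_results(rules_a, rules_b):
    """True if both rule orders decide every sample pair identically."""
    return all(
        decide(rules_a, b, a) == decide(rules_b, b, a)
        for b, a in product(CLASSES, repeat=2)
    )

RULE_8_CL = (None, "CL", "×")   # × CL
RULE_8_EX = (None, "EX", "×")   # × EX
RULE_12 = ("SP", None, "÷")     # SP ÷

# Rules with the same break status can be reordered freely.
same_status = [RULE_8_CL, RULE_8_EX]
print(all(same_results(same_status, list(p)) for p in permutations(same_status)))

# Rules 8 and 12 differ in status, so their order matters for the pair SP CL.
print(decide([RULE_8_CL, RULE_12], "SP", "CL"))
print(decide([RULE_12, RULE_8_CL], "SP", "CL"))
```

With rule 8 first, the pair SP CL is not broken; with rule 12 first, it is, which is why these two rules cannot be rearranged.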

Moreover, all word processors allow people to override breaks; not breaking within stuff that has spaces, breaking within what would otherwise be words. We have to reflect this.

Using normative/informative values just doesn't work. I only see two real possibilities (and favor the first):

Allow arbitrary tailoring: dump the text that tries to make a distinction between normative and informative, and say that all the properties are overrideable normative properties.

If we want to rescue some limitations, say that the rules 3a, 3b, 3c, 4, 5 are normative: any conformant linebreak implementation has to break or not break at the places specified by these rules with untailored property values. Otherwise implementations are free to break/not-break where they want.

The fact that Mark needs to refer to "untailored" here is indicative of a problem with his approach, see below.

In the newline guidelines these were guidelines. Here they have become normative and non-tailorable. For the sake of an example, I can see nothing wrong with writing an application that does not support the IBM NEL line ending convention. However, I agree that by default, supporting all legacy conventions makes for a robust implementation.

LB 3c Do not break before hard line breaks.

× ( BK | CR | LF | NL )

LB 4 Do not break before spaces or zero-width space.

× SP

× ZW

LB 5 Break after zero-width space.

ZW ÷

The one other step we would have to take would be to disallow tailoring of BK as listed in the text. Either that, or break out the BK rules above so that they can be non-normative.

Why pick these rules? Because they don't involve any informative values, and do represent some sort of 'hard' conditions.

This set of rules would not capture the normative characteristic of the nonbreaking characters (class GL), nor the normative behavior of WJ. The following three rules deal with the normative behavior of GL, WJ and SP. The order of these three rules ensures that SP cannot override the word joining action of WJ, but can override the non-breaking action of a GL.
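
The effect of the ordering of these three rules can be sketched as follows, assuming a simplified model that looks only at the two classes on either side of a position. The function name and its return convention are assumptions made for this example.

```python
def break_between(before, after):
    """Apply rules 11b, 12 and 13, in that order, to one pair of classes.

    Returns False (no break), True (break allowed), or None if none of
    the three rules applies. Hypothetical helper, for illustration only.
    """
    if after == "WJ" or before == "WJ":   # LB 11b: × WJ, WJ ×
        return False
    if before == "SP":                    # LB 12: SP ÷
        return True
    if after == "GL" or before == "GL":   # LB 13: × GL, GL ×
        return False
    return None

print(break_between("SP", "WJ"))  # rule 11b comes first: SP cannot override WJ
print(break_between("SP", "GL"))  # rule 12 comes before 13: SP overrides GL
```

Swapping the second and third checks would make SP GL non-breaking, which is not the behavior described above.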

I agree with Mark's analysis insofar as determination of which rules are normative really doesn't follow directly from the way we've divided classes into normative (non-overridable) and informative. However, the problem with not asserting any normative values is that it is impossible to talk about WJ, GL etc. unless the membership of these classes is fixed, and Mark explicitly recognizes this above.

After cleaning up some of the loose ends that were discovered in the discussion, it is possible to come to a coherent proposal as follows:

As before (perhaps after some slight adjustment) certain classes are normative and have required membership. All other classes are informative (or also normative, doesn't matter), but can be overridden arbitrarily. In addition, as in Mark's proposal, a number of rules are normative, and may not be overridden. However, these rules must include LB 11b, LB 12 and LB 13.

Finally, there exists a new meta-rule that would limit tailoring as follows: New classes and new rules may be introduced by tailoring, and overridable classes may be overridden, except that tailoring may not introduce any rule between rule 5 and 11b that starts or ends in a ÷ or contains a term with one of the normative classes, SP, WJ, ZW, and GL.

The effect of this meta rule is that the only rules that can be introduced are those that prevent line breaking after a SP character (a legitimate tailoring) or those that could be rearranged to a different position.

Rule LB14, about 'unresolved CB' really doesn't belong, because LB1 requires that all CB's are resolved. However, it gives an effective default resolution for CB and one that may not match any existing LB class. I would agree that this default resolution has no business being normative, so rule LB14 should not be a normative rule.

This leaves one more normative class "CM" which is an amalgamation of control codes, default ignorables and combining marks. Disallowing the tailoring for combining marks would make some sense, but the default treatment of control sequences and default ignorables is merely some sort of 'best practice'. However, unless there's evidence that such tailoring is both needed (and in fact observable) we might keep this as a normative rule:

Alternatively, we could allow partial tailoring of the CM class to allow those characters that are not combining characters to be tailored in other ways, while preventing the tailoring of combining marks.

The formatting is indeed incorrect; the sentence needs to be centered, or even joined with the 'treat x...' sentence. But the italic text is not the rules, only mnemonics for the rules. (I've shown this above).

For comparison, here is a generated version of the rules, with "normative" marked on each line that contains a normative property value, and with those lines bolded so that they stand out. (These are in CLDR format, so there are some differences in formatting and rules, although the results are the same as the UAX.)


<rule id="3.1" normative="true"> $BK ÷ </rule>

<rule id="3.21" normative="true"> $CR × $LF </rule>
<rule id="3.22" normative="true"> $CR ÷ </rule>
<rule id="3.23" normative="true"> $LF ÷ </rule>
<rule id="3.24" normative="true"> $NL ÷ </rule>

<rule id="3.3" normative="true"> × ( $BK | $CR | $LF | $NL ) </rule>

<rule id="4.01" normative="true"> × $SP </rule>
<rule id="4.02" normative="true"> × $ZW </rule>

<rule id="5" normative="true"> $ZW ÷ </rule>

<rule id="7.2" normative="true"> × $CM </rule>


<rule id="8.01"> × $CL </rule>
<rule id="8.02"> × $EX </rule>
<rule id="8.03"> × $IS </rule>
<rule id="8.04"> × $SY </rule>

<rule id="9" normative="true"> $OP $SP* × </rule>

<rule id="10" normative="true"> $QU $SP* × $OP </rule>

<rule id="11" normative="true"> $CL $SP* × $NS </rule>

<rule id="11.1" normative="true"> $B2 $SP* × $B2 </rule>

<rule id="11.21" normative="true"> × $WJ </rule>
<rule id="11.22" normative="true"> $WJ × </rule>

<rule id="12" normative="true"> $SP ÷ </rule>

<rule id="13.01" normative="true"> × $GL </rule>
<rule id="13.02" normative="true"> $GL × </rule>

<rule id="14.01"> × $QU </rule>
<rule id="14.02"> $QU × </rule>

<rule id="14.12" normative="true"> ÷ $CB </rule>
<rule id="14.13" normative="true"> $CB ÷ </rule>

<rule id="15.01"> × $BA </rule>
<rule id="15.02"> × $HY </rule>
<rule id="15.03"> × $NS </rule>
<rule id="15.04"> $BB × </rule>

<rule id="16.01"> $AL × $IN </rule>
<rule id="16.02"> $ID × $IN </rule>
<rule id="16.03"> $IN × $IN </rule>
<rule id="16.04"> $NU × $IN </rule>

<rule id="17.01"> $ID × $PO </rule>
<rule id="17.02"> $AL × $NU </rule>
<rule id="17.03"> $NU × $AL </rule>




<rule id="18.111"> $PR × ( $OP | $HY )? $NU </rule>
<rule id="18.112"> ( $OP | $HY ) × $NU </rule>
<rule id="18.113"> $NU × ($NU | $SY | $IS) </rule>
<rule id="18.114"> $NU ($NU | $SY | $IS)* × ($NU | $SY | $IS) </rule>
<rule id="18.115"> $NU ($NU | $SY | $IS)* $CL? × $PO </rule>





<rule id="18.15"> $PR × $AL </rule>

<rule id="18.17"> $PR × $ID </rule>




<rule id="18.21"> $JL × $JL | $JV | $H2 | $H3 </rule>
<rule id="18.22"> $JV | $H2 × $JV | $JT </rule>
<rule id="18.23"> $JT | $H3 × $JT </rule>

<rule id="18.31"> $JL | $JV | $JT | $H2 | $H3 × $IN </rule>
<rule id="18.32"> $JL | $JV | $JT | $H2 | $H3 × $PO </rule>
<rule id="18.33"> $PR × $JL | $JV | $JT | $H2 | $H3 </rule>

<rule id="19"> $AL × $AL </rule>

<rule id="19.1"> $IS × $AL </rule>

Re:	Linebreak Issues
From:	Mark Davis
Date:	$Date: 2005/10/11 00:40:19 $

Date 10/11/05 L2/05-314

1. Don't allow "-case" to break after the hyphen, leaving a lone hyphen at the end of a line

2. Opening and closing shouldn't break from alphanums

3. The characters ° and % shouldn't break from following alphanumerics

4. URLs

5. Conformance