L2/06-224

Review line-break Feedback
==========================

Date: 2006-06-02

At the last meeting I got the action to review the linebreak feedback
from document L2/06-202. As a result, I submit this document for the UTC agenda.

Below is my take, marked with ***. The document does make a case for
verifying the LB assignments for the various commas, periods, and
semicolons. UTC members who have expertise in Armenian, Arabic, Nko,
Syriac, Ethiopic, Canadian Syllabics, Mongolian and Coptic should take
a look at the issues listed below and help verify the current assignments.

The document makes the proposal to give two control characters specific
semantics. This should be reviewed by UTC. See below.

A./

-----------------------------------------------------------------------------------

0E2F and 0EAF should both have the BA, break after, linebreak property.

*** They currently have SA; see other SA-related comments below. It's
not clear what this buys us. These are not classed as punctuation, but
as letters in Unicode.
---

1A1F;AL # BUGINESE END OF SECTION
It seems strange that an "end of section" has lb prop AL (but I don't know
for sure that it is wrong) instead of BA.

*** This comment seems to be based solely on the name of the character,
and not informed by actual evidence of usage of this mark in Buginese.
There are many punctuation marks that occur at the end of text segments,
but that are conventionally always followed by a new paragraph. In that
case, the line break comes from the paragraph break. Some punctuation
can be used at the end of a text segment, where a line break would be
appropriate, but also elsewhere. In such cases, we give the mark the AL
property and suggest the use of ZWSP if a break opportunity is desired.
If actual evidence is found that suggests that Buginese writers expect a
line breaking opportunity at this mark, a change may be appropriate.
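
As a minimal sketch of the ZWSP suggestion above (Python; the function
name is hypothetical and not part of UAX#14), a text preparation step could
add the break opportunity explicitly wherever one is wanted after the mark:

```python
ZWSP = "\u200B"                      # ZERO WIDTH SPACE
BUGINESE_END_OF_SECTION = "\u1A1F"   # currently lb=AL, per the listing above


def add_break_opportunity_after(text: str, mark: str = BUGINESE_END_OF_SECTION) -> str:
    """Insert ZWSP after each `mark` that is followed by more text.

    Hypothetical helper: a mark at the very end of the text, or one that
    is already followed by ZWSP, is left unchanged.
    """
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if ch == mark and i + 1 < len(text) and text[i + 1] != ZWSP:
            out.append(ZWSP)
    return "".join(out)
```

Because the helper skips marks already followed by ZWSP, running it twice
gives the same result as running it once.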
----

Does Tagalog/Hanunoo/Buhid use space between words? No, I still
very much dislike SA (as any different from AL), but again, I like
consistency.

*** I don't know the answer to the question and I can't make out
what the feedback is.
---
None of the combining characters should have gotten the SA property.

*** That's a discussion of the model. The current model is to create large
runs of SA characters, and to pass these off to another algorithm (whose
details are not specified) for analysis. This model specifically excludes
the idea that the default algorithm is able to do some minimal processing
of SA scripts. A properly designed algorithm for SA should be able to
handle these characters as if they had property CM, especially when
applied to non-SA characters.
---
These should have lb EX:
203C;NS # DOUBLE EXCLAMATION MARK
203D;NS # INTERROBANG
2047;NS # DOUBLE QUESTION MARK
2048;NS # QUESTION EXCLAMATION MARK
2049;NS # EXCLAMATION QUESTION MARK

On the other hand NS and EX are very similar, and maybe should be merged.

*** The difference between NS and EX is that a break before an NS is
allowed where a space precedes it, whereas a break before an EX is
prohibited even after a space. As far as I recall, the differentiation
between NS and EX, and the difference in treatment of double punctuation,
go back to the JIS X4051 standard. There's definitely no need to merge
these two classes.
---
These should be OP:
00A1;AI # INVERTED EXCLAMATION MARK
00BF;AI # INVERTED QUESTION MARK

*** The class AI is intended to support legacy behavior for a limited
set of characters that appear as 'full width' characters in East Asian
character sets. For all of these the legacy behavior is ID. However, the
non-legacy behavior for 00A1 and 00BF is better represented as OP.
Currently UAX#14 suggests a tailoring. If legacy support for these
characters is deemed not useful, they can be moved to OP in a future
version.

In practice, if these characters are followed by letters, then AL and OP
are effectively indistinguishable. I believe that's the usage in Spanish.
---

Not sure why these have lb EX, instead of IS or PO:
060C;EX # ARABIC COMMA
061B;EX # ARABIC SEMICOLON
061E;EX # ARABIC TRIPLE DOT PUNCTUATION MARK
066A;EX # ARABIC PERCENT SIGN
06D4;EX # ARABIC FULL STOP

*** EX is for sentence ending, IS and PO are for numeric punctuation.
The percent sign looks like an oversight - at least I can't recall a
rationale for not treating it as part of a numeric expression. If no
objections are raised, we can consider this for a future version.

But see below.
---
Commas in general have a strange mixture of lb property settings:
002C;IS # COMMA
055D;AL # ARMENIAN COMMA
060C;EX # ARABIC COMMA
07F8;IS # NKO COMMA
1363;AL # ETHIOPIC COMMA
1802;BA # MONGOLIAN COMMA
1808;BA # MONGOLIAN MANCHU COMMA
3001;CL # IDEOGRAPHIC COMMA
FE10;IS # PRESENTATION FORM FOR VERTICAL COMMA
FE11;CL # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
FE50;CL # SMALL COMMA
FE51;ID # SMALL IDEOGRAPHIC COMMA
FF0C;CL # FULLWIDTH COMMA
FF64;CL # HALFWIDTH IDEOGRAPHIC COMMA

*** Having all commas behave the same would be incorrect. Where commas
are used, by default, inside numeric expressions, they would be IS.
Commas usually require a space to break from following letters or
numbers, which is true for CL but not true for EX. Assigning the class
BA to a comma allows a break after a space *before* the comma, while
this is prevented for EX. AL treats the comma like a letter (which is
kept together with other letters and numbers and following open parens,
but not some other punctuation) and ID allows a break before or after.

Collating this list and comparing the assignments is a useful sanity
check. I agree that it appears doubtful that all the choices in this
list (and I added 002C for completeness) are 100% accurate. However, any
recommendation for change must be accompanied by evidence of the desired
behavior (or examples where the current assignment produces incorrect
results).
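
The collation itself is easy to mechanize. The following is a rough
sketch (Python; the function name is hypothetical) that groups the comma
listing quoted above by its line break class, so the outliers stand out.
The data is copied from the listing in this document, not read from
LineBreak.txt:

```python
from collections import defaultdict

# Comma assignments exactly as quoted in the feedback above.
COMMAS = """\
002C;IS # COMMA
055D;AL # ARMENIAN COMMA
060C;EX # ARABIC COMMA
07F8;IS # NKO COMMA
1363;AL # ETHIOPIC COMMA
1802;BA # MONGOLIAN COMMA
1808;BA # MONGOLIAN MANCHU COMMA
3001;CL # IDEOGRAPHIC COMMA
FE10;IS # PRESENTATION FORM FOR VERTICAL COMMA
FE11;CL # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
FE50;CL # SMALL COMMA
FE51;ID # SMALL IDEOGRAPHIC COMMA
FF0C;CL # FULLWIDTH COMMA
FF64;CL # HALFWIDTH IDEOGRAPHIC COMMA
"""


def group_by_class(listing: str) -> dict[str, list[str]]:
    """Map each lb class to the characters assigned to it in `listing`."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        if not line.strip():
            continue
        fields, _, name = line.partition("#")
        code_point, lb_class = fields.strip().split(";")
        groups[lb_class].append(f"U+{code_point} {name.strip()}")
    return dict(groups)


for lb_class, members in sorted(group_by_class(COMMAS).items()):
    print(f"{lb_class} ({len(members)}): " + "; ".join(members))
```

The same function can be applied to the full stop and semicolon listings
further down by substituting the corresponding lines.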

A bit of background: the default for a comma is currently CL. This is a
bit stricter than "Western" line breaking, but required by East Asian
rules. As the design point for the algorithm is to accommodate both,
where possible, using CL as a default is appropriate. The exception is
commas that are part of numerical notation; they need to be IS.

However, a point can be made that for scripts that are unlikely to occur
in East Asian context, such 'stricter' behavior is not needed (and AL is
sufficient, where spaces are required to allow a break). On the other
hand, preventing a comma from starting a line is mainly preventing what
are marginal to poor line breaks, even outside the East Asian context,
therefore AL seems to have little to recommend it. BA is only
appropriate where x,y must be allowed to break before the y without a
space *and* where x ,y must also break after the x. (EX does the former,
but not the latter, and CL breaks only in the case x, y). BA is really
more appropriate for hyphen-like divider punctuation, not commas.
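
The three cases in the paragraph above can be written down as a small
lookup. This is only a sketch (Python) that transcribes the statements
made here for BA, EX and CL; it is not derived from the UAX#14 pair
table, and all names in it are hypothetical. The entries for a break
after comma plus space under BA and EX are an assumption based on the
general rule that a break is allowed after a space before a letter:

```python
# Contexts for a comma between letters x and y; "|" marks the candidate break.
AFTER_COMMA_NO_SPACE = "x,|y"       # break before the y without a space
BEFORE_COMMA_AFTER_SPACE = "x |,y"  # break after the x, before the comma
AFTER_COMMA_AND_SPACE = "x, |y"     # break after comma and space

# Breaks each class allows, as described in the paragraph above.
ALLOWED_BREAKS = {
    "BA": {AFTER_COMMA_NO_SPACE, BEFORE_COMMA_AFTER_SPACE, AFTER_COMMA_AND_SPACE},
    "EX": {AFTER_COMMA_NO_SPACE, AFTER_COMMA_AND_SPACE},  # last entry assumed
    "CL": {AFTER_COMMA_AND_SPACE},
}


def candidate_classes(required_breaks):
    """Return the classes (sorted) that allow every break in `required_breaks`."""
    required = set(required_breaks)
    return sorted(c for c, allowed in ALLOWED_BREAKS.items() if required <= allowed)
```

An expert who can say which of the three breaks a script actually needs
can then read off the candidate classes; for example, requiring the break
before the comma after a space leaves only BA.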

It would be helpful if experts for the various scripts could:
- verify that the Nko comma is in numeric use
- verify that Mongolian commas may appear at the beginning of a line
following a space
- verify that Arabic commas can break from preceding letters without a
space and that they don't require a space following them to break a line
- verify that Armenian and Ethiopic commas behave like letters
- verify that FE51 really needs to be ID

If any of these verifications fails, UTC should re-examine these
assignments for a future version.
---

So do full stops, some are even AL which I find particularly surprising:
002E;IS # FULL STOP
0589;IS # ARMENIAN FULL STOP
06D4;EX # ARABIC FULL STOP
0701;AL # SYRIAC SUPRALINEAR FULL STOP
0702;AL # SYRIAC SUBLINEAR FULL STOP
1362;AL # ETHIOPIC FULL STOP
166E;AL # CANADIAN SYLLABICS FULL STOP
1803;BA # MONGOLIAN FULL STOP
1809;BA # MONGOLIAN MANCHU FULL STOP
2CF9;BA # COPTIC OLD NUBIAN FULL STOP
2CFE;BA # COPTIC FULL STOP
3002;CL # IDEOGRAPHIC FULL STOP
FE12;CL # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
FE52;CL # SMALL FULL STOP
FF0E;CL # FULLWIDTH FULL STOP
FF61;CL # HALFWIDTH IDEOGRAPHIC FULL STOP

*** The situation is the same for periods as for commas.

It would be helpful if experts for the various scripts could:
- verify that 0589 is used numerically
- verify that Arabic full stop can break from preceding letters without
a space
- verify that Ethiopic and Canadian Syllabics full stops act like letters
- verify that Mongolian and Coptic full stops may appear at the
beginning of a line following a space, and that they don't require a
space following them to break a line.

If any of these verifications fails, UTC should re-examine these
assignments for a future version.
---

And semicolons (but I don't know what reversed semicolon is used for):
003B;IS # SEMICOLON
061B;EX # ARABIC SEMICOLON
1364;AL # ETHIOPIC SEMICOLON
204F;AL # REVERSED SEMICOLON
FE14;IS # PRESENTATION FORM FOR VERTICAL SEMICOLON
FE54;NS # SMALL SEMICOLON
FF1B;NS # FULLWIDTH SEMICOLON

*** The situation is somewhat similar: reviewing the issues for comma and
period would tend to point towards the correct assignment for semicolons
of the same scripts. A wrinkle is the NS. I believe that goes back to
JIS X4051.

The reversed semicolon needs review. I don't understand its usage
either; we need someone knowledgeable in its use to give input.

---

Control characters (good that VT got lb prop BK):
NBH (0083) should have the lb value WJ, like WJ and ZWNBSP.
BPH (0082) should have the lb value ZW, like ZWSP.

*** I would recommend against enshrining these in UAX#14 - unless we are
presented with evidence that they are (reasonably widely) used that way
in Unicode context.

However, this points out another issue: the current
formulation does not allow tailoring of control characters. This may be
fine, in the sense that line-breaking terminal emulation data streams
could simply be considered ipso facto a different algorithm, rather than
a tailoring of the Unicode line breaking algorithm, just as HTML
and XML parsers are really different beasts (although similar),
governed by different conformance requirements.

It's worth bringing this to UTC's attention as an issue to be resolved.
---

NL and BK are the same, so there's no need for two lb values.
So I suggest merging NL and BK to just BK, i.e. let NEL have BK.

*** UTC made the deliberate decision to not make that change when we
first realized this. If others agree that this marginal simplification
would be useful, there's no reason why we couldn't change this in a
future version - the impact on conformance is nil.
---
------------------------------------------------------