L2/03-419

From: Asmus Freytag
To: UTC
Re: Linebreak and EastAsianWidth

Someone pointed out that, embarrassingly, our linebreaking rules allow a
break inside "e.g.", and all similar 'words' that use periods.

To fix that, I propose adding a rule 19b

      IS   x   AL

and to make the corresponding change in the pair tables by setting the
intersection of the IS row and AL column from '_' to '%'.

The characters in IS are all punctuation that can appear in numeric
expressions, such as period, comma, colon, etc., none of which should be
separated from a following letter or symbol character.
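
To illustrate the effect of the changed cell, here is a minimal Python sketch.
The pair table is a toy fragment covering only AL, IS and SP, and the names
lb_class and break_positions are assumptions made for this example; they are
not taken from UAX#14 or from any implementation.

```python
# Toy class assignment: a few IS characters, space, everything else AL.
LB_CLASS = {".": "IS", ",": "IS", ":": "IS", " ": "SP"}


def lb_class(ch):
    """Return a simplified line break class for one character."""
    return LB_CLASS.get(ch, "AL")


def break_positions(text, is_al_cell):
    """Return the indexes i at which a break is allowed before text[i].

    is_al_cell is the value of the IS row / AL column cell:
      '_' direct break, '%' indirect break (only across spaces),
      '^' prohibited break.
    """
    pair = {
        ("AL", "AL"): "%",
        ("AL", "IS"): "^",
        ("IS", "IS"): "^",
        ("IS", "AL"): is_al_cell,
    }
    positions = []
    prev = None          # class of the last non-space character
    seen_space = False   # whether spaces intervened since prev
    for i, ch in enumerate(text):
        cls = lb_class(ch)
        if cls == "SP":
            seen_space = True
            continue
        if prev is not None:
            cell = pair[(prev, cls)]
            if cell == "_" or (cell == "%" and seen_space):
                positions.append(i)
        prev = cls
        seen_space = False
    return positions


print(break_positions("e.g. the", "_"))  # current cell value
print(break_positions("e.g. the", "%"))  # proposed cell value
```

With the current value the sketch reports a break opportunity before the "g";
with the proposed value only the opportunity after the space remains.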

See,
http://www.unicode.org/reports/tr14/#IS
http://www.unicode.org/reports/tr14/#AL

for the definitions of the character classes or look at LineBreak.txt in
the UCD.

Michel provided a list of differences that MS applies to the line break
properties. I have been analyzing these to see if some of their customizations
really should be part of our published default instead.

Here's my conclusion:

1) The largest difference results from a decision by MS to reduce the class
of characters which are given ambiguous or A in East Asian Width. See
EastAsianWidth.txt in the UCD for a list, or see
http://www.unicode.org/reports/tr11/ for a definition and additional
information.

In line breaking, one of the linebreak classes, AI, is ambiguous, and its
resolution depends on EAW. AI gets resolved to normal alphabetic/symbol (AL)
whenever the EAW class of the given character gets resolved from A to N,
and to ideograph (ID) when the EAW class gets resolved from A to W.
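
As a sketch of this dependency (Python; the function name and the way the
resolved width is passed in are assumptions for illustration, and how an
implementation decides between N and W for an ambiguous character is not
covered here):

```python
def resolve_ai(lb_class, resolved_eaw):
    """Resolve linebreak class AI from the resolved East Asian Width.

    lb_class     -- the character's linebreak class, e.g. "AI", "AL", "NU"
    resolved_eaw -- "N" or "W", the result of resolving EAW class A
    Classes other than AI are returned unchanged.
    """
    if lb_class != "AI":
        return lb_class
    if resolved_eaw == "N":
        return "AL"   # treated as normal alphabetic/symbol
    if resolved_eaw == "W":
        return "ID"   # treated like an ideograph
    raise ValueError("resolved_eaw must be 'N' or 'W'")


print(resolve_ai("AI", "N"), resolve_ai("AI", "W"))
```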

The EAW class assignments were done with a particular legacy environment in
mind, dated about 5-7 years ago, and were focused primarily on issues of
transcoding to legacy characters. In the meantime, the web and evolving
practices in font technology have brought forth a whole new set of issues
relating to font binding and display of text created on a variety of
systems and communicated both via Unicode and legacy character sets.

Evolving practice uses more proportional fonts, and tends to display Latin,
Greek and Cyrillic characters (even in the context of Japanese documents)
as narrow characters (the way they would appear in English, Greek or
Russian). Treating such alphabetic characters as 'ambiguous' doesn't serve
the same purpose as it did in the past and can lead to mistakes in font
bindings, causing a ransom note effect in web documents.

At the same time, it makes sense to consistently treat picture-like symbols
like ideographs when in East Asian texts, so treating them as ambiguous for
the purpose of line breaking and also font selection etc. will improve the
user experience. Differentiating between symbols that are in specific
legacy sets and those that are not does not make sense from this perspective.

There are three options:

1) change the EAW assignments
2) change the LB assignments (decouple from EAW)
3) stabilize the EAW assignments and add a new property

The first option would be a major instability. Implementations for which
the current EAW assignments produce acceptable results would fail with the
new set and vice versa.
The second option would improve the line breaking behavior, at the expense
of limiting the applicability of the revised classification to line breaking.
The third choice maintains the current structure of an underlying EAW-like
classification that other specifications (like LB) can build upon. It
would require defining a new property. The classes themselves, not only
their division of the code space, would likely be different and focused on
their different purpose.

Both two and three are reasonable choices.


2.a. Recommended changes
----------------------

2.a.1 Change all double wide combining marks from CM to GL

035D COMBINING DOUBLE BREVE
035E
035F
0360
0361
0362 COMBINING DOUBLE RIGHTWARDS ARROW BELOW

Double combining marks graphically apply to both the preceding and the
following character. Making their linebreak class GL prevents breaks if
they are applied to non-alphabetic characters. This is a reasonable
approach to this particular edge case. Practical occurrence of this
situation is low; it should not materially affect existing implementations.

2.a.2 Make canonical equivalents equal

037E GREEK QUESTION MARK -> treat like ';', change AL->IS
2126 OHM SIGN -> treat like Omega, change PO->AL


2.a.3 Treat Circled letters and digits consistently

This is an oversight in our 4.0.0 data file. There's a range of circled
letters, plus digit 0, that's treated differently from the rest of the set.
And the circled digits in the 2700 block are treated differently from the
ones in the 2460 block.

Change EAW from N to A
Change LB from AL to AI

The affected characters are:
24C0..24CF  CIRCLED LATIN CAPITAL LETTER K..LETTER Z
24EA  CIRCLED DIGIT ZERO
2776..2793  DINGBAT NEGATIVE CIRCLED DIGIT ONE..CIRCLED SANS-SERIF NUMBER TEN

2.a.4 Fix

2140 DOUBLE-STRUCK N-ARY SUMMATION

This needs to become AL (and EAW N). [This and the preceding item may
already be fixed for 4.0.1; they look familiar.]

2.b Other changes worth considering
-----------------------------------

2.b.1 Change Arabic separators from AL to NU

066B;NU # ARABIC DECIMAL SEPARATOR
066C;NU # ARABIC THOUSANDS SEPARATOR

The effect of this change would be very slight, as the treatment of AL and
NU differs only with respect to some of the numerical punctuation, and there
only slightly. If these characters are always used surrounded by digits, it
makes no difference whether they are AL or NU; if they were ever at the end
of a number, then there could be some noticeable effects. However, setting
their linebreak class to NU would make it easier to collect the entire
run corresponding to a number, based on the regular expression

     PR ? ( OP | HY ) ? NU (NU | IS) * CL ?  PO ?

given in UAX#14.
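
A rough Python sketch of collecting such runs over a sequence of linebreak
classes. The encoding (class names joined by single spaces) and the name
number_runs are assumptions for this example; the index arithmetic relies on
every class name being exactly two letters.

```python
import re

# PR ? ( OP | HY ) ? NU (NU | IS) * CL ? PO ?
# written over class names separated by single spaces.
NUMBER_RUN = re.compile(
    r"(?:PR )?(?:(?:OP|HY) )?NU(?: (?:NU|IS))*(?: CL)?(?: PO)?"
)


def number_runs(classes):
    """Return (start, end) index pairs of number runs in a class list.

    start is inclusive, end is exclusive, both counted in characters
    (one class per character).
    """
    joined = " ".join(classes)
    # Each class occupies 3 columns ("XX ") except the last one.
    return [(m.start() // 3, (m.end() + 1) // 3)
            for m in NUMBER_RUN.finditer(joined)]


# Two digits, ARABIC DECIMAL SEPARATOR, one digit:
print(number_runs(["NU", "NU", "AL", "NU"]))  # separator as AL
print(number_runs(["NU", "NU", "NU", "NU"]))  # separator as NU
```

With the separator as AL the number is split into two runs; as NU it is
collected as a single run.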

2.b.2 Change Bullet from AL to IS

2022 BULLET

Making this change prevents breaks before a bullet, but allows a break
between a bullet and a following alphabetic character or symbol. It's not
clear what the practical impact of this is, since bullets most often don't
seem to occur in the middle of a text line. If they do occur, breaks before
a bullet could be jarring on a reader of East Asian text.

When bullets are used as list markers, they appear after a hard line
break. That use would be unaffected by this change.

2.b.3 Creating two new LB classes to better treat quotes

The ambiguous nature of quotation marks as to whether they are opening or
closing is not present when they are used in East Asian context. Our
current approach of preventing any breaks around these ambiguous quotes
would matter more in EA contexts, where such caution is not needed.

By introducing two new LB classes, QO and QC, such quotes can be resolved to
QU when they are narrow characters and to either OP or CL respectively when
used as wide characters.

QO: OP if wide, QU if narrow
2018;QO # LEFT SINGLE QUOTATION MARK
201C;QO # LEFT DOUBLE QUOTATION MARK
275B;QO # HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
275D;QO # HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT

QC: CL if wide, QU if narrow
201D;QC # RIGHT DOUBLE QUOTATION MARK
275C;QC # HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
275E;QC # HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT

Note: 2019 remains QU, due to its additional role as apostrophe.

Impact on existing implementations: by resolving QC -> QU and QO -> QU
unconditionally, existing implementations can continue to produce the same
results as before.
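
A sketch of the proposed resolution in Python. The function name and the
'legacy' flag are assumptions for illustration; the flag models an existing
implementation that resolves both new classes to QU unconditionally.

```python
def resolve_quote(lb_class, wide, legacy=False):
    """Resolve the proposed classes QO and QC.

    lb_class -- linebreak class of the character
    wide     -- True if the quote is used as a wide character
    legacy   -- True to resolve QO and QC to QU unconditionally
    Classes other than QO and QC are returned unchanged.
    """
    if lb_class not in ("QO", "QC"):
        return lb_class
    if legacy or not wide:
        return "QU"
    return "OP" if lb_class == "QO" else "CL"


print(resolve_quote("QO", wide=True))               # OP
print(resolve_quote("QC", wide=True))               # CL
print(resolve_quote("QO", wide=False))              # QU
print(resolve_quote("QC", wide=True, legacy=True))  # QU
```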

For the definitions of the existing LB classes, see,
http://www.unicode.org/reports/tr14/#QU
http://www.unicode.org/reports/tr14/#OP
http://www.unicode.org/reports/tr14/#CL

2.b.4 Change sub/superscript punctuation from OP/CL to AL

207D SUPERSCRIPT LEFT PARENTHESIS
207E SUPERSCRIPT RIGHT PARENTHESIS
208D SUBSCRIPT LEFT PARENTHESIS
208E SUBSCRIPT RIGHT PARENTHESIS

This will help keep the entire super/subscript expression together with the
anchoring digit or letter. Line-breaking within such an expression is
outside the scope of Unicode's default linebreaking algorithm.

We already don't give the other sub/superscript operators (+ and -) the same
classes as the regular math operators.

It's not an overwhelming issue, but then, it would not have a high
visibility impact on existing implementations. Its main effect will be to
eliminate a 'defect' in the eyes of a certain class of potential adopters.

2.b.4 Make some change for EM Dash

EM Dash is treated differently in Western and Eastern typography. The
current behavior is not ideal.

One possibility is to change EM-Dash from class B2 to class IN

2014 EM DASH

The differences between the classes are not big. B2 currently only has the
EM-Dash in it. It reflects the fact that EM-Dashes, at least in Western
typography, can occur both at the end and at the start of a line, even when
no space separates them from their neighboring character.

Class IN is defined so that it cannot break from a preceding word or number
(AL, ID, NU), unless there is a space.

However, making the change as proposed leads to incorrect line break rules
for the EM-Dash in Western typography. [The fact that a break after an
EM-Dash is preferred in some situations must be handled by a secondary
analysis of available break opportunities.]

Another possibility: B2, as currently defined, could be improved. If
evidence can be brought that East Asian texts do strictly require the
EM-Dash to remain on the line, then B2 could be redefined so as to not
break after ideographs.

B2 also currently completely disallows a break in any series of Em-dashes
alternating with spaces. That seems overly restrictive and could be
replaced by preventing only breaks between directly adjacent Em-dashes.

2.b.5 Change Hangul from ID to HG

Hangul needs to be tailored to work like AL or ID based on layout mode in
Korean. Currently we are assigning ID to all of them by default and suggest
that people override this as needed. The problem is that this pushes the
definition of which ranges are affected onto the implementations. A better
choice is to give the affected characters the class HG, which means an
implementation only needs to tailor what HG means.

Ranges affected:
1100-11FF   Jamo
3131-318F   compat jamo
3200-3212   circled/parenth hangul
3260-327F   circled/parenth hangul
AC00-D7A3   Hangul syllables

FFA0   HALFWIDTH HANGUL FILLER
FFA1..FFBE   HALFWIDTH HANGUL LETTER KIYEOK..HIEUH
FFC2..FFC7   HALFWIDTH HANGUL LETTER A..E
FFCA..FFCF   HALFWIDTH HANGUL LETTER YEO..OE
FFD2..FFD7   HALFWIDTH HANGUL LETTER YO..YU
FFDA..FFDC   HALFWIDTH HANGUL LETTER EU..I
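
A sketch, in Python, of how the split of responsibilities might look: the
ranges come from the list above, while the implementation supplies only the
meaning of HG. The names HG_RANGES and hangul_class, and the default of ID
(matching today's default assignment), are assumptions for this example.

```python
# Ranges proposed for class HG, copied from the list above.
HG_RANGES = [
    (0x1100, 0x11FF), (0x3131, 0x318F), (0x3200, 0x3212),
    (0x3260, 0x327F), (0xAC00, 0xD7A3), (0xFFA0, 0xFFA0),
    (0xFFA1, 0xFFBE), (0xFFC2, 0xFFC7), (0xFFCA, 0xFFCF),
    (0xFFD2, 0xFFD7), (0xFFDA, 0xFFDC),
]


def hangul_class(code_point, hg_means="ID"):
    """Return the resolved class for an HG character, or None otherwise.

    hg_means is the implementation's tailoring of HG: "ID" or "AL".
    """
    if hg_means not in ("ID", "AL"):
        raise ValueError("hg_means must be 'ID' or 'AL'")
    for first, last in HG_RANGES:
        if first <= code_point <= last:
            return hg_means
    return None


print(hangul_class(0xAC00))                 # ID
print(hangul_class(0xAC00, hg_means="AL"))  # AL
print(hangul_class(0x0041))                 # None
```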


2.b.6 Change halfwidth katakana from AL to ID (except SMALL)

FF66   HALFWIDTH KATAKANA LETTER WO
FF71..FF9D   HALFWIDTH KATAKANA LETTER A..N
FFE8   HALFWIDTH FORMS LIGHT VERTICAL
FFE9   HALFWIDTH LEFTWARDS ARROW
FFEA   HALFWIDTH UPWARDS ARROW
FFEB   HALFWIDTH RIGHTWARDS ARROW
FFEC   HALFWIDTH DOWNWARDS ARROW
FFED   HALFWIDTH BLACK SQUARE
FFEE   HALFWIDTH WHITE CIRCLE

2.b.7 LB class changes based on different EAW

Change many letters from AI to AL

This change would remove AI status from all characters in these blocks,
except as noted, plus the list of characters at the end. Of these changes,
Unicode might want to pick up the change in treatment of the alphabetic
characters. If these are treated as 'narrow' by default in modern systems,
moving them to AL from AI would streamline implementations. For the
symbols, esp. compatibility symbols like Box drawings, the case is less
clear cut. My opinion is that box drawings don't matter, but if there is a
use-scenario where they cause a problem as is, we could equally well change
them.

The proposed changes for the other symbols seem random. I'm especially
concerned with

List of affected blocks and characters.

Latin-1
   except:
   00D7  MULTIPLICATION SIGN
   00F7  DIVISION SIGN
Latin Extended-A
Latin Extended-B
IPA Extensions
Modifier Letters
   except:
   02C9 MODIFIER LETTER MACRON
   02CA MODIFIER LETTER ACUTE ACCENT
   02CB MODIFIER LETTER GRAVE ACCENT
   02CD MODIFIER LETTER LOW MACRON
   02D8..02DB BREVE..OGONEK
   02DD DOUBLE ACUTE ACCENT

Greek and Coptic
General Punctuation
   except:
   2015 HORIZONTAL BAR
   2020 DAGGER
   2021 DOUBLE DAGGER
   203B REFERENCE MARK
Superscripts & Subscripts
Letterlike Symbols
   Comments:
   That 2140 DOUBLE-STRUCK N-ARY SUMMATION was ever AI is a mistake in our
   data file
Number Forms
   except:
   2160..216B ROMAN NUMERAL ONE..TWELVE
   2170..2179 SMALL ROMAN NUMERAL ONE..TEN

Block Elements
Geometric Shapes
   except:
   25A0   BLACK SQUARE
   25A1   WHITE SQUARE
   25C6   BLACK DIAMOND
   25C7   WHITE DIAMOND
   25CB   WHITE CIRCLE
   25CE   BULLSEYE
   25CF   BLACK CIRCLE
   25EF   LARGE CIRCLE

Further map AI -> AL

Some of the arrows
2194  LEFT RIGHT ARROW
2195  UP DOWN ARROW
2196  NORTH WEST ARROW
2197  NORTH EAST ARROW
2198  SOUTH EAST ARROW
2199  SOUTH WEST ARROW

some of the math symbols
220F  N-ARY PRODUCT
2215  DIVISION SLASH
2223  DIVIDES
2236  RATIO
2237  PROPORTION
223C  TILDE OPERATOR
2248  ALMOST EQUAL TO
224C  ALL EQUAL TO
2264  LESS-THAN OR EQUAL TO
2265  GREATER-THAN OR EQUAL TO
226E  NOT LESS-THAN
226F  NOT GREATER-THAN
2295  CIRCLED PLUS
2299  CIRCLED DOT OPERATOR

some of the symbols
2616  WHITE SHOGI PIECE
2617  BLACK SHOGI PIECE
2660  BLACK SPADE SUIT
2661  WHITE HEART SUIT
2663  BLACK CLUB SUIT
2664  WHITE SPADE SUIT
2665  BLACK HEART SUIT
2667  WHITE CLUB SUIT
2669  QUARTER NOTE
266C  BEAMED SIXTEENTH NOTES

as well as a collection of box drawings:

2504  BOX DRAWINGS LIGHT TRIPLE DASH HORIZONTAL
2505  BOX DRAWINGS HEAVY TRIPLE DASH HORIZONTAL
2506  BOX DRAWINGS LIGHT TRIPLE DASH VERTICAL
2507  BOX DRAWINGS HEAVY TRIPLE DASH VERTICAL
2508  BOX DRAWINGS LIGHT QUADRUPLE DASH HORIZONTAL
2509  BOX DRAWINGS HEAVY QUADRUPLE DASH HORIZONTAL
250A  BOX DRAWINGS LIGHT QUADRUPLE DASH VERTICAL
250B  BOX DRAWINGS HEAVY QUADRUPLE DASH VERTICAL
250D  BOX DRAWINGS DOWN LIGHT AND RIGHT HEAVY
250E  BOX DRAWINGS DOWN HEAVY AND RIGHT LIGHT
2511  BOX DRAWINGS DOWN LIGHT AND LEFT HEAVY
2512  BOX DRAWINGS DOWN HEAVY AND LEFT LIGHT
2515  BOX DRAWINGS UP LIGHT AND RIGHT HEAVY
2516  BOX DRAWINGS UP HEAVY AND RIGHT LIGHT
2519  BOX DRAWINGS UP LIGHT AND LEFT HEAVY
251A  BOX DRAWINGS UP HEAVY AND LEFT LIGHT
251E  BOX DRAWINGS UP HEAVY AND RIGHT DOWN LIGHT
251F  BOX DRAWINGS DOWN HEAVY AND RIGHT UP LIGHT
2521  BOX DRAWINGS DOWN LIGHT AND RIGHT UP HEAVY
2522  BOX DRAWINGS UP LIGHT AND RIGHT DOWN HEAVY
2526  BOX DRAWINGS UP HEAVY AND LEFT DOWN LIGHT
2527  BOX DRAWINGS DOWN HEAVY AND LEFT UP LIGHT
2529  BOX DRAWINGS DOWN LIGHT AND LEFT UP HEAVY
252A  BOX DRAWINGS UP LIGHT AND LEFT DOWN HEAVY
252D  BOX DRAWINGS LEFT HEAVY AND RIGHT DOWN LIGHT
252E  BOX DRAWINGS RIGHT HEAVY AND LEFT DOWN LIGHT
2531  BOX DRAWINGS RIGHT LIGHT AND LEFT DOWN HEAVY
2532  BOX DRAWINGS LEFT LIGHT AND RIGHT DOWN HEAVY
2535  BOX DRAWINGS LEFT HEAVY AND RIGHT UP LIGHT
2536  BOX DRAWINGS RIGHT HEAVY AND LEFT UP LIGHT
2539  BOX DRAWINGS RIGHT LIGHT AND LEFT UP HEAVY
253A  BOX DRAWINGS LEFT LIGHT AND RIGHT UP HEAVY
253D  BOX DRAWINGS LEFT HEAVY AND RIGHT VERTICAL LIGHT
253E  BOX DRAWINGS RIGHT HEAVY AND LEFT VERTICAL LIGHT
2540..254A  BOX DRAWINGS UP HEAVY AND DOWN HORIZONTAL LIGHT..LEFT LIGHT AND
RIGHT VERTICAL HEAVY
2550..2574  BOX DRAWINGS DOUBLE HORIZONTAL..LIGHT LEFT

Change symbols from AL to AI

Zodiac
2641  EARTH
2643  JUPITER
2644  SATURN
2645  URANUS
2646  NEPTUNE
2647  PLUTO
2648  ARIES
2649  TAURUS
264A  GEMINI
264B  CANCER
264C  LEO
264D  VIRGO
264E  LIBRA
264F  SCORPIUS
2650  SAGITTARIUS
2651  CAPRICORN
2652  AQUARIUS
2653  PISCES

Chess
2654  WHITE CHESS KING
2655  WHITE CHESS QUEEN
2656  WHITE CHESS ROOK
2657  WHITE CHESS BISHOP
2658  WHITE CHESS KNIGHT
2659  WHITE CHESS PAWN
265A  BLACK CHESS KING
265B  BLACK CHESS QUEEN
265C  BLACK CHESS ROOK
265D  BLACK CHESS BISHOP
265E  BLACK CHESS KNIGHT
265F  BLACK CHESS PAWN

Dingbats
The entire 2700 block

2.c Changes that are not recommended
------------------------------------

2.c.1 Change class BK, CR, LF, and NL to class BA

000A LF
000C FF
000D CR
2028 LS
2029 PS

This would result in a CR or LF no longer forcing a line break, and in a
break being disallowed in some contexts, e.g. in front of a closing
parenthesis (CL). This must be understood as an implementation-internal hack.

2.c.2 Change class HY to BA

002D HYPHEN-MINUS

There is a one-rule difference between HY and BA, and that is

   HY x NU

which prevents '-3' from breaking. This was considered at UTC, but it is a
legitimate tailoring.

2.c.3 Change class BB to BA for todo hyphen

1806 MONGOLIAN TODO SOFT HYPHEN

This character is described in section 12.2 of Unicode 4.0 as going onto
the next line. It is incorrect to make it a BA.

2.c.4 Change class BB to AL for some marks

UAX#14 captures the use of these marks in dictionaries and similar
instances. There have been no complaints about that.

02C8  MODIFIER LETTER VERTICAL LINE
02CC  MODIFIER LETTER LOW VERTICAL LINE

2.c.5 Change characters to class NS

BA -> NS

2027 HYPHENATION POINT

CL -> NS

3002 IDEOGRAPHIC FULL STOP
FF0E FULLWIDTH FULL STOP
FE52 SMALL FULL STOP
FF61 HALFWIDTH IDEOGRAPHIC FULL STOP

EX -> NS

FE56;NS # SMALL QUESTION MARK
FE57;NS # SMALL EXCLAMATION MARK
FF01;NS # FULLWIDTH EXCLAMATION MARK
FF1F;NS # FULLWIDTH QUESTION MARK

2.c.6 Change small comma
ID -> CL
FE51 SMALL IDEOGRAPHIC COMMA


2.d Tailorings that must remain private
---------------------------------------
Any tailoring of surrogate code points or private use characters must
remain outside the scope of the defaults established by UTC. The same goes
for tailoring FFFC from contingent break (CB) to any of the other LB classes.