L2/03-419

From: Asmus Freytag
To: UTC
Re: Linebreak and EastAsianWidth

Someone pointed out that, embarrassingly, our linebreaking rules allow a break inside "e.g.", and all similar 'words' that use periods. To fix that, I propose adding a rule 19b

    IS x AL

and making the corresponding change in the pair tables by setting the intersection of the IS row and AL column from '_' to '%'.

The characters in IS are all punctuation that can appear in numeric expressions, such as period, comma, colon, etc., none of which should separate from a following letter or symbol character. See

http://www.unicode.org/reports/tr14/#IS
http://www.unicode.org/reports/tr14/#AL

for the definitions of the character classes, or look at LineBreak.txt in the UCD.

Michel provided a list of differences that MS applies to the line break properties. I have been analyzing these to see if some of their customizations really should be part of our published default instead. Here's my conclusion:

1) The largest difference results from a decision by MS to reduce the class of characters which are given ambiguous or A in East Asian Width. See EastAsianWidth.txt in the UCD for a list, or see http://www.unicode.org/reports/tr11/ for a definition and additional information.

In line breaking, one of the linebreak classes, AI, is ambiguous and its resolution depends on EAW. AI gets resolved to normal alphabetic/symbol (AL) whenever the EAW class of the given character gets resolved from A to N, and to ideograph (ID) when the EAW class gets resolved from A to W.

The EAW class assignments were done with a particular legacy environment in mind, dated about 5-7 years ago, and were focused primarily on issues of transcoding to legacy characters. In the meantime, the web and evolving practices in font technology have brought forth a whole new set of issues relating to font binding and display of text created on a variety of systems and communicated both via Unicode and legacy character sets.
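
The dependency of AI on EAW described above can be illustrated with a minimal sketch. This is not code from UAX#14 or UAX#11: the function names and the `east_asian_context` flag are assumptions made for illustration, and property values are passed in as plain strings rather than looked up in LineBreak.txt or EastAsianWidth.txt.

```python
def resolve_eaw(eaw: str, east_asian_context: bool) -> str:
    """Resolve the ambiguous width A to W in an East Asian context, else to N.

    All other East Asian Width values are returned unchanged.
    """
    if eaw == "A":
        return "W" if east_asian_context else "N"
    return eaw


def resolve_lb(lb: str, eaw: str, east_asian_context: bool) -> str:
    """Resolve line break class AI based on the resolved East Asian Width.

    AI becomes ID when the width resolves to W, and AL otherwise.
    Classes other than AI are returned unchanged.
    """
    if lb != "AI":
        return lb
    return "ID" if resolve_eaw(eaw, east_asian_context) == "W" else "AL"


if __name__ == "__main__":
    # A character that is AI / A behaves like an ideograph in East Asian text...
    print(resolve_lb("AI", "A", east_asian_context=True))
    # ...and like an ordinary alphabetic or symbol character elsewhere.
    print(resolve_lb("AI", "A", east_asian_context=False))
```

Under this sketch, any change to which characters carry EAW class A directly changes which characters can ever resolve to ID, which is the coupling the options below are concerned with.
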
Evolving practice uses more proportional fonts, and tends to display Latin, Greek and Cyrillic characters (even in the context of Japanese documents) as narrow characters (the way they would appear in English, Greek or Russian). Treating such alphabetic characters as 'ambiguous' doesn't serve the same purpose as it did in the past and can lead to mistakes in font bindings, causing a ransom note effect in web documents.

At the same time, it makes sense to consistently treat picture-like symbols like ideographs when in East Asian texts, so treating them as ambiguous for the purpose of line breaking, and also font selection etc., will improve the user experience. Differentiating between symbols that are in specific legacy sets and those that are not does not make sense from this perspective.

There are three options:

1) change the EAW assignments
2) change the LB assignments (decouple from EAW)
3) stabilize the EAW assignments and add a new property

The first option would be a major instability. Implementations for which the current EAW assignments produce acceptable results would fail with the new set and vice versa.

The second option would improve the line breaking behavior, at the expense of limiting the applicability of the revised classification to line breaking.

The third choice maintains the current structure of an underlying EAW-like classification that other specifications (like LB) can build upon. It would require defining a new property. The classes themselves, not only their division of the code space, would likely be different and focused on their different purpose.

Both two and three are reasonable choices.

2.a Recommended changes
-----------------------

2.a.1 Change all double wide combining marks from CM to GL

035D COMBINING DOUBLE BREVE
035E
035F
0360
0361
0362 COMBINING DOUBLE RIGHTWARDS ARROW BELOW

Double combining marks graphically apply to both the preceding and the following character.
Making their linebreak class GL prevents breaks if they are applied to non-alphabetic characters. This is a reasonable approach to this particular edge case. Practical occurrence of this situation is low; it should not materially affect existing implementations.

2.a.2 Make canonical equivalents equal

037E GREEK QUESTION MARK -> treat like ';', change AL->IS
2126 OHM SIGN -> treat like Omega, change PO->AL

2.a.3 Treat circled letters and digits consistently

This is an oversight in our 4.0.0 data file. There's a range of circled letters and digit 0 that's different from the rest of the set. And the circled digits in the 2700 block are treated differently from the ones in the 2460 block.

Change EAW from N to A
Change LB from AL to AI

These are an oversight in our file:

24C0..24CF CIRCLED LATIN CAPITAL LETTER K..LETTER Z
24EA CIRCLED DIGIT ZERO
2776..2793 DINGBAT NEGATIVE CIRCLED DIGIT ONE..CIRCLED SANS-SERIF NUMBER TEN

2.a.4 Fix 2140 DOUBLE-STRUCK N-ARY SUMMATION

This needs to become AL (and EAW N). [If this and the preceding aren't already fixed for 4.0.1 - they look familiar.]

2.b Other changes worth considering
-----------------------------------

2.b.1 Change Arabic separators from AL to NU

066B;NU # ARABIC DECIMAL SEPARATOR
066C;NU # ARABIC THOUSANDS SEPARATOR

The effect of this change would be very slight, as the treatment of AL and NU differs only with respect to some of the numerical punctuation, and there only slightly. If these characters are always used surrounded by digits, it makes no difference whether they are AL or NU; if they were ever at the end of a number, then there could be some noticeable effects. However, setting their linebreak class to NU would make it easier to collect the entire run corresponding to a number, based on the regular expression

    PR ? ( OP | HY ) ? NU (NU | IS) * CL ? PO ?
given in UAX#14.

2.b.2 Change Bullet from AL to IS

2022 BULLET

Making this change prevents breaks before a bullet, but allows a break between a bullet and a following alphabetic character or symbol. It's not clear what the practical impact of this is, since bullets most often don't seem to occur in the middle of a text line. If they do occur, breaks before a bullet could be jarring to a reader of East Asian text. When they are used as a bullet marker, they would appear after a hard line break. That use would be unaffected by this change.

2.b.3 Creating two new LB classes to better treat quotes

The ambiguous nature of quotation marks as to whether they are opening or closing is not present when they are used in East Asian context. Our current approach of preventing any breaks around these ambiguous quotes would matter more in EA contexts, where such caution is not needed. By introducing two new LB classes, QO and QC, such quotes can be resolved to QU when they are narrow characters and either OP or CL respectively when used as wide characters.

QO: OP if wide, QU if narrow

2018;QO # LEFT SINGLE QUOTATION MARK
201C;QO # LEFT DOUBLE QUOTATION MARK
275B;QO # HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
275D;QO # HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT

QC: CL if wide, QU if narrow

201D;QC # RIGHT DOUBLE QUOTATION MARK
275C;QC # HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
275E;QC # HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT

Note: 2019 remains QU, due to its additional role as apostrophe.

Impact on existing implementations: by resolving QC -> QU and QO -> QU unconditionally, existing implementations can continue to produce the same results as before.
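
The resolution of the proposed QO and QC classes could look roughly like the following sketch. The function name, the `is_wide` argument and the `legacy` flag are hypothetical, introduced only to illustrate the two behaviors described above; how an implementation decides that a quote is "wide" is left open here.

```python
# Proposed classes from 2.b.3 and the class each one takes when the quote is wide.
WIDE_RESOLUTION = {"QO": "OP", "QC": "CL"}


def resolve_quote_class(lb: str, is_wide: bool, legacy: bool = False) -> str:
    """Resolve the proposed QO/QC classes to an existing line break class.

    lb      -- line break class of the character, e.g. "QO", "QC", "QU", "AL"
    is_wide -- True if the quote is used as a wide (East Asian) character
    legacy  -- True to resolve QO/QC to QU unconditionally, which keeps the
               behavior of an implementation that predates the new classes
    """
    if lb not in WIDE_RESOLUTION:
        return lb  # not one of the new classes: nothing to resolve
    if legacy or not is_wide:
        return "QU"
    return WIDE_RESOLUTION[lb]


if __name__ == "__main__":
    print(resolve_quote_class("QO", is_wide=True))               # opening quote, wide
    print(resolve_quote_class("QC", is_wide=False))              # closing quote, narrow
    print(resolve_quote_class("QC", is_wide=True, legacy=True))  # unconditional QU
```

Keeping the unconditional QU path as an explicit switch mirrors the impact statement above: an implementation can accept the new data without changing its output.
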
For the definitions of the existing LB classes, see

http://www.unicode.org/reports/tr14/#QU
http://www.unicode.org/reports/tr14/#OP
http://www.unicode.org/reports/tr14/#CL

2.b.4 Change sub/superscript punctuation from OP/CL to AL

207D SUPERSCRIPT LEFT PARENTHESIS
207E SUPERSCRIPT RIGHT PARENTHESIS
208D SUBSCRIPT LEFT PARENTHESIS
208E SUBSCRIPT RIGHT PARENTHESIS

This will help keep the entire super/subscript expression together with the anchoring digit or letter. Line-breaking within such an expression is outside the scope of Unicode's default linebreaking algorithm. We already don't give the other operators (+ and -) the same classes as for regular math operators. It's not an overwhelming issue, but then, it would not have a high visibility impact on existing implementations. Its main effect will be to eliminate a 'defect' in the eyes of a certain class of potential adopters.

2.b.4 Make some change for EM Dash

EM Dash is treated differently in Western and Eastern typography. The current behavior is not ideal. One possibility is to change it from class B2 to class IN for EM-Dash:

2014 EM DASH

The differences between the classes are not big. B2 currently only has the EM-Dash in it. It reflects the fact that EM-Dashes, at least in Western typography, can occur both at the end and at the start of a line, even when no space separates them from their neighboring character. Class IN is defined so that it cannot break from a preceding word or number (AL, ID, NU), unless there is a space. However, making the change as proposed leads to incorrect line break rules for the EM-Dash in Western typography. [The fact that a break after an EM-Dash is preferred in some situations must be handled by a secondary analysis of available break opportunities.]

Another possibility: B2, as currently defined, could be improved. If evidence can be brought that East Asian texts do strictly require the EM-Dash to remain on the line, then B2 could be redefined so as to not break after ideographs.
B2 also currently completely disallows a break in any series of Em-dashes alternating with spaces. That seems overly restrictive and could be replaced by preventing only breaks between directly adjacent Em-dashes.

2.b.5 Change Hangul from ID to HG

Hangul needs to be tailored to work like AL or ID based on layout mode in Korean. Currently we are assigning ID to all of them by default and suggest that people override this as needed. The problem is that this pushes the definition of which ranges are affected onto the implementations. A better choice is to give the affected characters the class HG, which means an implementation only needs to tailor what HG means.

Ranges affected:

1100-11FF     Jamo
3131-318F     compat jamo
3200-3212     circled/parenth hangul
3260-327F     circled/parenth hangul
AC00-D7A3     Hangul syllables
FFA0          HALFWIDTH HANGUL FILLER
FFA1..FFBE    HALFWIDTH HANGUL LETTER KIYEOK..HIEUH
FFC2..FFC7    HALFWIDTH HANGUL LETTER A..E
FFCA..FFCF    HALFWIDTH HANGUL LETTER YEO..OE
FFD2..FFD7    HALFWIDTH HANGUL LETTER YO..YU
FFDA..FFDC    HALFWIDTH HANGUL LETTER EU..I

2.b.6 Change halfwidth katakana from AL to ID (except SMALL)

FF66          HALFWIDTH KATAKANA LETTER WO
FF71..FF9D    HALFWIDTH KATAKANA LETTER A..N
FFE8          HALFWIDTH FORMS LIGHT VERTICAL
FFE9          HALFWIDTH LEFTWARDS ARROW
FFEA          HALFWIDTH UPWARDS ARROW
FFEB          HALFWIDTH RIGHTWARDS ARROW
FFEC          HALFWIDTH DOWNWARDS ARROW
FFED          HALFWIDTH BLACK SQUARE
FFEE          HALFWIDTH WHITE CIRCLE

2.b.7 LB class changes based on different EAW

Change many letters from AI to AL

This change would remove AI status from all characters in these blocks, except as noted, plus the list of characters at the end.

Of these changes, Unicode might want to pick up the change in treatment of the alphabetic characters. If these are treated as 'narrow' by default in modern systems, moving them to AL from AI would streamline implementations. For the symbols, esp. compatibility symbols like Box drawings, the case is less clear cut.
My opinion is that box drawings don't matter, but if there is a use-scenario where they cause a problem as is, we could equally well change them. The proposed changes for the other symbols seem random. I'm especially concerned with

List of affected blocks and characters.

Latin-1
  except:
    00D7 MULTIPLICATION SIGN
    00F7 DIVISION SIGN
Latin Extended-A
Latin Extended-B
IPA Extensions
Modifier Letters
  except:
    02C9 MODIFIER LETTER MACRON
    02CA MODIFIER LETTER ACUTE ACCENT
    02CB MODIFIER LETTER GRAVE ACCENT
    02CD MODIFIER LETTER LOW MACRON
    02D8..02DB BREVE..OGONEK
    02DD DOUBLE ACUTE ACCENT
Greek and Coptic
General Punctuation
  except:
    2015 HORIZONTAL BAR
    2020 DAGGER
    2021 DOUBLE DAGGER
    203B REFERENCE MARK
Superscripts & Subscripts
Letterlike Symbols
  Comments: That 2140 DOUBLE-STRUCK N-ARY SUMMATION was ever AI is a mistake in our data file
Number Forms
  except:
    2160..216B ROMAN NUMERAL ONE..TWELVE
    2170..2179 SMALL ROMAN NUMERAL ONE..TEN
Block Elements
Geometric Shapes
  except:
    25A0 BLACK SQUARE
    25A1 WHITE SQUARE
    25C6 BLACK DIAMOND
    25C7 WHITE DIAMOND
    25CB WHITE CIRCLE
    25CE BULLSEYE
    25CF BLACK CIRCLE
    25EF LARGE CIRCLE

Further map AI -> AL

Some of the arrows
2194 LEFT RIGHT ARROW
2195 UP DOWN ARROW
2196 NORTH WEST ARROW
2197 NORTH EAST ARROW
2198 SOUTH EAST ARROW
2199 SOUTH WEST ARROW

some of the math symbols
220F N-ARY PRODUCT
2215 DIVISION SLASH
2223 DIVIDES
2236 RATIO
2237 PROPORTION
223C TILDE OPERATOR
2248 ALMOST EQUAL TO
224C ALL EQUAL TO
2264 LESS-THAN OR EQUAL TO
2265 GREATER-THAN OR EQUAL TO
226E NOT LESS-THAN
226F NOT GREATER-THAN
2295 CIRCLED PLUS
2299 CIRCLED DOT OPERATOR

some of the symbols
2616 WHITE SHOGI PIECE
2617 BLACK SHOGI PIECE
2660 BLACK SPADE SUIT
2661 WHITE HEART SUIT
2663 BLACK CLUB SUIT
2664 WHITE SPADE SUIT
2665 BLACK HEART SUIT
2667 WHITE CLUB SUIT
2669 QUARTER NOTE
266C BEAMED SIXTEENTH NOTES

as well as a collection of box drawings:
2504 BOX DRAWINGS LIGHT TRIPLE DASH HORIZONTAL
2505 BOX DRAWINGS HEAVY TRIPLE DASH HORIZONTAL
2506 BOX DRAWINGS LIGHT TRIPLE
DASH VERTICAL
2507 BOX DRAWINGS HEAVY TRIPLE DASH VERTICAL
2508 BOX DRAWINGS LIGHT QUADRUPLE DASH HORIZONTAL
2509 BOX DRAWINGS HEAVY QUADRUPLE DASH HORIZONTAL
250A BOX DRAWINGS LIGHT QUADRUPLE DASH VERTICAL
250B BOX DRAWINGS HEAVY QUADRUPLE DASH VERTICAL
250D BOX DRAWINGS DOWN LIGHT AND RIGHT HEAVY
250E BOX DRAWINGS DOWN HEAVY AND RIGHT LIGHT
2511 BOX DRAWINGS DOWN LIGHT AND LEFT HEAVY
2512 BOX DRAWINGS DOWN HEAVY AND LEFT LIGHT
2515 BOX DRAWINGS UP LIGHT AND RIGHT HEAVY
2516 BOX DRAWINGS UP HEAVY AND RIGHT LIGHT
2519 BOX DRAWINGS UP LIGHT AND LEFT HEAVY
251A BOX DRAWINGS UP HEAVY AND LEFT LIGHT
251E BOX DRAWINGS UP HEAVY AND RIGHT DOWN LIGHT
251F BOX DRAWINGS DOWN HEAVY AND RIGHT UP LIGHT
2521 BOX DRAWINGS DOWN LIGHT AND RIGHT UP HEAVY
2522 BOX DRAWINGS UP LIGHT AND RIGHT DOWN HEAVY
2526 BOX DRAWINGS UP HEAVY AND LEFT DOWN LIGHT
2527 BOX DRAWINGS DOWN HEAVY AND LEFT UP LIGHT
2529 BOX DRAWINGS DOWN LIGHT AND LEFT UP HEAVY
252A BOX DRAWINGS UP LIGHT AND LEFT DOWN HEAVY
252D BOX DRAWINGS LEFT HEAVY AND RIGHT DOWN LIGHT
252E BOX DRAWINGS RIGHT HEAVY AND LEFT DOWN LIGHT
2531 BOX DRAWINGS RIGHT LIGHT AND LEFT DOWN HEAVY
2532 BOX DRAWINGS LEFT LIGHT AND RIGHT DOWN HEAVY
2535 BOX DRAWINGS LEFT HEAVY AND RIGHT UP LIGHT
2536 BOX DRAWINGS RIGHT HEAVY AND LEFT UP LIGHT
2539 BOX DRAWINGS RIGHT LIGHT AND LEFT UP HEAVY
253A BOX DRAWINGS LEFT LIGHT AND RIGHT UP HEAVY
253D BOX DRAWINGS LEFT HEAVY AND RIGHT VERTICAL LIGHT
253E BOX DRAWINGS RIGHT HEAVY AND LEFT VERTICAL LIGHT
2540..254A BOX DRAWINGS UP HEAVY AND DOWN HORIZONTAL LIGHT..LEFT LIGHT AND RIGHT VERTICAL HEAVY
2550..2574 BOX DRAWINGS DOUBLE HORIZONTAL..LIGHT LEFT

Change symbols from AL to AI

Zodiac
2641 EARTH
2643 JUPITER
2644 SATURN
2645 URANUS
2646 NEPTUNE
2647 PLUTO
2648 ARIES
2649 TAURUS
264A GEMINI
264B CANCER
264C LEO
264D VIRGO
264E LIBRA
264F SCORPIUS
2650 SAGITTARIUS
2651 CAPRICORN
2652 AQUARIUS
2653 PISCES

Chess
2654 WHITE CHESS KING
2655 WHITE CHESS QUEEN
2656 WHITE CHESS ROOK
2657 WHITE CHESS BISHOP
2658 WHITE
CHESS KNIGHT
2659 WHITE CHESS PAWN
265A BLACK CHESS KING
265B BLACK CHESS QUEEN
265C BLACK CHESS ROOK
265D BLACK CHESS BISHOP
265E BLACK CHESS KNIGHT
265F BLACK CHESS PAWN

Dingbats
The entire 2700 block

2.c Changes that are not recommended
------------------------------------

2.c.1 Change class BK, CR, LF, and NL to class BA

000A LF
000C FF
000D CR
2028 LS
2029 PS

This would result in CR and LF no longer forcing a line break, while disallowing a line break in some contexts, e.g. in front of a closing parenthesis (CL). This must be understood as an implementation-internal hack.

2.c.2 Change class HY to BA

002D HYPHEN-MINUS

There is a one-rule difference between HY and BA, and that is HY x NU, which prevents '-3' from breaking. This was considered at UTC, but it is a legitimate tailoring.

2.c.3 Change class BB to BA for todo hyphen

1806 MONGOLIAN TODO SOFT HYPHEN

This character is described in section 12.2 of Unicode 4.0 as going onto the next line. It is incorrect to make it a BA.

2.c.4 Change class BB to AL for some marks

UAX#14 captures the use of these marks in dictionaries and similar instances. There have been no complaints about that.

02C8 MODIFIER LETTER VERTICAL LINE
02CC MODIFIER LETTER LOW VERTICAL LINE

2.c.5 Change characters to class NS

BA -> NS
2027 HYPHENATION POINT

CL -> NS
3002 IDEOGRAPHIC FULL STOP
FF0E FULLWIDTH FULL STOP
FE52 SMALL FULL STOP
FF61 HALFWIDTH IDEOGRAPHIC FULL STOP

EX -> NS
FE56;NS # SMALL QUESTION MARK
FE57;NS # SMALL EXCLAMATION MARK
FF01;NS # FULLWIDTH EXCLAMATION MARK
FF1F;NS # FULLWIDTH QUESTION MARK

2.c.6 Change small comma ID -> CL

FE51 SMALL IDEOGRAPHIC COMMA

2.d Tailorings that must remain private
---------------------------------------

Any tailoring of surrogate code points or private use characters must remain outside the scope of the defaults established by UTC. The same goes for tailoring FFFC from contingent break (CB) to any of the other LB classes.