L2/06-224

Review line-break Feedback
==========================

Date: 2006-06-02

At the last meeting I got the action to review the linebreak feedback from
document L2/06-202. As a result I submit this document for the UTC agenda.
Below is my take, marked with ***.

The document does make a case for verifying the LB assignments for the
various commas, periods, and semicolons. UTC members who have expertise in
Armenian, Arabic, Nko, Syriac, Ethiopic, Canadian Syllabics, Mongolian and
Coptic should take a look at the issues listed below and help verify the
current assignments.

The document proposes giving two control characters specific semantics.
This should be reviewed by UTC. See below.

A./

-----------------------------------------------------------------------------------

0E2F and 0EAF should both have the BA, break after, linebreak property.

*** They currently have SA; see the other SA-related comments below. It's
not clear what this buys us. These are not classed as punctuation, but as
letters in Unicode.

---

1A1F;AL # BUGINESE END OF SECTION

It seems strange that an "end of section" has lb prop AL (but I don't know
for sure that it is wrong) instead of BA.

*** This comment seems to be based solely on the name of the character, and
not informed by actual evidence of usage of this mark in Buginese.

There are many punctuation marks that occur at the end of text segments,
but that are conventionally always followed by a new paragraph. In that
case, the line break comes from the paragraph break.

Some punctuation can be used at the end of a text segment, where a line
break would be appropriate, but also elsewhere. In such cases, we give the
mark the AL property and suggest the use of ZWSP if a break opportunity is
desired.

If actual evidence is found that suggests that Buginese writers expect a
line breaking opportunity at this mark, a change may be appropriate.

---

Does Tagalog/Hanunoo/Buhid use space between words?
No, I still very much dislike SA (as any different from AL), but again, I
like consistency.

*** I don't know the answer to the question and I can't make out what the
feedback is.

---

None of the combining characters should have gotten the SA property.

*** That's a discussion of the model. The current model is to create large
runs of SA characters, and to pass these off to another algorithm (whose
details are not specified) for analysis. This model specifically excludes
the idea that the default algorithm is able to do some minimal processing
of SA scripts. A properly designed algorithm for SA should be able to
handle these characters as if they had property CM, especially when applied
to non-SA characters.

---

These should have lb EX:

203C;NS # DOUBLE EXCLAMATION MARK
203D;NS # INTERROBANG
2047;NS # DOUBLE QUESTION MARK
2048;NS # QUESTION EXCLAMATION MARK
2049;NS # EXCLAMATION QUESTION MARK

On the other hand NS and EX are very similar, and maybe should be merged.

*** The difference between NS and EX is that a break before an NS is
allowed where there is a space. As far as I recall, the differentiation
between NS and EX, and the difference in treatment of double punctuation,
go back to the JIS X4051 standard. There's definitely no need to merge
these two classes.

---

These should be OP:

00A1;AI # INVERTED EXCLAMATION MARK
00BF;AI # INVERTED QUESTION MARK

*** The class AI is intended to support legacy behavior for a limited set
of characters that appear as 'full width' characters in East Asian
character sets. For all of these the legacy behavior is ID. However, the
non-legacy behavior for 00A1 and 00BF is better represented as OP.
Currently UAX#14 suggests a tailoring. If legacy support for these
characters is deemed not useful, they can be moved to OP in a future
version.

In practice, if these characters are followed by letters, then AL and OP
are effectively indistinguishable. I believe that's the usage in Spanish.
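The tailoring mentioned for 00A1 and 00BF can be pictured roughly as follows. This is only a minimal sketch: the table names, the function `resolve_class`, and the `east_asian_context` flag are hypothetical, and it assumes that an AI character resolves to ID in a legacy East Asian context and to AL otherwise.

```python
# Classes for the two characters as listed in the feedback (AI).
DEFAULT_LB = {0x00A1: "AI", 0x00BF: "AI"}

# Hypothetical tailoring table: treat the inverted marks as opening punctuation.
SPANISH_TAILORING = {0x00A1: "OP", 0x00BF: "OP"}


def resolve_class(cp, east_asian_context, tailoring=None):
    """Return the effective line-break class for one of the two code points.

    A tailoring entry, if present, overrides the default class. An
    unresolved AI is mapped to ID in an East Asian (legacy) context and
    to AL otherwise; that mapping is an assumption of this sketch.
    """
    lb = (tailoring or {}).get(cp, DEFAULT_LB[cp])
    if lb == "AI":
        return "ID" if east_asian_context else "AL"
    return lb


if __name__ == "__main__":
    for cp in (0x00A1, 0x00BF):
        print(f"{cp:04X}",
              resolve_class(cp, east_asian_context=True),
              resolve_class(cp, east_asian_context=False),
              resolve_class(cp, east_asian_context=False,
                            tailoring=SPANISH_TAILORING))
```

Moving the two characters to OP in a future version would amount to making the tailored table the default.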
---

Not sure why these have lb EX, instead of IS or PO:

060C;EX # ARABIC COMMA
061B;EX # ARABIC SEMICOLON
061E;EX # ARABIC TRIPLE DOT PUNCTUATION MARK
066A;EX # ARABIC PERCENT SIGN
06D4;EX # ARABIC FULL STOP

*** EX is for sentence ending; IS and PO are for numeric punctuation. The
percent sign looks like an oversight - at least I can't recall a rationale
for not treating it as part of a numeric expression. If no objections are
raised, we can consider this for a future version. But see below.

---

Commas in general have a strange mixture of lb property settings:

002C;IS # COMMA
055D;AL # ARMENIAN COMMA
060C;EX # ARABIC COMMA
07F8;IS # NKO COMMA
1363;AL # ETHIOPIC COMMA
1802;BA # MONGOLIAN COMMA
1808;BA # MONGOLIAN MANCHU COMMA
3001;CL # IDEOGRAPHIC COMMA
FE10;IS # PRESENTATION FORM FOR VERTICAL COMMA
FE11;CL # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
FE50;CL # SMALL COMMA
FE51;ID # SMALL IDEOGRAPHIC COMMA
FF0C;CL # FULLWIDTH COMMA
FF64;CL # HALFWIDTH IDEOGRAPHIC COMMA

*** Having all commas behave the same would be incorrect. Where commas are
used, by default, inside numeric expressions, they would be IS. Commas
usually require a space to break from following letters or numbers, which
is true for CL but not true for EX. Assigning the class BA to a comma
allows a break after a space *before* the comma, while this is prevented
for EX. AL treats the comma like a letter (which is kept together with
other letters and numbers and following open parens, but not some other
punctuation) and ID allows a break before or after.

Collating this list and comparing the assignments is a useful sanity check.
I agree that it appears doubtful that all the choices in this list (and I
added 002C for completeness) are 100% accurate. However, any recommendation
for change must be accompanied by evidence of the desired behavior (or
examples where the current assignment produces incorrect results).

A bit of background: the default for a comma is currently CL.
This is a bit stricter than "Western" line breaking, but required by East
Asian rules. As the design point for the algorithm is to accommodate both,
where possible, using CL as a default is appropriate. The exceptions are
commas that are part of numerical notation; they need to be IS.

However, a point can be made that for scripts that are unlikely to occur in
an East Asian context, such 'stricter' behavior is not needed (and AL is
sufficient, where spaces are required to allow a break). On the other hand,
preventing a comma from starting a line is mainly preventing what are
marginal to poor line breaks, even outside the East Asian context;
therefore AL seems to have little to recommend it.

BA is only appropriate where x,y must be allowed to break before the y
without a space *and* where x ,y must also break after the x. (EX does the
former, but not the latter, and CL breaks only in the case x, y). BA is
really more appropriate for hyphen-like divider punctuation, not commas.

It would be helpful if experts for the various scripts could:

- verify that NKO comma is in numeric use
- verify that Mongolian commas may appear at the beginning of a line
  following a space
- verify that Arabic comma can break from preceding letters without a
  space and that they don't require a space following them to break a line
- verify that Armenian and Ethiopic commas behave like letters
- verify that FE51 really needs to be ID

If any of these verifications fails, UTC should re-examine these
assignments for a future version.
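The class comparison above can be collected into a small lookup that experts could use to map observed behavior back to a candidate class. This is only a sketch of the behaviors as stated in this review, not a transcription of the UAX#14 pair table. The names `COMMA_BREAKS` and `classes_matching` are hypothetical, and the "x, y" entries for EX, BA and AL are implied rather than stated explicitly.

```python
# Where a break is allowed around a comma "," of a given class, per the
# discussion in this review:
#   "x,y"  : break between the comma and y, with no space at all
#   "x ,y" : break after the space, so that the comma starts the next line
#   "x, y" : break after the space that follows the comma
COMMA_BREAKS = {
    "CL": {"x,y": False, "x ,y": False, "x, y": True},
    "EX": {"x,y": True,  "x ,y": False, "x, y": True},
    "BA": {"x,y": True,  "x ,y": True,  "x, y": True},
    "AL": {"x,y": False, "x ,y": True,  "x, y": True},
}


def classes_matching(observed):
    """Return the classes whose behavior agrees with every observed pattern.

    `observed` maps a pattern such as "x ,y" to True (break seen in real
    text) or False (break considered wrong by users of the script).
    """
    return [
        lb for lb, behavior in COMMA_BREAKS.items()
        if all(behavior[pattern] == allowed
               for pattern, allowed in observed.items())
    ]


if __name__ == "__main__":
    # A comma that may start a line after a space and needs no following
    # space to break: only BA fits.
    print(classes_matching({"x,y": True, "x ,y": True}))
    # A comma that never starts a line and needs a following space: CL.
    print(classes_matching({"x,y": False, "x ,y": False}))
```

If evidence for a script matches none of the rows, or more than one, that is a sign the verification question needs to be asked more precisely.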
---

So do full stops, some are even AL which I find particularly surprising:

002E;IS # FULL STOP
0589;IS # ARMENIAN FULL STOP
06D4;EX # ARABIC FULL STOP
0701;AL # SYRIAC SUPRALINEAR FULL STOP
0702;AL # SYRIAC SUBLINEAR FULL STOP
1362;AL # ETHIOPIC FULL STOP
166E;AL # CANADIAN SYLLABICS FULL STOP
1803;BA # MONGOLIAN FULL STOP
1809;BA # MONGOLIAN MANCHU FULL STOP
2CF9;BA # COPTIC OLD NUBIAN FULL STOP
2CFE;BA # COPTIC FULL STOP
3002;CL # IDEOGRAPHIC FULL STOP
FE12;CL # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
FE52;CL # SMALL FULL STOP
FF0E;CL # FULLWIDTH FULL STOP
FF61;CL # HALFWIDTH IDEOGRAPHIC FULL STOP

*** The situation is the same for periods as for commas. It would be
helpful if experts for the various scripts could:

- verify that 0589 is used numerically
- verify that Arabic full stop can break from preceding letters without
  space
- verify that Ethiopic and Canadian Syllabics full stops act like letters
- verify that Mongolian and Coptic full stops may appear at the beginning
  of a line following a space, and that they don't require a space
  following them to break a line

If any of these verifications fails, UTC should re-examine these
assignments for a future version.

---

And semicolons (but I don't know what reversed semicolon is used for):

003B;IS # SEMICOLON
061B;EX # ARABIC SEMICOLON
1364;AL # ETHIOPIC SEMICOLON
204F;AL # REVERSED SEMICOLON
FE14;IS # PRESENTATION FORM FOR VERTICAL SEMICOLON
FE54;NS # SMALL SEMICOLON
FF1B;NS # FULLWIDTH SEMICOLON

*** Somewhat similar, in that a review of the issues for comma and period
would tend to point towards the correct assignment for semicolons of the
same scripts. A wrinkle is the NS. I believe that goes back to JIS X4051.
The reversed semicolon needs review - I don't understand its usage either;
we need someone knowledgeable in its use to give input.

---

Control characters (good that VT got lb prop BK):

NBH (0083) should have the lb value WJ, like WJ and ZWNBSP.
BPH (0082) should have the lb value ZW, like ZWSP.

*** I would recommend against enshrining these in UAX#14 - unless we are
presented with evidence that they are (reasonably widely) used that way in
a Unicode context.

However, this points out another issue, namely that the current formulation
does not allow tailoring of control characters. This may be fine, in the
sense that line-breaking terminal emulation data streams could simply be
considered ipso facto a different algorithm, rather than a tailoring of the
Unicode line breaking algorithm, just as an HTML parser and an XML parser
are really different beasts (although similar), governed by different
conformance requirements.

It's worth bringing this to UTC's attention as an issue to be resolved.

---

NL and BK are the same, so there's no need for two lb values. So I suggest
merging NL and BK to just BK, i.e. let NEL have BK.

*** UTC made the deliberate decision to not make that change when we first
realized this. If others agree that this marginal simplification would be
useful, there's no reason why we couldn't change this in a future version -
the impact on conformance is nil.

---

------------------------------------------------------