L2/08-089
Source: Asmus Freytag
Date: Fri, 18 Jan 2008
Subject: Re: UAX 14 behavior of U+3002 IDEOGRAPHIC FULL STOP

Eric and I have been discussing his issue off-line, but we agreed that it would be best to copy some of our discussion to the list, as these details might be of interest in a future UTC discussion.

A./

On 1/18/2008 9:32 AM, Eric Muller wrote:
> Asmus Freytag wrote:
>> Eric,
>>
>> are you planning on submitting a formal request to make a change to
>> the linebreaking specification?
>
> Not sure yet, because 1) there is little time, and 2) I need to better
> understand the requirements for ideographic comma and the ideographic
> brackets.

Eric,

As the unintended consequences introduced by adding rule 30 and changing rule 25 in revision 18 have shown, you want to make sure you know the consequences of your request ahead of time.

> I'll let you know if I come up with something.

Well-phrased requirements are always useful. In this case, the requirements shouldn't have been in question, as we've known about X 4051 from the start, and UAX#14 in its original form was much closer to 4051 than it is now. Its design point has always been to be an acceptable compromise between Western and East Asian linebreaking (exemplified by JIS X 4051), but with the need to tailor it to tune it fully for either environment. Simple implementations, using it right out of the box, should be able to achieve acceptable text where there are no language or style selections.

It was an oversight in the heat of battle, so to speak, that allowed such a far-reaching change during the update for 5.0.0. With a bit of distance, and without pressure to manage that issue in addition to character additions and other noise, I can see that now very clearly.

Therefore, in my view it is best to treat this as a *regression* bug, and to construct the remedy accordingly.

Let me summarize what I've learned:
-----------------------------------

The original classification and rules for (ideographic and wide) commas and periods were derived from JIS X 4051. Up to revision 17 (or Unicode 4.1.0), UAX#14 supported those rules.

Then came the desire to prevent line breaks in examples like "person(s)" and negative cent values. These are indeed an issue, since they are not uncommon and are handled correctly by default in any ASCII-only linebreaking scheme.

Now you come and report an issue with (ideographic and wide) commas and periods (which used to work fine).

The best solution is to back out of the changes made in revision 18 (as being too broad) and then to sit back and figure out what the minimal required changes would have been.

The examples that were offered were for "(" and ")". You would agree that "[" or "]" are much more rarely used, but in principle behave the same way. However, many of the other characters in OP and CL do not (e.g. none of the East Asian quotes or brackets).

So the real need was to do something about "(" and ")" in the context of $, %, letters and numbers (in other words, classes PR, PO, AL, NU).

This can be achieved by switching ")" (and perhaps also "]" and "}") from class CL to class IS, which has the desired behavior. Nor is it merely the case that IS "happens to fit the bill": IS is intended to cover punctuation that can be used in numeric expressions.

Note that "(" can't be moved to IS, since IS is treated as trailing, not opening, punctuation when not part of a number. However, the desired behavior of "(" is taken care of by retaining this term from rule 30:

    (AL | NU) × OP

and this term from rule 25:

    PO × OP

(These two terms would remove some break opportunities before an EA quote or bracket. Such contexts would tend to be fairly uncommon, but it remains a compromise. However, it would avoid a new LB class for opening, non-Asian brackets.
A long-term solution might be to divide several LB classes to require fewer compromises and allow better tailoring. That would go far beyond fixing the bug reported by you).

All terms involving class CL in rules 25 and 30 would be deleted (those terms are the ones that give rise to the reported problems with ideographic and wide commas and periods).

Finally, a note should point out that the behavior of ")" is no longer conformant to X 4051 and that such behavior can be restored by tailoring ")" U+0029 back from IS to CL.

The same note should point out, non-specifically, that the behavior of class PO deviates in several ways from X 4051 because, when not used in a Japanese context, many characters in class PO are ambiguously used for both prefix and postfix numerical symbols. Tailoring of the rules for PO would be required for strict X 4051-style behavior. (In 4051, a break after PO and before PR is usually allowed.)

There you have it,
A./

PS: Some additional historical details: the paper that caused all the trouble was

http://www.unicode.org/L2/L2005/05292-linebreak.html

It's item 2 in that paper (as well as some of item 3) that we are concerned with here.

In Section 2 of that paper: *"Opening and closing shouldn't break from alphanums"*. These two examples are given:

Example: person(s)
Example: »wie hier«

Only the first example is something that's common enough to require "getting right" by default. The second example is a bit of a red herring. Ambiguous quotes (such as "»") are already handled via class QU and are not covered by the changes. Also, OP and CL never split from their enclosed text.

By implication, the paper claims that what's needed for "(" is needed for all brackets, in fact for all characters covered by OP and CL. There's no evidence to back up that claim. And, as you have seen, class CL covers more than closing brackets, leading to the regression bug.

The original input to this was Yukka's page cited by the paper.
However, that page complains about linebreaks in IE, some of which are not in conformance with UAX#14. For others, UAX#14 needs to balance where it places the *default*. Where possible, the result should be compatible with X 4051, so that East Asian texts work reasonably well, but where necessary, common occurrences of unusual line breaks in Western text should be suppressed.

Parens in words and numeric contexts are really common; getting them right in a Western context is a must. [] and {} are arguable, but the ideographic brackets and quotes in OP and CL are clearly designed to NOT require a space on their outside (preceding OP or following CL) to break.

Changing the summary above to specifically include "]" or "}" or both in the change from CL to IS is a viable alternative, but it should stop there.


From: Asmus Freytag
Date: Mon, 28 Jan 2008 16:58:40 -0800
To: Peter Edberg
Subject: Re: UAX 14 behavior of U+3002 。 IDEOGRAPHIC FULL STOP
CC: Eric Muller, Deborah Goldsmith, Andy Heninger, Michel Suignard

On 1/28/2008 1:28 PM, Peter Edberg wrote:
> Looks good to me. Is there already a UTC 114 agenda item for this? If
> not, I will request one.
> -Peter E

In all of this it should not be overlooked that the current situation (5.0.0) *is* a regression, and that this regression needs to be fixed by backing out from the changes introduced after 4.1.0.

I have pointed out how the objectives of the intended modifications can be achieved by slightly different means that don't cause the regression. In addition to keeping some rule changes for class OP (open brackets), it involves moving parens and certain brackets to class IS, which is a class that interacts with runs of digits and numbers in the desired way. (My last posting to Unicore contains a *detailed* prescription of how to go about that, so I won't repeat that here).

Implementing those adjustments will achieve the objectives of a different behavior for brackets commonly used in Western texts, such as () (and could include [] and {}). It will require *no* tailoring for the behavior of ID period and comma, which is important; there should not ever be a need for tailoring to get script-specific characters like that to behave correctly in a native context.

There's now a potential for a minor need for tailoring the action of (), {} and [], so that Western text with them acts more "Japanese". That would be appropriate. As it is an unspecific tailoring (good for all EA languages), it could go into a note in UAX#14.

Language-specific or document-style-specific tailorings do belong in CLDR, so that they can be described precisely and invoked precisely. But there's a realm of applications for which quick & dirty linebreaking is going to be fine, and generic approaches suffice for them. Those approaches are best explained in the UAX (there are several examples).

About Eric's suggestions for proofing changes: it is *not* enough to consider the interaction of LB classes, because the classes are never guaranteed to be fine-grained enough. For example, the recent regression affected a few cells in Table 2, and none that weren't deliberately intended. What was unintended was to have these changes applied to ideographic sentence-ending punctuation.

The first cut is to update Table 2 with the changes, then diff it (manually is fine, it's small) against the preceding version. That gives you an unambiguous diff of the change in *rules*.

The second step is to diff the LB class assignments. That gives you an unambiguous diff of the change of coverage for certain behavior.

These two steps sometimes are not enough.
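
The two diffs described above could be sketched roughly as follows. This is a minimal illustration in Python, not part of UAX#14: the function name `diff_mapping`, the cell symbols, and all sample values are assumptions made up for the example and are not copied from any revision of Table 2 or of the class data.

```python
def diff_mapping(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Step 1: pair-table cells keyed by (class before, class after).
# Illustrative values only: "_" stands for a direct break, "%" for an indirect one.
old_table = {("CL", "AL"): "_", ("CL", "NU"): "_", ("AL", "OP"): "_"}
new_table = {("CL", "AL"): "%", ("CL", "NU"): "%", ("AL", "OP"): "%"}

# Step 2: class assignments keyed by code point (a hypothetical reassignment).
old_classes = {0x0028: "OP", 0x0029: "CL", 0x3002: "CL"}
new_classes = {0x0028: "OP", 0x0029: "IS", 0x3002: "CL"}

rule_changes = diff_mapping(old_table, new_table)
coverage_changes = diff_mapping(old_classes, new_classes)

for (before, after), (was, now) in rule_changes.items():
    print(f"{before} -> {after}: {was} became {now}")
for code_point, (was, now) in coverage_changes.items():
    print(f"U+{code_point:04X}: {was} became {now}")
```

Note that neither diff by itself reveals which characters in an unchanged class, such as U+3002 in CL, pick up the changed cells.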

What you would need to develop is a list of *bellwether* characters, which are the *common* and *widely used* characters on which the *orthographic* rules for a language operate: period (narrow and wide), comma, parens, letter A, digit 1, a few other signs, and exemplars for several scripts. You should expect to have more than one exemplar for each LB class since, again, the classes are not necessarily fine-grained enough for the purpose. (The new behavior may be more fine-grained than what was on the table when the original classes were drawn up).

The bellwether characters can then be used to visualize (in a test file) any effect of a change in rules. The test file would simply be a diff, showing two "words", one with breaks according to the original rules and classes, the other according to the new rules and classes. The differences in break position could be highlighted in color.

That would allow reviewers and testers to spot unintended behavior quickly, but would mean generating some sophisticated HTML summary from a test harness - not something easily realizable by flat files.

A./
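
As a rough sketch of the bellwether idea, assuming a drastically simplified toy model rather than the real UAX#14 rules: the snippet below builds every two-character "word" from a handful of bellwether characters and reports the words whose break positions differ between an "old" and a "new" configuration. The class maps, the no-break pair sets, and the names `breaks` and `changed_words` are all hypothetical; the pair sets only loosely follow the terms named in this message and omit most of the algorithm (spaces, PO/PR, quotes, and so on).

```python
from itertools import product

# A few bellwether characters: letter, digit, parens,
# ideographic full stop (U+3002) and one ideograph (U+4E00).
BELLWETHERS = ["A", "1", "(", ")", "\u3002", "\u4e00"]

# Toy class maps (assumption): the "new" map moves ")" from CL to IS.
OLD_CLASSES = {"A": "AL", "1": "NU", "(": "OP", ")": "CL",
               "\u3002": "CL", "\u4e00": "ID"}
NEW_CLASSES = {**OLD_CLASSES, ")": "IS"}

# Toy sets of (before, after) class pairs with no break between them.
COMMON = {("AL", "AL"), ("AL", "NU"), ("NU", "AL"), ("NU", "NU"),
          ("AL", "OP"), ("NU", "OP"), ("IS", "AL"), ("IS", "NU")}
OLD_NO_BREAK = COMMON | {("CL", "AL"), ("CL", "NU")}  # keeps the CL terms
NEW_NO_BREAK = COMMON                                 # CL terms deleted


def breaks(text, classes, no_break):
    """Return each index i where a break is allowed before text[i]."""
    positions = []
    for i in range(1, len(text)):
        before, after = classes[text[i - 1]], classes[text[i]]
        # Never break after OP or before CL / IS.
        if before == "OP" or after in ("CL", "IS"):
            continue
        if (before, after) in no_break:
            continue
        positions.append(i)
    return positions


def changed_words(length=2):
    """List (word, old_breaks, new_breaks) for words whose breaks differ."""
    changed = []
    for chars in product(BELLWETHERS, repeat=length):
        word = "".join(chars)
        old = breaks(word, OLD_CLASSES, OLD_NO_BREAK)
        new = breaks(word, NEW_CLASSES, NEW_NO_BREAK)
        if old != new:
            changed.append((word, old, new))
    return changed


for word, old, new in changed_words():
    print(f"{word!r}: old breaks {old}, new breaks {new}")
```

With these toy tables, the only two-character words that change are the ideographic full stop followed by "A" or by "1", while "A(A)" stays unbroken under both configurations. A real harness would use the full class data and rules and could render the differences as highlighted HTML.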