L2/08-089

Source: Asmus Freytag
Date: Fri, 18 Jan 2008
Subject: Re: UAX 14 behavior of U+3002 IDEOGRAPHIC FULL STOP

Eric and I have been discussing his issue off-line, but we agreed that 
it would be best to copy some of our discussion to the list as these
details might be of interest in a future UTC discussion.

A./

On 1/18/2008 9:32 AM, Eric Muller wrote:
> Asmus Freytag wrote:
>> Eric,
>>
>> are you planning on submitting a formal request to make a change to 
>> the linebreaking specification? 
>
> Not sure yet, because 1) there is little time, and 2) I need to better 
> understand the requirements for ideographic comma and the ideographic 
> brackets. 
Eric,

As the unintended consequences of adding rule 30 and changing rule 25
in revision 18 have shown, you want to make sure you know the
consequences of your request ahead of time.

> I'll let you know if I come up with something.

Well phrased requirements are always useful. In this case, the
requirements shouldn't have been in question, as we've known about
JIS X 4051 from the start, and UAX#14 in its original form was much
closer to X 4051 than it is now. Its design point has always been to be
an acceptable compromise between Western and East Asian linebreaking
(exemplified by JIS X 4051), with tailoring needed to tune it fully for
either environment. Simple implementations, using it right out of the
box, should be able to achieve acceptable text where there are no
language or style selections.

It was an oversight in the heat of battle, so to speak, that allowed
such a far-reaching change during the update for 5.0.0. With a bit of 
distance and without pressure to manage that issue in addition to 
character additions and other noise, I can see that now very clearly.

Therefore, in my view it is best to treat this as a *regression* bug, 
and to construct the remedy accordingly.

Let me summarize what I've learned:
-----------------------------------

The original classification and rules for (ideographic and wide) commas
and periods were derived from JIS X 4051. Up to revision 17 (or Unicode 
4.1.0), UAX#14 supported those rules.

Then came the desire to prevent line breaks in examples like "person(s)" 
and negative cent values. These are indeed an issue, since they are not 
uncommon and are handled correctly by default in any ASCII-only 
linebreaking scheme.

Now you report an issue with (ideographic and wide) commas and
periods, which used to work fine.

The best solution is to back out the changes made in revision 18 (as
being too broad) and then to sit back and figure out what the minimal
required changes would have been.

The examples that were offered were for "(" and ")". You would agree
that "[" and "]" are much more rarely used, but in principle behave
the same way. However, many of the other characters in OP and CL do not
(e.g. none of the East Asian quotes or brackets).

So the real need was to do something about "(" and ")" in the context of
$, %, letters and numbers (in other words, classes PR, PO, AL, NU.)

This can be achieved by switching ")" (and perhaps also "]" and "}")
from class CL to class IS, which has the desired behavior. This is not
merely a case of IS "happening to fit the bill": IS is intended to
cover punctuation that can be used in numeric expressions. Note that "("
can't be moved to IS, since IS is treated as trailing, not opening,
punctuation when not part of a number.

However, the desired behavior of "(" is taken care of by retaining this
term from rule 30:
                           (AL | NU) × OP

and this term from rule 25:
                              PO × OP

(These two terms would remove some break opportunities before an EA
quote or bracket. Such contexts would tend to be fairly uncommon, but it
remains a compromise. However, it would avoid a new LB class for
opening, non-Asian brackets. A long term solution might be to divide
several LB classes to require fewer compromises and allow better
tailoring. That would go far beyond fixing the bug reported by you.)

All terms involving class CL in rules 25 and 30 would be deleted (those
terms are the ones that give rise to the reported problems with 
ideographic and wide commas and periods).

Finally, a note should point out that the behavior of ")" is no longer
conformant to X 4051 and that such behavior can be restored by tailoring
")" U+0029 back from IS to CL. The same note should point out,
non-specifically, that the behavior of class PO deviates in several ways
from X 4051 because, when not used in a Japanese context, many
characters in class PO are ambiguously used for both prefix and postfix
numerical symbols. Tailoring of the rules for PO would be required for
strict X 4051-style behavior. (In X 4051, a break after PO and before PR
is usually allowed.)
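
The effect of the summary above can be sketched with a toy pair check in
Python. This is only an illustration, not the UAX#14 algorithm: spaces
and most rules are ignored, every name in it (COMMON, RULES_REV18,
lb_class, breaks, and so on) is made up for this sketch, and the
IS x AL and IS x NU pairs are an assumption about how class IS already
behaves.

```python
# Toy model only: a pair (before, after) in a rule set means "no break here".
# Any pair not listed is treated as a break opportunity.
# All names below are hypothetical; spaces and most rules are ignored.

COMMON = {
    ("AL", "AL"), ("NU", "NU"),
    ("AL", "OP"), ("NU", "OP"),   # retained term from rule 30: (AL | NU) x OP
    ("PO", "OP"),                 # retained term from rule 25: PO x OP
    ("OP", "AL"), ("OP", "NU"),   # no break right after an opening bracket
    ("AL", "CL"), ("NU", "CL"), ("ID", "CL"),  # no break before CL
    ("AL", "IS"), ("NU", "IS"),   # no break before IS
    ("IS", "AL"), ("IS", "NU"),   # assumed: existing behavior of class IS
}

# Revision 18 style: terms involving CL in rules 25 and 30 are present.
RULES_REV18 = COMMON | {("CL", "AL"), ("CL", "NU"), ("CL", "PO"), ("CL", "PR")}
# Proposal: the CL terms are deleted again.
RULES_PROPOSED = COMMON

CLASSES_REV18 = {"(": "OP", ")": "CL", "\u3002": "CL"}
CLASSES_PROPOSED = {**CLASSES_REV18, ")": "IS"}   # ")" moves from CL to IS


def lb_class(ch, classes):
    """Very rough class lookup, good enough for the samples below."""
    if ch in classes:
        return classes[ch]
    if ch.isdigit():
        return "NU"
    if "\u4e00" <= ch <= "\u9fff":
        return "ID"
    return "AL"


def breaks(text, classes, rules):
    """Return the offsets at which the toy model allows a break."""
    cls = [lb_class(ch, classes) for ch in text]
    return [i for i in range(1, len(cls)) if (cls[i - 1], cls[i]) not in rules]


# Last sample: one ideograph, U+3002 IDEOGRAPHIC FULL STOP, then "A".
samples = ["person(s)", "(s)he", "\u8a9e\u3002A"]
for s in samples:
    print(s, breaks(s, CLASSES_REV18, RULES_REV18),
          breaks(s, CLASSES_PROPOSED, RULES_PROPOSED))
```

In this toy model both rule sets keep "person(s)" and "(s)he" together,
while only the proposed set allows a break between U+3002 and a
following letter.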

There you have it,

A./

PS:  Some additional historical details:

The paper that caused all the trouble was
http://www.unicode.org/L2/L2005/05292-linebreak.html
It's item 2. in that paper (as well as some of item 3) that we are
concerned with here.

In Section 2 of that paper: "*Opening and closing shouldn't break from
alphanums*"

These two examples are given:
    Example: person(s)
    Example:  »wie hier «

Only the first example is something that's common enough to require
"getting right" by default. The second example is a bit of a red
herring. Ambiguous quotes (such as "»") are already handled via class QU
and are not covered by the changes.
Also, OP and CL never split from their enclosed text.

By implication, the paper claims that what's needed for "(" is needed
for all brackets, in fact for all characters covered by OP and CL.
There's no evidence to back up that claim. And, as you have seen, class
CL covers more than closing brackets, leading to the regression bug.

The original input to this was Yukka's page cited by the paper. However,
that page complains about linebreaks in IE, some of which are not in
conformance with UAX#14. For others, UAX#14 needs to balance where it
places the *default*. Where possible, the result should be compatible
with X 4051, so that East Asian texts work reasonably well, but where
necessary, common occurrences of unusual line breaks in Western text
should be suppressed.

Parens in words and numeric contexts are really common; getting them
right in a Western context is a must. [] and {} are arguable, but the
ideographic brackets and quotes in OP and CL are clearly designed to NOT
require a space on their outside (preceding OP or following CL) to
break. Changing the summary above to specifically include "]" or "}" or
both in the change from CL to IS is a viable alternative, but it should
stop there.


From: Asmus Freytag <asmusf@ix.netcom.com>
Date: Mon, 28 Jan 2008 16:58:40 -0800
To: Peter Edberg <pedberg@apple.com>
Subject: Re: UAX 14 behavior of U+3002 。 IDEOGRAPHIC FULL STOP
CC: Eric Muller <emuller@adobe.com>, Deborah Goldsmith <goldsmit@apple.com>,
Andy Heninger <heninger@us.ibm.com>, Michel Suignard <michelsu@microsoft.com>

On 1/28/2008 1:28 PM, Peter Edberg wrote:
> Looks good to me. Is there already a UTC 114 agenda item for this? If 
> not, I will request one.
> -Peter E
In all of this it should not be overlooked that the current situation 
(5.0.0) *is* a regression, and that this regression needs to be fixed by 
backing out from the changes introduced after 4.1.0.

I have pointed out how the objectives of the intended modifications can
be achieved by slightly different means that don't cause the
regression. In addition to keeping some rule changes for class OP (open
brackets), it involves moving parens and certain brackets to class IS,
which is a class that interacts with runs of digits and numbers in the
desired way. (My last posting to Unicore contains a *detailed*
prescription for how to go about that, so I won't repeat that here.)

Implementing those adjustments will achieve the objectives of a 
different behavior for brackets commonly used in Western texts, such as 
() (and could include [] and {}). It will require *no* tailoring for the 
behavior of ID period and comma, which is important; there should not 
ever be a need for tailoring to get script specific characters like that 
to behave correctly in a native context.

There's now a potential minor need for tailoring the action of (),
{} and [], so that Western text with them acts more "Japanese". That
would be appropriate. As it is an unspecific tailoring (good for all EA
languages), it could go into a note in UAX#14.

Language-specific, or document-style specific tailorings do belong in 
CLDR, so that they can be described precisely and invoked precisely. But 
there's a realm of applications for which quick&dirty linebreaking is 
going to be fine, and generic approaches suffice for them. Those 
approaches are best explained in the UAX (there are several examples).

About Eric's suggestions for proofing changes: it is *not* enough to
consider the interaction of LB classes, because the classes are never
guaranteed to be fine-grained enough. For example, the recent regression
affected a few cells in Table 2, and none that weren't deliberately
intended. What was unintended was to have these changes applied to
ideographic sentence ending punctuation.

The first cut is to update Table 2 with the changes, then diff it
(manually is fine, it's small) against the preceding version. That gives
you an unambiguous diff of the change in *rules*.

The second step is to diff the LB class assignments. That gives you an 
unambiguous diff of the change of coverage for certain behavior.
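
Both diffs can be sketched with one small helper in Python. The
representation is an assumption made for this sketch: the pair table is
held as a dict keyed by (class before, class after), the class
assignments as a dict keyed by code point, and diff_map is a
hypothetical name. The cell values are placeholders and are not copied
from Table 2.

```python
def diff_map(old, new):
    """Return {key: (old_value, new_value)} for every entry that differs.

    A key missing on one side shows up with None on that side.
    """
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }


# Step 1: pair table cells, keyed by (class before, class after).
# The cell values here are placeholders, not copied from Table 2.
old_table = {("CL", "AL"): "break", ("AL", "AL"): "no break"}
new_table = {("CL", "AL"): "no break", ("AL", "AL"): "no break"}
rule_diff = diff_map(old_table, new_table)

# Step 2: class assignments, keyed by code point.
old_classes = {"U+0029": "CL", "U+3002": "CL"}
new_classes = {"U+0029": "IS", "U+3002": "CL"}
coverage_diff = diff_map(old_classes, new_classes)

print(rule_diff)
print(coverage_diff)
```

The first diff lists only the changed cells; the second lists only the
characters whose class moved, which is where an unintended change of
coverage would show up.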

These two steps sometimes are not enough. What you would need to develop 
is a list of *bellwether* characters, which are the *common*, and 
*widely used* characters on which the *orthographic* rules for a 
language operate: period (narrow and wide), comma, parens, letter A, 
digit 1, a few other signs and exemplars for several scripts.

You should expect to have more than one exemplar for each LB class,
since, again, the classes are not necessarily fine-grained enough for
the purpose. (The new behavior may be more fine-grained than what was on
the table when the original classes were drawn up.)

The bellwether characters can then be used to visualize (in a test file) 
any effect of a change in rules. The test file would simply be a diff, 
showing two "words", one with breaks according to the original rules and 
classes, the other according to the new rules and classes. The 
differences in break position could be highlighted in color. That would 
allow reviewers and testers to spot unintended behavior quickly, but 
would mean generating some sophisticated HTML summary from a test 
harness - not something easily realizable by flat files.
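
A minimal plain-text version of such a comparison might look like the
Python sketch below. It assumes the test harness can supply two
functions that return break offsets for a string, one for the original
rules and classes and one for the new ones. The two functions used here
(old_breaks, new_breaks) are trivial stand-ins, not real linebreak
implementations, and mark and compare are hypothetical names. Color
highlighting in HTML is left out.

```python
def mark(text, positions):
    """Return text with "|" inserted at each break offset."""
    wanted = set(positions)
    out = []
    for i, ch in enumerate(text):
        if i in wanted:
            out.append("|")
        out.append(ch)
    return "".join(out)


def compare(samples, old_breaks, new_breaks):
    """Return (sample, old_marked, new_marked) for samples whose breaks differ."""
    report = []
    for s in samples:
        old, new = old_breaks(s), new_breaks(s)
        if old != new:
            report.append((s, mark(s, old), mark(s, new)))
    return report


# Stand-ins for the two implementations under test (assumptions only):
# the "old" one never breaks, the "new" one breaks after U+3002.
def old_breaks(text):
    return []


def new_breaks(text):
    return [i + 1 for i, ch in enumerate(text[:-1]) if ch == "\u3002"]


# Bellwether samples: a Western paren case and an ideographic full stop case.
bellwethers = ["person(s)", "\u8a9e\u3002A"]
for sample, before, after in compare(bellwethers, old_breaks, new_breaks):
    print(before, "->", after)
```

Only the samples whose break positions changed are reported, so a
reviewer sees the unintended cases without reading the unchanged ones.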

A./