From: Asmus Freytag (firstname.lastname@example.org)
Date: Sat Dec 01 2007 - 04:26:16 CST
On 12/1/2007 12:45 AM, Jukka K. Korpela wrote:
> The line breaking rules often appear to be based in the consideration of
> _some_ special cases (and perhaps _very_ special cases), where they
> might help to avoid some problems. But the question is whether they
> cause more trouble in other cases and whether the problems could be
> solved in simpler ways.
> The _general_ rules should be simple and conservative, trying to
> minimize bad breaks rather than to find as many break opportunities as
> possible. When you disallow a break that could be allowed, you may get
> suboptimal typography.
You seem to want a number of contradictory things.
Rule LB15 got its origin from just such an attempt to be conservative
when in doubt, realizing that allowing a bad break can be more damaging
than missing a break opportunity.
The algorithm is intended for multilingual text or for multilingual
environments. It can therefore _not_ simply assume that spaces are what
makes the break. Doing so would cause very suboptimal typography for
The original algorithm, before rule 15, was tested in shipping
implementations before offering it as a seed for the standardization
effort. It was itself based on European de-facto practice and certain
Asian standards in the area of linebreaking.
Rule 15 was added when Unicode discovered that it had cavalierly assumed
it knew which characters are opening and which are closing quotes. (All
quotes used to have OP or CL, or their equivalents in general category.)
This was found to be erroneous, but left the problem of how to deal with
these suddenly ambiguous characters.
Because a bad linebreak following an opening punctuation (or right
before a closing punctuation) is a very serious issue in non-Western
line layout, the UTC adopted the cautious formulation of Rule 15.
In this case, that's probably the best you can do for a standard.
However, there are a number of implementation approaches that might be
useful as suggested tailorings.
Given a sequence CL SP+ QU or QU SP+ OP, assume that the QU is of the
_opposite_ type to the other punctuation mark. This implements the
heuristic that quoted material is not likely to start or end with spaces
and that the direction of the other punctuation mark identifies the
start/end of a text run that is then most likely outside the quote. This
can be done during assignment of linebreak classes, or by rewriting rule 15.
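
As a rough illustration of doing this during class assignment, here is a
minimal sketch in Python. It assumes the text has already been mapped to a
list of line break class names given as plain strings ("CL", "SP", "QU",
"OP", "AL", ...). The function name resolve_ambiguous_quotes is made up for
this example and is not part of UAX#14 or of any library.

```python
def resolve_ambiguous_quotes(classes):
    """Return a copy of `classes` with QU resolved where context allows.

    Hypothetical helper: CL SP+ QU turns the QU into OP, and
    QU SP+ OP turns the QU into CL. Everything else is left alone.
    """
    resolved = list(classes)
    n = len(classes)
    for i, cls in enumerate(classes):
        if cls != "QU":
            continue
        # Look left across one or more spaces for a closing punctuation.
        j = i - 1
        while j >= 0 and classes[j] == "SP":
            j -= 1
        if 0 <= j < i - 1 and classes[j] == "CL":
            resolved[i] = "OP"  # opposite type: the quote opens
            continue
        # Look right across one or more spaces for an opening punctuation.
        k = i + 1
        while k < n and classes[k] == "SP":
            k += 1
        if i + 1 < k < n and classes[k] == "OP":
            resolved[i] = "CL"  # opposite type: the quote closes
    return resolved


# Classes for: "The Wire" (2005)
sample = ["QU", "AL", "SP", "AL", "QU", "SP", "OP", "NU", "CL"]
print(resolve_ambiguous_quotes(sample))
```

The lookups read the original list rather than the partially rewritten one,
so resolving one QU never influences how a neighbouring QU is resolved.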
Treat all QU adjacent to AL as AL, but otherwise allow a break QU SP OP
etc. That handles the case of "The Wire" (2005), but would fail in Asian
environments whenever the rule AL x AL is relaxed. (The latter is a not
uncommon tailoring, and one of the design criteria was to allow that to
be a tailoring that's easily handled, which rules out the use of space
as 'primary' break opportunity).
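
The second tailoring can be sketched the same way. Again this assumes a
pre-computed list of class names as strings, and the function name
quotes_next_to_letters_as_al is invented for the example. It covers only the
reassignment step, not the pair rules that are applied afterwards.

```python
def quotes_next_to_letters_as_al(classes):
    """Return a copy of `classes` where any QU touching an AL becomes AL.

    Hypothetical helper: a QU with no AL neighbour is left as QU, so a
    tailored rule set can still decide how to treat QU SP OP and the like.
    """
    resolved = list(classes)
    last = len(classes) - 1
    for i, cls in enumerate(classes):
        if cls != "QU":
            continue
        prev_is_al = i > 0 and classes[i - 1] == "AL"
        next_is_al = i < last and classes[i + 1] == "AL"
        if prev_is_al or next_is_al:
            resolved[i] = "AL"
    return resolved


# Classes for: "The Wire" (2005)
sample = ["QU", "AL", "SP", "AL", "QU", "SP", "OP", "NU", "CL"]
print(quotes_next_to_letters_as_al(sample))
```

After the reassignment the closing quote is seen as AL, so the sequence
before the parenthesis becomes AL SP OP, which the default rules break after
the space. As noted above, this depends on AL x AL being kept, so it is not
suitable where that rule is relaxed.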
There are a number of other alternatives that work well where the
language of the text isn't known, but where the implementer wants to
make a different tradeoff from the default.
As edge cases like this are discovered, the best approach is
documentation. I'm hoping Andy Heninger, who maintains UAX#14 now, is
reading this and can put some of these things into the section on