Re: UAX #14: no line breaks between OP and QU, even if there are intervening spaces

From: Asmus Freytag (
Date: Sat Dec 01 2007 - 04:26:16 CST

    On 12/1/2007 12:45 AM, Jukka K. Korpela wrote:
    > The line breaking rules often appear to be based in the consideration of
    > _some_ special cases (and perhaps _very_ special cases), where they
    > might help to avoid some problems. But the question is whether they
    > cause more trouble in other cases and whether the problems could be
    > solved in simpler ways.
    > The _general_ rules should be simple and conservative, trying to
    > minimize bad breaks rather than to find as many break opportunities as
    > possible. When you disallow a break that could be allowed, you may get
    > suboptimal typography.
    You seem to want a number of contradictory things.

    Rule LB15 got its origin from just such an attempt to be conservative
    when in doubt, realizing that allowing a bad break can be more damaging
    than missing a break opportunity.

    The algorithm is intended for multilingual text or for multilingual
    environments. It can therefore _not_ simply assume that spaces are what
    makes the break. Doing so would cause very suboptimal typography for
    Asian contexts.

    The original algorithm, before Rule 15, was tested in shipping
    implementations before offering it as a seed for the standardization
    effort. It was itself based on European de-facto practice and certain
    Asian standards in the area of linebreaking.

    Rule 15 was added when Unicode discovered that it had cavalierly assumed
    it knew which characters are opening and which are closing quotes. (All
    quotes used to have OP or CL, or their equivalents in general category.)
    This was found to be erroneous, but left the problem of how to deal with
    these suddenly ambiguous characters.

    Because a bad linebreak following an opening punctuation (or right
    before a closing punctuation) is a very serious issue in non-Western
    line layout, the UTC adopted the cautious formulation of Rule 15.

    In this case, that's probably the best you can do for a standard.
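
    As a rough illustration of what the cautious formulation does, the
    sketch below checks a sequence of line break classes against Rule 15
    as it read in UAX #14 at the time (QU SP* x OP). The function name and
    the list-of-strings representation are assumptions made for this
    example; they are not taken from any implementation.

```python
def lb15_forbids_break(classes, pos):
    """Return True if Rule 15 (QU SP* x OP) forbids a break before classes[pos].

    `classes` is a list of UAX #14 line break class names, one per character.
    This representation is an assumption made for illustration only.
    """
    if pos <= 0 or pos >= len(classes) or classes[pos] != "OP":
        return False
    # Skip back over any run of spaces directly before the OP.
    j = pos - 1
    while j >= 0 and classes[j] == "SP":
        j -= 1
    # The break is forbidden only if that run is preceded by a QU.
    return j >= 0 and classes[j] == "QU"


# "The Wire" (2005): the closing quote is QU, so no break before "(".
wire = ["QU", "AL", "AL", "AL", "SP", "AL", "AL", "AL", "AL", "QU", "SP", "OP"]
print(lb15_forbids_break(wire, len(wire) - 1))
```

    Because the rule cannot tell an opening QU from a closing one, the
    break before the parenthesis is lost even though the quote here closes.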

    However, there are a number of implementation approaches that might be
    useful as suggested tailorings.

    Given a sequence CL SP+ QU or QU SP+ OP assume that the QU is of the
    _opposite_ type to the other punctuation mark. This implements the
    heuristic that quoted material is not likely to start or end with spaces
    and that the direction of the other punctuation mark identifies the
    start/end of a text run that is then most likely outside the quote. This
    can be done during assignment of linebreak classes, or by rewriting Rule 15.
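
    One way to picture the first tailoring, done at class-assignment time,
    is the sketch below. It assumes the text has already been mapped to a
    list of class names; the function name is hypothetical, and leaving a
    QU untouched when both patterns match at once is a choice made for
    this example, not something the message specifies.

```python
def resolve_ambiguous_quotes(classes):
    """Reassign QU to the opposite type of a neighbouring CL or OP.

    CL SP+ QU  ->  the QU is treated as OP (it most likely opens a quote).
    QU SP+ OP  ->  the QU is treated as CL (it most likely closes a quote).
    At least one space is required; a QU matching both patterns is left alone.
    """
    out = list(classes)
    n = len(classes)
    for i, cls in enumerate(classes):
        if cls != "QU":
            continue
        # Look left across one or more spaces for a CL.
        j = i - 1
        while j >= 0 and classes[j] == "SP":
            j -= 1
        after_close = 0 <= j < i - 1 and classes[j] == "CL"
        # Look right across one or more spaces for an OP.
        k = i + 1
        while k < n and classes[k] == "SP":
            k += 1
        before_open = i + 1 < k < n and classes[k] == "OP"
        if after_close and not before_open:
            out[i] = "OP"
        elif before_open and not after_close:
            out[i] = "CL"
    return out


# Closing quote, space, "(": the QU becomes CL, so the ordinary
# break-after-spaces rule applies before the parenthesis.
print(resolve_ambiguous_quotes(["AL", "QU", "SP", "OP", "NU", "CL"]))
```

    Once the QU has been reassigned, the unmodified rules for OP and CL
    take over, so Rule 15 itself does not need to change in this variant.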

    Treat all QU adjacent to AL as AL, but otherwise allow a break QU SP OP
    etc. That handles the case of "The Wire" (2005), but would fail in Asian
    environments whenever the rule AL x AL is relaxed. (The latter is a not
    uncommon tailoring, and one of the design criteria was to allow that to
    be a tailoring that's easily handled, which rules out the use of space
    as 'primary' break opportunity).
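
    The second tailoring can be sketched in the same style. The function
    name is hypothetical, and reading "adjacent" as "immediately before or
    after, with no intervening space" is an assumption of this example.

```python
def quotes_next_to_letters_as_al(classes):
    """Treat every QU that directly touches an AL as AL.

    Adjacency is judged on the original classes, so a reassigned quote
    does not cause a neighbouring quote to be reassigned in turn.
    """
    out = list(classes)
    for i, cls in enumerate(classes):
        if cls != "QU":
            continue
        prev_al = i > 0 and classes[i - 1] == "AL"
        next_al = i + 1 < len(classes) and classes[i + 1] == "AL"
        if prev_al or next_al:
            out[i] = "AL"
    return out


# "The Wire" (2005), abbreviated: both quotes touch letters and become AL,
# so Rule 15 no longer matches and a break after the space is available.
print(quotes_next_to_letters_as_al(["QU", "AL", "SP", "AL", "QU", "SP", "OP"]))
```

    A QU that touches no AL keeps its class, and the break QU SP OP would
    then have to be allowed by a separate change to Rule 15.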

    There are a number of other alternatives that work well where the
    language of the text isn't known, but where the implementer wants to
    make a different tradeoff from the default.

    As edge cases like this are discovered, the best approach is
    documentation. I'm hoping Andy Heninger, who maintains UAX#14 now, is
    reading this and can put some of these things into the section on
    suggested tailorings.

