Re: UAX #14: no line breaks between OP and QU, even if there are intervening spaces

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sat Dec 01 2007 - 02:45:25 CST

    Asmus Freytag wrote:

    > The *default* line breaking algorithm in UAX#14 tries to meet several
    > constraints.
    >
    > 1) to be compatible with Kinsoku rules
    > 2) to be language neutral
    > 3) to be compatible with generic Western rules

    I don't see why any of that requires odd-looking rules involving
    punctuation marks and special characters, especially when they are in
    conflict with item 3.

    > The class QU means "either a closing or an opening quotation mark" and
    > reflects the lack of knowledge about actual usage. (By the way, some
    > languages use the *same* quotation mark as both opening and closing).

    Indeed. And if we don't really know a character, we shouldn't mess around
    with it.

    General line breaking rules, to the extent they are needed and can be
    meaningfully formulated, should be limited to allowing a break at any
    space, allowing breaks in a string of script-specific characters if allowed
    by the rules of that script, obeying explicit line break prohibitions
    and permissions, and disallowing other breaks. In particular, a space
    should be treated as breaking, since it is much more natural to treat
    special cases (where a break is not permitted) using either no-break
    space or higher-level protocol tools than to work against the artificial
    line-break prohibitions.
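
    As a minimal sketch of the simple rule proposed above (not of UAX #14
    itself), the following treats only U+0020 SPACE as breaking, takes
    ZERO WIDTH SPACE as an explicit break permission, and leaves the
    no-break spaces alone. Script-specific breaking is deliberately left
    out, and the function name break_opportunities is hypothetical.

```python
SPACE = "\u0020"
ZWSP = "\u200b"  # ZERO WIDTH SPACE, used here as an explicit break permission

def break_opportunities(text):
    """Return the indices i where a line may break just before text[i].

    Assumed rule set (a simplification for illustration):
    - a break is allowed after a run of ordinary spaces;
    - a break is allowed after ZERO WIDTH SPACE;
    - nothing else allows a break, so U+00A0 NO-BREAK SPACE and
      U+202F NARROW NO-BREAK SPACE never do.
    """
    breaks = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        # Break only at the end of a run of spaces, not inside it.
        if prev in (SPACE, ZWSP) and cur != SPACE:
            breaks.append(i)
    return breaks

print(break_opportunities('( "x" )'))          # [2, 6]
print(break_opportunities("10\u00a0km away"))  # [6]
```

    Under these assumptions the space between an opening parenthesis and
    a quotation mark is an ordinary break opportunity, and a break is
    prevented by writing a no-break space instead.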

    > If the character was an opening mark, you really don't want to have a
    > line break after it. The enclosed quotation might start with a space.

    When did you last see such a case?

    It must be a _very_ rare situation. Why would you include a leading
    space in a quotation? If you were thinking of the French spacing, then
    it's a special issue that needs special attention, not this kind of
    treatment in a very rare case. (The French spacing after an opening
    quotation mark should really be a narrow no-break space.)

    > To fix this, an implementation needs to tailor the assignment of
    > linebreak classes to supply additional information. In other words, if
    > IE encounters a " and, by some rule not defined in UAX#14, decides that
    > one of them is in fact an OP and the other is a CL then

    ... then cows will fly. It is unrealistic to expect that a sophisticated
    linguistic analysis will be applied to decide whether to override a
    line break prohibition. (It is not sufficient to know the language of the
    text, and the language is generally not known. The language markup is
    currently rarely used and far too often plain wrong to be trusted. Doing
    language-guessing on an entire document is feasible, though not very
    reliable, but this would have to be made at the phrase level.)

    The line breaking rules often appear to be based on the consideration of
    _some_ special cases (and perhaps _very_ special cases), where they
    might help to avoid some problems. But the question is whether they
    cause more trouble in other cases and whether the problems could be
    solved in simpler ways.

    > Using the untailored default algorithm is intended for situations
    > where the necessary information is lacking that would allow an
    > implementer to select a specific tailoring. Doing so, results in a
    > better average performance (for global text) than implementing ASCII
    > line break (break at space and hyphen only), which fails abysmally
    > for non-European text.

    The issue of breaking normal text - written using letters, syllabic
    characters, or ideograms - is quite separate from the issue of
    artificial rules that involve punctuation and special characters.

    ASCII hyphen, i.e. HYPHEN-MINUS, shouldn't really be treated as allowing
    a break, due to its semantic ambiguity and variation in usage. Breaking
    after it might be allowed by language- or application-specific rules,
    rather than being allowed by default and disallowed by special rules.
    The _general_ rules should be simple and conservative, trying to
    minimize bad breaks rather than to find as many break opportunities as
    possible. When you disallow a break that could be allowed, you may get
    suboptimal typography. When you allow a break that should not be allowed, you
    may distort data, e.g. effectively changing "-1" to "- 1" or "directory
    /foo" to "directory / foo". On the other hand, breaking at spaces should
    not be restricted by the general rules, since it is reasonable to expect
    that spaces are treated as breaking, so that special measures need to be
    taken to prevent it.
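
    To make the "-1" and "/foo" cases concrete, here is a small greedy
    wrapper, written only for illustration. It always breaks at spaces
    and, optionally, after the characters listed in break_after. The
    names chunks, wrap and break_after are hypothetical, and this is not
    how any particular browser or UAX #14 implementation works.

```python
def chunks(text, break_after=""):
    """Yield (preceded_by_space, piece) pairs of unbreakable pieces."""
    for word in text.split(" "):
        if not word:
            continue
        piece, first = "", True
        for ch in word:
            piece += ch
            if ch in break_after:
                # A break is permitted after this character.
                yield (first, piece)
                piece, first = "", False
        if piece:
            yield (first, piece)

def wrap(text, width, break_after=""):
    """Greedily fill lines of at most `width` characters where possible."""
    lines, line = [], ""
    for preceded_by_space, piece in chunks(text, break_after):
        sep = " " if (preceded_by_space and line) else ""
        if line and len(line) + len(sep) + len(piece) > width:
            lines.append(line)  # the space at the break is dropped
            line = piece
        else:
            line += sep + piece
    if line:
        lines.append(line)
    return lines

# Breaking at spaces only keeps the data intact:
print(wrap("offset -1", 8))                  # ['offset', '-1']
# Also breaking after "-" and "/" splits the number:
print(wrap("offset -1", 8, "-/"))            # ['offset -', '1']
print(wrap("directory /foo", 11, "-/"))      # ['directory /', 'foo']
```

    A reader who sees "offset -" at the end of one line and "1" at the
    start of the next cannot tell whether the original had "-1" or "- 1",
    which is the distortion described above.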

    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/


