Re: TR29 Word Break awkwardness

From: Asmus Freytag
Date: Tue Sep 14 2004 - 17:59:52 CDT

    It might be worth stepping back and asking the question: What is the
    purpose of publishing word-breaking behavior as part of the Unicode Standard?

    The answer to this question is neither easy nor obvious. Part of the
    problem is that what constitutes a 'word' is subject to tailoring. In
    certain languages and/or certain situations, implementations may need to
    make different choices in behavior than the ones we document.
    That puts a natural limit on how 'accurate' our default rules can and
    should be for certain rare and special cases.

    On the other hand, certain characters explicitly have word-connecting
    semantics. Part of the purpose of our word-breaking rules is to provide a convenient
    description of that behavior (together with a list of characters affected).
    In some instances, such special semantics are not subject to tailoring, as
    doing so would subvert the reason for the existence of a given character in
    the Unicode Standard. Note that in the description of linebreaking, which
    has similar issues to the ones discussed here, such characters are called
    out explicitly.

    Finally, there are some complexities that result from the overall
    architecture of the Unicode Standard (and which in turn are driven by
    complexities in the writing systems it attempts to cover). The existence of
    combining marks and grapheme clusters is one of them. One of the purposes
    of providing the word break rules must be to give accurate guidance on
    handling these complexities. That in turn argues for (rather than against)
    detailed specifications of edge cases involving combining marks, absence of
    base characters, etc., so that not all implementers have to rediscover such
    issues independently.

    After this preamble, what can we conclude about the issue?

    The rule "treat combining marks like their base character" works well if
    the base character has definite behavior, i.e. is either a letter or a
    symbol. In such cases the combining sequence acts like a letter or symbol
    (or number, or punctuation mark) and that most naturally follows from the
    fact that most combining marks are used as distinguishing decorations on
    characters, or act like letters following other letters. (The use of
    combining marks on symbols or punctuation is in the domain of specialized
    notations, such as mathematics or musical notation, both of which can be
    expected to heavily tailor word breaking and other default algorithms.)
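
    To make the "marks behave like their base" rule concrete, here is a
    minimal sketch. The class names and the mapping from General_Category
    are rough stand-ins chosen for illustration, not the actual TR29
    Word_Break property, and both function names are hypothetical.

```python
import unicodedata

def coarse_class(ch):
    """Rough stand-in for a word-break class (an assumption, not the TR29 property)."""
    cat = unicodedata.category(ch)
    if cat.startswith("M"):
        return "Extend"
    if cat.startswith("L"):
        return "ALetter"
    if cat.startswith("N"):
        return "Numeric"
    if cat.startswith("Z"):
        return "Space"
    return "Other"

def classes_with_inherited_marks(text):
    """Return (char, class) pairs; each combining mark takes the class of its base."""
    result = []
    base_class = None  # None means no base character has been seen yet
    for ch in text:
        cls = coarse_class(ch)
        if cls == "Extend":
            # A mark with no preceding base stays "Extend" in this sketch.
            if base_class is not None:
                cls = base_class
        else:
            base_class = cls
        result.append((ch, cls))
    return result
```

    With a letter as base, the mark comes out as ALetter. With NBSP as
    base, the same mark comes out as Space, which is the awkward case
    taken up next.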

    However, when NBSP is used as base character stand-in (the use of SPACE for
    this will be deprecated in 4.1 for reasons that should have become
    abundantly clear in this discussion alone) the resultant combining sequence
    is best treated as a letter, not as a space. In line breaking, the use of
    NBSP has that effect, as NBSP and letters behave very similarly there, but
    in word breaking this is not true.

    It might make sense to explicitly add these rules:

    1) treat all sequences of NBSP followed by combining marks as ALetter
    2) treat all combining marks w/o base character (i.e. start of text, after
    control codes) as ALetter

    (rule 2 is implicitly captured by making Grapheme_Extend < ALetter).
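
    A rough sketch of what rules 1 and 2 would select, to show the shape
    of the special-casing involved. It assumes that "combining mark" means
    General_Category M* and that "control codes" means General_Category
    Cc; both are simplifications of mine, and the function names are
    hypothetical.

```python
import unicodedata

NBSP = "\u00A0"

def is_mark(ch):
    """Assumption: any General_Category M* character counts as a combining mark."""
    return unicodedata.category(ch).startswith("M")

def is_control(ch):
    """Assumption: 'control codes' means General_Category Cc."""
    return unicodedata.category(ch) == "Cc"

def proposed_aletter_runs(text):
    """Return (start, end) index pairs that rules 1 and 2 would treat as ALetter."""
    runs = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if is_control(ch):
            # Controls never act as a base; marks after them are handled by rule 2.
            i += 1
            continue
        # Consume this character plus any combining marks that follow it.
        j = i + 1
        while j < n and is_mark(text[j]):
            j += 1
        if is_mark(ch):
            # Rule 2: only reachable at start of text or right after a control code.
            runs.append((i, j))
        elif ch == NBSP and j > i + 1:
            # Rule 1: NBSP followed by at least one combining mark.
            runs.append((i, j))
        i = j
    return runs
```

    For example, proposed_aletter_runs("a\u00A0\u0301b") selects only the
    NBSP-plus-acute sequence at indices 1 to 3; an ordinary "e\u0301"
    selects nothing, because the default rule already covers it.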

    However, in linebreaking we found that rules of the form
    "treat all x followed by combining marks as y" are difficult to implement,
    with fewer difficulties for the case where x = y.

    Adding an INVISIBLE, zero-width, base-letter of class ALetter for wordbreak
    and AL for linebreak would allow those who manage to enter one of them
    into a text to specify precisely that any combining sequence with this base
    letter should be treated like a letter for these purposes. It would also
    allow the placement of diacritical marks *between* letters (as cited by J.
    Knappen), but only if layout engines implement a true zero width for the
    combining sequence. (The latter would clash with the use of this character
    in citing a standalone diacritic, where some width usually is desirable).

    The biggest problem in adding a new letter is that all rendering
    implementations would need to be updated to recognize it, and all input
    methods would have to be updated to allow users access to it. Especially
    due to the latter, the use of SPACE (and possibly NBSP) for this purpose
    will continue.

    There are no easy answers.

