Re: TR29 Word Break awkwardness

From: Andy Heninger (andyh@jtcsv.com)
Date: Thu Sep 16 2004 - 14:11:13 CDT


    Asmus Freytag wrote:
    > There are no easy answers.

    Yes, this stuff gets convoluted very quickly.

    For efficient implementation of the default rules (a single-pass,
    no-backtracking DFA with reasonably understandable and maintainable
    state table generation), it would be very helpful to get rid of the
    overlap between Grapheme Extend and ALetter.

    What doesn't work very well for me is the current situation, with
    characters that, depending on context, either participate or don't
    participate in the application of the rules.

    This change is pretty much orthogonal to the Hebrew question raised by
    Peter. It only impacts combining characters with _no_ base character,
    which should not be the norm.
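
    To make the single-pass idea concrete, here is a minimal sketch in
    Python. It is not the TR29 algorithm: the three toy classes are
    approximated from General_Category (Mn/Me standing in for Grapheme
    Extend, L* for ALetter, Cc for control codes), and the names
    base_class and segments are made up for illustration. Each character
    gets exactly one class, a mark with a base inherits the base's class,
    and a mark with no base (start of text or after a control code) is
    treated as ALetter.

```python
import unicodedata

# Toy classes, assumed for illustration only; real implementations use the
# property data published with UAX #29, not General_Category.
ALETTER, EXTEND, CONTROL, OTHER = "ALetter", "Extend", "Control", "Other"


def base_class(ch):
    """Assign each character exactly one toy class (no overlapping classes)."""
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Me"):      # rough stand-in for Grapheme Extend
        return EXTEND
    if cat.startswith("L"):      # rough stand-in for ALetter
        return ALETTER
    if cat == "Cc":
        return CONTROL
    return OTHER


def segments(text):
    """Split text in one left-to-right pass, keeping a single state variable."""
    out = []
    prev = None  # resolved class of the previous character; None at start of text
    for ch in text:
        cls = base_class(ch)
        if cls == EXTEND:
            has_base = prev is not None and prev != CONTROL
            if has_base:
                cls = prev       # the mark behaves like its base character
                join = True      # and never separates from it
            else:
                cls = ALETTER    # a mark with no base is treated as a letter
                join = False
        else:
            # In this toy model only letter-letter pairs stay together.
            join = cls == ALETTER and prev == ALETTER
        if join:
            out[-1] += ch
        else:
            out.append(ch)
        prev = cls
    return out
```

    With these assumptions, a mark at the start of the text simply joins
    the following letters, and no character needs to be looked at twice.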

       -- Andy

    > It might be worth stepping back and asking the question: What is the
    > purpose of publishing word-breaking behavior as part of the Unicode
    > Standard?
    >
    > The answer to this question is neither easy nor obvious. Part of the
    > problem is that what constitutes a 'word' is subject to tailoring. In
    > certain languages and/or certain situations, implementations may need to
    > make different choices in behavior than the ones we document.
    > That puts a natural limit on how 'accurate' our default rules can and
    > should be for certain rare and special cases.
    >
    > On the other hand, certain characters explicitly have word-connecting
    > semantics. Part of our wordbreaking rules is to provide a convenient
    > description of that behavior (together with a list of characters
    > affected). In some instances, such special semantics are not subject to
    > tailoring, as doing so would subvert the reason for the existence of a
    > given character in the Unicode Standard. Note that in the description of
    > linebreaking, which has similar issues to the ones discussed here, such
    > characters are called out explicitly.
    >
    > Finally, there are some complexities that result from the overall
    > architecture of the Unicode Standard (and which in turn are driven by
    > complexities in the writing systems it attempts to cover). The existence
    > of combining marks and grapheme clusters is one of them. One of the
    > purposes of providing the word break rules must be to give accurate
    > guidance on handling these complexities. That in turn argues for (rather
    > than against) detailed specifications of edge cases involving combining
    > marks, absence of base characters, etc., so that not all implementers
    > have to rediscover such issues independently.
    >
    > After this preamble, what can we conclude about the issue?
    >
    > The rule "treat combining marks like their base character" works well if
    > the base character has definite behavior, i.e. is either a letter or a
    > symbol. In such cases the combining sequence acts like a letter or
    > symbol (or number, or punctuation mark) and that most naturally follows
    > from the fact that most combining marks are used as distinguishing
    > decorations on characters, or act like letters following other letters.
    > (The use of combining marks on symbols or punctuation is in the domain
    > of specialized notations, such as mathematics or musical notation, both
    > of which can be expected to heavily tailor word breaking and other
    > default algorithms.)
    >
    > However, when NBSP is used as base character stand-in (the use of SPACE
    > for this will be deprecated in 4.1 for reasons that should have become
    > abundantly clear in this discussion alone) the resultant combining
    > sequence is best treated as a letter, not as a space. In line breaking,
    > the use of NBSP has that effect, as the use of NBSP and letters are very
    > similar, but in word breaking this is not true.
    >
    > It might make sense to explicitly add these rules:
    >
    > 1) treat all sequences of NBSP followed by combining marks as ALetter
    > 2) treat all combining marks w/o base character (i.e. start of text,
    > after control codes) as ALetter
    >
    > (rule 2 is implicitly captured by making Grapheme_Extend < ALetter).
    >
    > However, in linebreaking we found that rules of the form
    > "treat all x followed by combining marks as y" are difficult to
    > implement, with fewer difficulties for the case where x = y.
    >
    > Adding an INVISIBLE, zero-width, base-letter of class ALetter for
    > wordbreak and AL for linebreak would allow those who manage to enter
    > one of them into a text to specify precisely that any combining sequence
    > with this base letter should be treated like a letter for these
    > purposes. It would also allow the placement of diacritical marks
    > *between* letters (as cited by J. Knappen), but only if layout engines
    > implement a true zero width for the combining sequence. (The latter
    > would clash with the use of this character in citing a standalone
    > diacritic, where some width usually is desirable).
    >
    > The biggest problem in adding a new letter is that all rendering
    > implementations would need to be updated to recognize it, and all input
    > methods would have to be updated to allow users access to it. Especially
    > due to the latter, the use of SPACE (and possibly NBSP) for this purpose
    > will continue.
    >
    > There are no easy answers.
    > A./


