From: Asmus Freytag (firstname.lastname@example.org)
Date: Tue Sep 14 2004 - 17:59:52 CDT
It might be worth stepping back and asking the question: What is the
purpose of publishing word-breaking behavior as part of the Unicode Standard?
The answer to this question is neither easy nor obvious. Part of the
problem is that what constitutes a 'word' is subject to tailoring. In
certain languages and/or certain situations, implementations may need to
make different choices in behavior than the ones we document.
That puts a natural limit on how 'accurate' our default rules can and
should be for certain rare and special cases.
On the other hand, certain characters explicitly have word-connecting
semantics. Part of the purpose of our wordbreaking rules is to provide a convenient
description of that behavior (together with a list of characters affected).
In some instances, such special semantics are not subject to tailoring, as
doing so would subvert the reason for the existence of a given character in
the Unicode Standard. Note that in the description of linebreaking, which
has similar issues to the ones discussed here, such characters are called
Finally, there are some complexities that result from the overall
architecture of the Unicode Standard (and which in turn are driven by
complexities in the writing systems it attempts to cover). The existence of
combining marks and grapheme clusters is one of them. One of the purposes
of providing the word break rules must be to give accurate guidance on
handling these complexities. That in turn argues for (rather than against)
detailed specifications of edge cases involving combining marks, absence of
base characters, etc., so that not all implementers have to rediscover such
After this preamble, what can we conclude about the issue?
The rule "treat combining marks like their base character" works well if
the base character has definite behavior, i.e. is either a letter or a
symbol. In such cases the combining sequence acts like a letter or symbol
(or number, or punctuation mark) and that most naturally follows from the
fact that most combining marks are used as distinguishing decorations on
characters, or act like letters following other letters. (The use of
combining marks on symbols or punctuation is in the domain of specialized
notations, such as mathematics or musical notation, both of which can be
expected to heavily tailor word breaking and other default algorithms.)
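
To make the "treat combining marks like their base character" rule concrete, here is a rough Python sketch. It is not the UAX #29 algorithm: Python's standard library does not expose the Word_Break property, so the hypothetical helper coarse_class() approximates it from the General_Category, and the class names are used loosely.

```python
import unicodedata

def coarse_class(ch):
    """Rough stand-in for Word_Break (assumption: derived from General_Category)."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "ALetter"
    if cat.startswith("N"):
        return "Numeric"
    if cat.startswith("M"):
        return "Extend"
    return "Other"

def classify_sequences(text):
    """Attach each combining mark to the preceding character.

    The whole combining sequence keeps the class of its base.
    A mark with nothing before it stays a group of its own.
    """
    groups = []
    for ch in text:
        cls = coarse_class(ch)
        if cls == "Extend" and groups:
            base_text, base_cls = groups[-1]
            groups[-1] = (base_text + ch, base_cls)
        else:
            groups.append((ch, cls))
    return groups
```

With this sketch, "e" followed by U+0301 is one ALetter sequence, while NBSP followed by U+0301 inherits the class of NBSP and so does not count as a letter.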
However, when NBSP is used as a base character stand-in (the use of SPACE for
this will be deprecated in 4.1 for reasons that should have become
abundantly clear in this discussion alone) the resultant combining sequence
is best treated as a letter, not as a space. In line breaking, the use of
NBSP has that effect, because NBSP and letters behave very similarly, but
in word breaking this is not true.
It might make sense to explicitly add these rules:
1) treat all sequences of NBSP followed by combining marks as ALetter
2) treat all combining marks w/o base character (i.e. start of text, after
control codes) as ALetter
(rule 2 is implicitly captured by making Grapheme_Extend < ALetter).
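
As an illustration only, the two rules above might be sketched in Python as follows. The function names and the two-class model (ALetter vs. Other) are assumptions made for this example, "control codes" is read here as General_Category Cc, and General_Category again stands in for the real Word_Break property.

```python
import unicodedata

NBSP = "\u00a0"

def is_mark(ch):
    # Assumption: any General_Category M* counts as a combining mark.
    return unicodedata.category(ch).startswith("M")

def is_control(ch):
    return unicodedata.category(ch) == "Cc"

def base_class(ch):
    # Simplified two-class model for the sketch.
    return "ALetter" if unicodedata.category(ch).startswith("L") else "Other"

def classify(text):
    groups = []  # each entry is [sequence, class]
    for ch in text:
        if not is_mark(ch):
            groups.append([ch, base_class(ch)])
            continue
        prev = groups[-1] if groups else None
        if prev is None or (len(prev[0]) == 1 and is_control(prev[0])):
            # Rule 2: a mark with no base (start of text, after a control code).
            groups.append([ch, "ALetter"])
        else:
            prev[0] += ch
            if prev[0][0] == NBSP:
                # Rule 1: NBSP followed by combining marks acts as a letter.
                prev[1] = "ALetter"
    return [tuple(g) for g in groups]
```

A bare NBSP keeps its own class in this sketch; only the sequence NBSP plus marks is reclassified, which is the "x followed by marks becomes y" shape with x different from y.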
However, in linebreaking we found that rules of the form
"treat all x followed by combining marks as y" are difficult to implement,
with fewer difficulties for the case where x = y.
Adding an INVISIBLE, zero-width, base-letter of class ALetter for wordbreak
and AL for linebreak would allow those who manage to enter one of them
into a text to specify precisely that any combining sequence with this base
letter should be treated like a letter for these purposes. It would also
allow the placement of diacritical marks *between* letters (as cited by J.
Knappen), but only if layout engines implement a true zero width for the
combining sequence. (The latter would clash with the use of this character
in citing a standalone diacritic, where some width usually is desirable).
The biggest problem in adding a new letter is that all rendering
implementations would need to be updated to recognize it, and all input
methods would have to be updated to allow users access to it. Especially
due to the latter, the use of SPACE (and possibly NBSP) for this purpose
There are no easy answers.