L2/01-413

From: Mark Davis
Sent: Tuesday, October 30, 2001 3:08 PM

Subject:  Grapheme Break 


I want to reiterate that I think we made a mistake by not including the
inverse of GRAPHEME JOINER, and that we do need to take some action.

The much more important requirement from the field is for the inverse: the
grapheme break (GB). This would be used to indicate that a particular
sequence in a given language is *not* actually a grapheme cluster. That
would allow, for example, Slovak dictionaries and databases to flag with a
GB the 1% of cases where "ch" is to be sorted as two separate characters,
rather than having to flag with a GJ the 99% of cases where it *is*
considered a single character. [Another example, which has come up on the
Unicode list recently, is "aa" in Danish.] It is clearly preferable to flag
the exceptions rather than the normal cases in those languages.
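To make the intended behavior concrete, here is a minimal sketch of how a
collation tokenizer might honor such a marker. It is illustrative only: no
GB character is encoded, so a Private Use code point (U+E000) stands in for
it, and the function name slovak_collation_units is hypothetical.

```python
# Hypothetical stand-in for the proposed GRAPHEME BREAK (GB) character.
# No such character is encoded; a Private Use code point is used here
# purely for illustration.
GB = "\uE000"


def slovak_collation_units(text):
    """Split text into collation units.

    "ch" is treated as a single unit (the normal case), unless the
    stand-in GB marker sits between the "c" and the "h" (the exception).
    The marker itself never appears in the output.
    """
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == GB:
            # The marker only affects grouping; it is not a unit itself.
            i += 1
        elif ch.lower() == "c" and i + 1 < len(text) and text[i + 1].lower() == "h":
            # Unmarked "ch": one collation unit.
            units.append(text[i:i + 2])
            i += 2
        else:
            units.append(ch)
            i += 1
    return units
```

With this approach only the exceptional sequences carry a marker: an
unmarked "ch" is grouped, while "c" + GB + "h" yields two separate units.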

Let's look at the alternatives for breaking grapheme clusters:

- ZWSP (aka 'allow line break'). Won't work, since it allows a line break
at that point.

- SHY (soft hyphen). Won't work, since the position may not be a
hyphenation point.

- ZWJ & ZWNJ. Won't work, since they can cause/break ligatures / cursive
connections where not desired.

- ZWNBSP: May work. The only one that we might be able to overload is
ZWNBSP (aka 'disallow line break'), or better yet, its new semantic
replacement WORD JOINER. I believe that such an overload would work for
Latin and most other scripts. It would not work for a script that:

(a) allows line break between letters and would thus need WORD JOINER to
manually indicate specific positions that disallow line break (Thai and
other languages that break between letters), AND

(b) has multi-base character grapheme clusters in collation, AND

(c) sometimes treats those multi-base character grapheme clusters as
separate letters in collation.

Scripts and/or situations in which all three of these conditions are
fulfilled may be so unusual that we could stretch the semantics for WORD
JOINER, and avoid encoding the inverse function as a separate character.

However, the biggest barrier to this is that the two semantics conflict
conceptually to a high degree: *join* words vs. *break* grapheme clusters.

- Others. The other possibilities are even uglier. Here is the set of Cf
and Cc characters:

0000..001F    ; Cc #  [32] <control>..<control>
007F..009F    ; Cc #  [33] <control>..<control>
070F          ; Cf #       SYRIAC ABBREVIATION MARK
180B..180E    ; Cf #   [4] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN VOWEL SEPARATOR
200C..200F    ; Cf #   [4] ZERO WIDTH NON-JOINER..RIGHT-TO-LEFT MARK
202A..202E    ; Cf #   [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
206A..206F    ; Cf #   [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
FEFF          ; Cf #       ZERO WIDTH NO-BREAK SPACE
FFF9..FFFB    ; Cf #   [3] INTERLINEAR ANNOTATION ANCHOR..INTERLINEAR ANNOTATION TERMINATOR
1D173..1D17A  ; Cf #   [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
E0001         ; Cf #       LANGUAGE TAG
E0020..E007F  ; Cf #  [96] TAG SPACE..CANCEL TAG
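
A listing of this kind can be regenerated with a short script. The sketch
below uses Python's standard unicodedata module; the helper names
category_ranges and format_range are my own. Note that its output reflects
whatever Unicode version ships with the Python interpreter, so it will
differ from the listing above.

```python
import sys
import unicodedata


def category_ranges(categories):
    """Return (first, last, category) runs of code points whose
    General_Category is in `categories`, scanning the whole code space."""
    ranges = []
    start = None
    current = None
    for cp in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(cp))
        if cat != current:
            # A run of identical categories just ended; keep it if wanted.
            if current in categories:
                ranges.append((start, cp - 1, current))
            start, current = cp, cat
    if current in categories:
        ranges.append((start, sys.maxunicode, current))
    return ranges


def format_range(first, last, cat):
    """Format one run in the 'XXXX..YYYY    ; Cc' style used above."""
    if first == last:
        span = f"{first:04X}"
    else:
        span = f"{first:04X}..{last:04X}"
    return f"{span:<14}; {cat}"


if __name__ == "__main__":
    for first, last, cat in category_ranges({"Cc", "Cf"}):
        print(format_range(first, last, cat))
```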

I think by far the cleanest thing to do is to encode another character.
However, should we decide against that, we need to decide which of the
above should have its semantics enlarged ("stretched") to encompass the
usage.