L2/01-413

From: Mark Davis
Sent: Tuesday, October 30, 2001 3:08 PM
Subject: Grapheme Break

I want to reiterate that I think we made a mistake by not including the inverse of GRAPHEME JOINER, and that we do need to take some action.

The much more important requirement from the field is for the inverse: the grapheme break (GB). This would be used to indicate that a particular sequence in a given language is *not* actually a grapheme cluster. That would allow, for example, Slovak dictionaries and databases to flag with a GB the 1% of cases where "ch" in Slovak is to be sorted as two separate characters, and not require flagging with a GJ the 99% of cases where it *is* considered a single character in Slovak. [Another example, which has come up on the Unicode list recently, is "aa" in Danish.] It is clearly preferable to flag the exceptions rather than the normal cases in those languages.

Let's look at the alternatives for breaking grapheme clusters:

- ZWSP (aka 'allow line break'). Won't work, since it allows a line break at that point.

- SHY (soft hyphen). Won't work, since the position may not be a hyphenation point.

- ZWJ & ZWNJ. Won't work, since they can cause/break ligatures / cursive connections where not desired.

- ZWNBSP: May work. The only one that we might be able to overload is ZWNBSP (aka 'disallow line break'), or better yet, its new semantic replacement WORD JOINER. I believe that such an overload would work for Latin and most other scripts. It would not work for a script that:

  (a) allows line break between letters and would thus need WORD JOINER to manually indicate specific positions that disallow line break (Thai and other languages that break between letters), AND
  (b) has multi-base character grapheme clusters in collation, AND
  (c) sometimes treats those multi-base character grapheme clusters as separate letters in collation.
  Scripts and/or situations in which all three of these conditions are fulfilled may be so unusual that we could stretch the semantics for WORD JOINER, and avoid encoding the inverse function as a separate character. However, the biggest barrier to this is that the semantics conflict conceptually to such a high degree: *join* words vs *break* grapheme clusters.

- Others. The other possibilities are even uglier. Here is the set of Cf & Cc's:

  0000..001F   ; Cc # [32] <control>..<control>
  007F..009F   ; Cc # [33] <control>..<control>
  070F         ; Cf #      SYRIAC ABBREVIATION MARK
  180B..180E   ; Cf #  [4] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN VOWEL SEPARATOR
  200C..200F   ; Cf #  [4] ZERO WIDTH NON-JOINER..RIGHT-TO-LEFT MARK
  202A..202E   ; Cf #  [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
  206A..206F   ; Cf #  [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
  FEFF         ; Cf #      ZERO WIDTH NO-BREAK SPACE
  FFF9..FFFB   ; Cf #  [3] INTERLINEAR ANNOTATION ANCHOR..INTERLINEAR ANNOTATION TERMINATOR
  1D173..1D17A ; Cf #  [8] MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
  E0001        ; Cf #      LANGUAGE TAG
  E0020..E007F ; Cf # [96] TAG SPACE..CANCEL TAG

I think by far the cleanest thing to do is to encode another character. However, should we decide against that, we need to decide which of the above should have its semantics enlarged ("stretched") to encompass the usage.
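
To make the Slovak "ch" case above concrete, here is a rough sketch of how a collation routine might treat a grapheme break. Everything in it is an assumption for illustration only: no GB character exists, so a private-use code point (U+E000) stands in for it, the alphabet is a toy subset, and the names GB, ALPHABET, collation_elements and sort_key are hypothetical. It is not a description of any real collation implementation.

```python
# Hypothetical stand-in for the proposed GRAPHEME BREAK (GB). No such
# character is assigned, so a private-use code point is used here.
GB = "\uE000"

# Toy Slovak-like ordering (assumption): "ch" is one letter sorting after "h".
ALPHABET = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k",
            "l", "m", "n", "o", "p", "r", "s", "t", "u", "v", "z"]
WEIGHT = {letter: i for i, letter in enumerate(ALPHABET)}


def collation_elements(text):
    """Split text into collation elements.

    "ch" is treated as a single element by default (the normal case);
    a GB between "c" and "h" flags the exception and keeps them apart.
    """
    elements = []
    i = 0
    while i < len(text):
        if text[i] == GB:
            i += 1  # GB only blocks the contraction; it has no weight itself
        elif text.startswith("ch", i):
            elements.append("ch")
            i += 2
        else:
            elements.append(text[i])
            i += 1
    return elements


def sort_key(text):
    """Map a string to a list of weights; unknown letters sort last."""
    return [WEIGHT.get(e, len(ALPHABET))
            for e in collation_elements(text.lower())]
```

With these definitions, sorted(["chata", "c" + GB + "hata", "hora", "cena", "ihla"], key=sort_key) places the unflagged "chata" after "hora" (since "ch" sorts after "h"), while the flagged string sorts among the "c" words. The string with GB is an artificial test input, not a real Slovak word. Only the exceptional string needs a mark, which is the point of preferring GB over GJ for such languages.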