L2/02-043

To: UTC
Re: Default Grapheme Cluster Boundaries
From: Mark Davis
Date: 2001-01-24

There were some errors in the draft Unicode 3.2 Default Grapheme Cluster Boundaries table. Here is a proposal to address them, as a basis for discussion at the UTC meeting. The proposal also requires some changes to the data backing the categories Link, Extend, and Base, so those classes would need to be regenerated.

Note: In the past, we have tended to keep the categories (except Any) disjoint. The following table does not, in order to keep the rules simpler.

Note: We could make the grapheme cluster a purely pairwise determination (a good thing), if we were willing to replace rules (4) and (5) below with the following:

Link × ( IndicBase | Join_Control )  (4')
Join_Control × IndicBase  (5')

The effect of this change is that, in the absence of a Link, a Join_Control would belong to the following Indic grapheme cluster. It should not have any effect in practice on well-formed Indic text.
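
To make the difference concrete, here is a minimal Python sketch comparing rules (4) and (5) with the pairwise alternative (4') and (5'). Each character is modeled as the set of class names it belongs to. The names KA, VIRAMA, and ZWJ and both function names are illustrative assumptions, not part of the proposal, and only these rules are modeled (rule (2) is left out).

```python
# Illustrative stand-ins for characters: each is the set of classes it is in.
# These sets are assumptions for the example, not data taken from the UCD.
KA = {"Base", "IndicBase"}        # an Indic consonant
VIRAMA = {"Link", "Extend"}       # a virama
ZWJ = {"Join_Control", "Extend"}  # a join control


def no_break_original(prev2, prev, nxt):
    """Rules (4) and (5). Rule (5) needs two characters of look-behind."""
    if "Link" in prev and "IndicBase" in nxt:
        return True  # (4) Link x IndicBase
    if (prev2 is not None and "Link" in prev2
            and "Join_Control" in prev and "IndicBase" in nxt):
        return True  # (5) Link Join_Control x IndicBase
    return False


def no_break_pairwise(prev, nxt):
    """Rules (4') and (5'). Only the two adjacent characters are examined."""
    if "Link" in prev and ("IndicBase" in nxt or "Join_Control" in nxt):
        return True  # (4') Link x ( IndicBase | Join_Control )
    if "Join_Control" in prev and "IndicBase" in nxt:
        return True  # (5') Join_Control x IndicBase
    return False


# KA VIRAMA ZWJ KA: both formulations keep ZWJ attached to the following KA.
same = no_break_original(VIRAMA, ZWJ, KA) == no_break_pairwise(ZWJ, KA)

# KA ZWJ KA (no Link): only the pairwise formulation attaches ZWJ to KA.
differs = no_break_original(KA, ZWJ, KA) != no_break_pairwise(ZWJ, KA)
```

The two formulations agree whenever the Join_Control follows a Link, and differ only in the case described above, where no Link precedes the Join_Control.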

Note: Ideally, we would have another derived property, IndicBase. However, if we want to avoid adding one at this point, we could provide only the textual specification and leave the actual construction to implementers.


Table 5-3. Default Grapheme Cluster Boundaries


Character Classes

CR Carriage Return
LF Line Feed
CGJ Combining Grapheme Joiner
Join_Control Join_Control, as determined by the UCD.
Link Grapheme_Link, as determined by the UCD.  Includes most viramas but not the grapheme joiner.
Extend Grapheme_Extend, as determined by the UCD. Includes combining marks, all characters in Link, the CGJ, format controls (including Join_Control), and variation selectors.
Base Grapheme_Base, as determined by the UCD. Also includes L, V, T, LV, LVT.
IndicBase Base characters from any Indic script that has a character in Link.
L Hangul leading jamo U+1100..U+115F
V Hangul vowel jamo U+1160..U+11A2
T Hangul trailing jamo U+11A8..U+11F9
LV Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V>
LVT Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V,T>
Any Any character (includes all of the above)
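
As an illustration of the Hangul classes above, the following Python sketch assigns a character to L, V, T, LV, or LVT. The jamo ranges are the ones listed in the table. The precomposed syllable range U+AC00..U+D7A3 and the modulo-28 test come from the Hangul syllable composition arithmetic in the Unicode Standard, not from this proposal, and the function name hangul_class is an assumption made for the example.

```python
S_BASE = 0xAC00   # first precomposed Hangul syllable
S_LAST = 0xD7A3   # last precomposed Hangul syllable
T_COUNT = 28      # trailing-consonant slots per <L,V> pair (slot 0 = none)


def hangul_class(ch):
    """Return 'L', 'V', 'T', 'LV', 'LVT', or None for non-Hangul input."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A2:
        return "V"
    if 0x11A8 <= cp <= 0x11F9:
        return "T"
    if S_BASE <= cp <= S_LAST:
        # A syllable with trailing index 0 decomposes to <L,V>;
        # any other trailing index means it decomposes to <L,V,T>.
        return "LV" if (cp - S_BASE) % T_COUNT == 0 else "LVT"
    return None
```

For example, U+AC00 is classified as LV and U+AC01 as LVT, since the latter carries a trailing consonant.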

Rules

Do not break between a CR and an LF.

CR × LF (1)

Do not break between a base character and a combining mark, or within a sequence of combining marks.

( Base | Extend ) × Extend (2)

Do not break between a CGJ and a base letter.

CGJ × Base  (3)

Do not break between link characters and base characters. Do not break around a join control if it is preceded by a link and followed by a base. These rules provide for Indic graphemes, where virama (halant) will link character clusters together, and join controls can affect the display.

Link × IndicBase  (4)
Link Join_Control × IndicBase  (5)

Do not break Hangul syllable sequences.

L × ( L | V | LV | LVT ) (6)
( LV | V ) × ( V | T ) (7)
( LVT | T ) × T (8)

If none of the above rules applies, break after every character.

Any ÷ (9)
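
The rules above can be sketched as a small Python segmenter. This is only a rough illustration under several stated assumptions, not a conformant implementation: the real Link, Extend, and Base classes come from the UCD, whereas here Link is approximated by canonical combining class 9, Extend by general categories M* and Cf, Base by everything else outside the C* and Zl/Zp categories, and IndicBase by letters (Lo) in U+0900..U+0DFF. The function names classes, no_break, and clusters are hypothetical.

```python
import unicodedata

CGJ = "\u034F"
JOIN_CONTROLS = {"\u200C", "\u200D"}  # ZWNJ, ZWJ


def classes(ch):
    """Approximate, non-disjoint class membership for one character."""
    cp, cat = ord(ch), unicodedata.category(ch)
    cls = {"Any"}
    if ch == "\r":
        cls.add("CR")
    if ch == "\n":
        cls.add("LF")
    if ch == CGJ:
        cls.add("CGJ")
    if ch in JOIN_CONTROLS:
        cls.add("Join_Control")
    if unicodedata.combining(ch) == 9:        # assumption: ccc 9 ~ Link
        cls.add("Link")
    if cat[0] == "M" or cat == "Cf":          # assumption: ~ Extend
        cls.add("Extend")
    elif cat[0] not in "CZ" or cat == "Zs":   # assumption: ~ Base
        cls.add("Base")
    if cat == "Lo" and 0x0900 <= cp <= 0x0DFF:  # assumption: ~ IndicBase
        cls.add("IndicBase")
    if 0x1100 <= cp <= 0x115F:
        cls.add("L")
    elif 0x1160 <= cp <= 0x11A2:
        cls.add("V")
    elif 0x11A8 <= cp <= 0x11F9:
        cls.add("T")
    elif 0xAC00 <= cp <= 0xD7A3:              # precomposed syllables
        cls.add("LV" if (cp - 0xAC00) % 28 == 0 else "LVT")
    return cls


def no_break(text, i):
    """True if rules (1)-(8) forbid a break between text[i-1] and text[i]."""
    a, b = classes(text[i - 1]), classes(text[i])
    if "CR" in a and "LF" in b:
        return True                                    # (1)
    if a & {"Base", "Extend"} and "Extend" in b:
        return True                                    # (2)
    if "CGJ" in a and "Base" in b:
        return True                                    # (3)
    if "Link" in a and "IndicBase" in b:
        return True                                    # (4)
    if (i >= 2 and "Link" in classes(text[i - 2])
            and "Join_Control" in a and "IndicBase" in b):
        return True                                    # (5)
    if "L" in a and b & {"L", "V", "LV", "LVT"}:
        return True                                    # (6)
    if a & {"LV", "V"} and b & {"V", "T"}:
        return True                                    # (7)
    if a & {"LVT", "T"} and "T" in b:
        return True                                    # (8)
    return False                                       # (9) break


def clusters(text):
    """Split text into default grapheme clusters under the sketch above."""
    out, start = [], 0
    for i in range(1, len(text)):
        if not no_break(text, i):
            out.append(text[start:i])
            start = i
    if text:
        out.append(text[start:])
    return out
```

Because rules (1) through (8) all forbid breaks and rule (9) is the only one that allows them, the order in which the first eight are tested does not affect the result. Rule (5) is the only one that looks further back than the immediately preceding character.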