Re: Default Grapheme Cluster Boundaries
There were some errors in the draft Unicode 3.2 Default Grapheme Cluster Boundaries table. Here is a proposal to address them, as a basis for discussion at the UTC meeting. The proposal also requires some changes to the data backing the categories Link, Extend, and Base, so those classes would need to be regenerated.
Note: In the past, we have tended to make the categories (except Any) disjoint. The following table does not do that, in order to keep the rules simpler.
Note: We could make grapheme cluster boundary determination purely pairwise (a good thing) if we were willing to make the following change:
Link × ( IndicBase | Join_Control ) (4')
Join_Control × IndicBase (5')
In the absence of a preceding Link, this change would make a Join_Control belong to the following Indic grapheme cluster. It should have no practical effect on well-formed Indic text.
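As an illustration of the difference, here is a minimal Python sketch that compares the Indic rules (4) and (5) from the table below with the pairwise variants (4') and (5'). The function names and the representation of a character as a set of class names are assumptions made for this example; they are not part of the proposal.

```python
# A character is modeled as a set of class names, because the classes
# in this proposal are not disjoint. This representation is an
# assumption of the sketch.

def indic_no_break_original(prev2, prev, nxt):
    """Rules (4) and (5): needs up to two characters of lookbehind."""
    # (4) Link x IndicBase
    if "Link" in prev and "IndicBase" in nxt:
        return True
    # (5) Link Join_Control x IndicBase
    return (prev2 is not None and "Link" in prev2
            and "Join_Control" in prev and "IndicBase" in nxt)


def indic_no_break_pairwise(prev, nxt):
    """Rules (4') and (5'): looks only at the two adjacent characters."""
    # (4') Link x ( IndicBase | Join_Control )
    if "Link" in prev and ("IndicBase" in nxt or "Join_Control" in nxt):
        return True
    # (5') Join_Control x IndicBase
    return "Join_Control" in prev and "IndicBase" in nxt


# Hypothetical class assignments for a consonant, a virama, and a joiner.
ka = {"Base", "IndicBase"}
virama = {"Link", "Extend"}
zwj = {"Join_Control", "Extend"}

# With a preceding Link, both formulations keep the joiner with the base.
print(indic_no_break_original(virama, zwj, ka), indic_no_break_pairwise(zwj, ka))
# Without a preceding Link, only the pairwise form suppresses the break.
print(indic_no_break_original(None, zwj, ka), indic_no_break_pairwise(zwj, ka))
```

The two formulations differ only in the second case, a Join_Control with no Link before it, which is the case described above.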
Note: Ideally, we would have another derived property, IndicBase. However, if we want to avoid adding one at this point, we could give only the textual specification and leave the actual construction to implementers.
Table 5-3. Default Grapheme Cluster Boundaries
CR: Carriage Return
LF: Line Feed
CGJ: Combining Grapheme Joiner
Join_Control: Join_Control, as determined by the UCD.
Link: Grapheme_Link, as determined by the UCD. Includes most viramas but not the grapheme joiner.
Extend: Grapheme_Extend, as determined by the UCD. Includes combining marks, all characters in Link, the CGJ, format controls (including Join_Control), and variation selectors.
Base: Grapheme_Base, as determined by the UCD. Also includes L, V, T, LV, LVT.
IndicBase: Base characters from any Indic script that has a character in Link.
L: Hangul leading jamo U+1100..U+115F
V: Hangul vowel jamo U+1160..U+11A2
T: Hangul trailing jamo U+11A8..U+11F9
LV: Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V>
LVT: Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V,T>
Any: Any character (includes all of the above)
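The Hangul classes can be computed without a data file. The following Python sketch uses the jamo ranges listed above. For LV and LVT it assumes the standard Hangul syllable arithmetic (syllable block starting at U+AC00, 11172 syllables, 28 trailing-consonant slots), which is not stated in this proposal. The function name hangul_class is hypothetical.

```python
# Constants for precomposed Hangul syllables. These come from the
# standard Hangul composition arithmetic, not from the table above.
S_BASE = 0xAC00   # first precomposed syllable
S_COUNT = 11172   # number of precomposed syllables
T_COUNT = 28      # trailing-consonant slots per <L,V> pair (slot 0 = none)


def hangul_class(cp):
    """Return 'L', 'V', 'T', 'LV', 'LVT', or None for a code point."""
    if 0x1100 <= cp <= 0x115F:      # leading jamo range from the table
        return "L"
    if 0x1160 <= cp <= 0x11A2:      # vowel jamo range from the table
        return "V"
    if 0x11A8 <= cp <= 0x11F9:      # trailing jamo range from the table
        return "T"
    if S_BASE <= cp < S_BASE + S_COUNT:
        # A syllable with no trailing consonant decomposes to <L,V>;
        # every other syllable decomposes to <L,V,T>.
        return "LV" if (cp - S_BASE) % T_COUNT == 0 else "LVT"
    return None
```

For example, hangul_class(0xAC00) gives "LV" and hangul_class(0xAC01) gives "LVT".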
Do not break between a CR and LF
CR × LF (1)
Do not break between a base character and a combining mark, or within a sequence of combining marks.
( Base | Extend ) × Extend (2)
Do not break between a CGJ and a base letter.
CGJ × Base (3)
Do not break between link characters and base characters. Do not break around a join control if it is preceded by a link and followed by a base. These rules provide for Indic graphemes, where virama (halant) will link character clusters together, and join controls can affect the display.
Link × IndicBase (4)
Link Join_Control × IndicBase (5)
Do not break Hangul syllable sequences.
L × ( L | V | LV | LVT ) (6)
( LV | V ) × ( V | T ) (7)
( LVT | T ) × T (8)
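Rules (6) through (8) depend only on the Hangul class of the two adjacent characters, so they can be sketched as a single pairwise predicate. The function name hangul_no_break is hypothetical, and the classes are passed as plain strings for simplicity.

```python
def hangul_no_break(left, right):
    """Return True if rules (6)-(8) forbid a break between left and right.

    Both arguments are Hangul class names: 'L', 'V', 'T', 'LV', or 'LVT'.
    Any other value is treated as non-Hangul.
    """
    if left == "L":
        return right in {"L", "V", "LV", "LVT"}   # rule (6)
    if left in {"LV", "V"}:
        return right in {"V", "T"}                # rule (7)
    if left in {"LVT", "T"}:
        return right == "T"                       # rule (8)
    return False
```

Under these rules a sequence such as L V T stays in one cluster, while a T followed by an L starts a new one.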
If none of the above rules apply, break after every character.
Any ÷ (9)
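Taken together, rules (1) through (9) can be applied in a single left-to-right pass. The Python sketch below is one possible reading, not a reference implementation. It assumes each character has already been mapped to a set of the class names from the table (the classes overlap, so a set is used), and the names no_break and clusters are hypothetical.

```python
def no_break(prev2, prev, nxt):
    """True if some rule (1)-(8) forbids a break between prev and nxt.

    prev2 is the character before prev, or None at the start of text.
    Each character is a set of class names from the table.
    """
    if "CR" in prev and "LF" in nxt:                               # (1)
        return True
    if ("Base" in prev or "Extend" in prev) and "Extend" in nxt:   # (2)
        return True
    if "CGJ" in prev and "Base" in nxt:                            # (3)
        return True
    if "Link" in prev and "IndicBase" in nxt:                      # (4)
        return True
    if (prev2 is not None and "Link" in prev2
            and "Join_Control" in prev and "IndicBase" in nxt):    # (5)
        return True
    if "L" in prev and nxt & {"L", "V", "LV", "LVT"}:              # (6)
        return True
    if prev & {"LV", "V"} and nxt & {"V", "T"}:                    # (7)
        return True
    if prev & {"LVT", "T"} and "T" in nxt:                         # (8)
        return True
    return False                                                   # (9)


def clusters(chars):
    """Split a list of class sets into (start, end) index ranges."""
    if not chars:
        return []
    out, start = [], 0
    for i in range(1, len(chars)):
        prev2 = chars[i - 2] if i >= 2 else None
        if not no_break(prev2, chars[i - 1], chars[i]):
            out.append((start, i))
            start = i
    out.append((start, len(chars)))
    return out


# Hypothetical input: base letter, combining mark, CR, LF, base letter.
text = [{"Base"}, {"Extend"}, {"CR"}, {"LF"}, {"Base"}]
print(clusters(text))
```

Only rule (5) needs the extra character of lookbehind. With the pairwise variants (4') and (5') from the note above, the prev2 argument could be dropped.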