Rules for default grapheme clustering


Kent Karlsson




            Grapheme clustering must be done in a way that is independent of where in a combining sequence a Grapheme_link (combining) character occurs, since most Grapheme_links are of combining class 9, and thus are movable when doing canonical reordering.  Therefore a Grapheme_link need not be last in a combining sequence, and even if it is, it need not be the last combining character in the sequence after normalisation.
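            The effect of canonical reordering on a class 9 link character can be checked with a small sketch. It uses Python's standard unicodedata module; the Devanagari characters are chosen here only for illustration and are not taken from the text above.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA, a base letter (combining class 0)
NUKTA = "\u093C"   # DEVANAGARI SIGN NUKTA, combining class 7
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, combining class 9
ACUTE = "\u0301"   # COMBINING ACUTE ACCENT, combining class 230


def nfd(text):
    """Return the canonical decomposition, which applies canonical reordering."""
    return unicodedata.normalize("NFD", text)


def combining_classes(text):
    """List the canonical combining class of each character."""
    return [unicodedata.combining(ch) for ch in text]


# The virama is last as typed, but is no longer last after normalisation.
virama_last = KA + ACUTE + VIRAMA
# The virama is not last as typed, and is moved to the end by normalisation.
virama_first = KA + VIRAMA + NUKTA

print(combining_classes(virama_last), combining_classes(nfd(virama_last)))
print(combining_classes(virama_first), combining_classes(nfd(virama_first)))
```

            In both cases the position of the virama within the combining sequence changes, so a rule that depended on that position would give different clusters for canonically equivalent strings.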

            The rules for grapheme clustering must therefore be independent of where in a combining sequence the link character occurs or, for that matter, where an "enclosing+" character (a wider set than general category Me!) occurs.  The position of the latter does (and should), however, affect the "scope" of the ensuing combining characters in the same grapheme cluster:  an A (an "enclosing+" character), and any combining characters following an A in the same combining sequence, apply to the entire preceding part of the grapheme cluster, not just to its last letter.  Nested clustering is ruled out, since the occurrence of an A stops any further clustering.

            (Note that grapheme clustering is often, but not always, related to collation clustering; Hangul is a major exception.)

            Definitions of symbols used in the rules below:



CR: Carriage Return.

LF: Line Feed.

VT: Line Tabulation.

FF: Form Feed.

JoinControl: Join_Control, as determined by the UCD.

Combining: Any combining mark (M&).  This includes all characters in Link, the variation selectors, and everything in EnclosingCombining and NonEnclosingCombining.

EnclosingCombining: Enclosing_Combining, as determined by the UCD (not yet in the UCD).  A combining mark that is enclosing.  Includes all Me characters and all combining Brahmic-derived dependent vowels.

NonEnclosingCombining: A combining mark that is not enclosing: everything in Combining that is not in EnclosingCombining.

Link: Grapheme_Link, as determined by the UCD.  Includes linking viramas and the combining grapheme joiner.

LogicalOrderException: Logical_Order_Exception, as determined by the UCD.  Some Thai and Lao vowels.

NonCombiningExtender: Grapheme_Extend (MODIFIED!!), as determined by the UCD.  Lm, and some Thai and Lao vowels.  [NonCombiningExtender = Lm + U+0E30 + U+0E32 + U+0E33 + U+0E45 + U+0EB0 + U+0EB2 + U+0EB3 + U+0EBD]  (What about TAMIL SIGN VISARGA?)

SymbolBase: Isolated_Base, as determined by the UCD (not yet in the UCD).  Symbols and punctuation (including spaces).  [SymbolBase = P& + S& + Zs + Cn + Co + LogicalOrderException + NonCombiningExtender]

LetterBase: Grapheme_Base (MODIFIED!!), as determined by the UCD.  Excludes L, V, T, LV, and LVT (which are autoconjoining), as well as symbols and punctuation.  [LetterBase = L& + N& – L – V – T – LV – LVT – LogicalOrderException – NonCombiningExtender]

L: Hangul leading jamo U+1100..U+115F.

V: Hangul vowel jamo U+1160..U+11A2.

T: Hangul trailing jamo U+11A8..U+11F9.

LV: Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V>.

LVT: Precomposed Hangul syllable that is canonically equivalent to a sequence of <L,V,T>.

Any: Any character (includes all of the above).
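            The bracketed formula for the non-combining extenders can be transcribed almost directly. The following is only a sketch: the function name is made up here, and the result for Lm depends on the general category data shipped with the Python version in use, which may differ from the UCD version this proposal was written against.

```python
import unicodedata

# Thai and Lao code points listed explicitly in the formula
# Lm + 0e30 + 0e32 + 0e33 + 0e45 + 0eb0 + 0eb2 + 0eb3 + 0ebd.
EXTRA_EXTENDERS = frozenset(
    {0x0E30, 0x0E32, 0x0E33, 0x0E45, 0x0EB0, 0x0EB2, 0x0EB3, 0x0EBD}
)


def is_non_combining_extender(ch):
    """Hypothetical helper: True for Lm characters and the listed vowels."""
    return unicodedata.category(ch) == "Lm" or ord(ch) in EXTRA_EXTENDERS


print(is_non_combining_extender("\u0E32"))  # THAI CHARACTER SARA AA
print(is_non_combining_extender("\u02B0"))  # MODIFIER LETTER SMALL H (Lm)
print(is_non_combining_extender("\u0E01"))  # THAI CHARACTER KO KAI
```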


            Rules for where there is no grapheme break:


Do not break between a CR and LF, VT or FF (assuming that reports 13 and 14 are revised similarly).



CR × (LF | VT | FF)


Do not break Hangul syllable sequences.  There is no break between L and T since a minimal insertion of fillers gives L  Vf  T.  Combining characters are included at the right hand side here, since L, V, T, LV, and LVT are autoconjoining, and are therefore not included in LetterBase.



L × (L | V | T | LV | LVT | Combining)

(V | LV) × (V | T | Combining)

(T | LVT) × (T | Combining)
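            The Hangul rules can be sketched as a pairwise no-break test. This is an illustration under stated assumptions, not a definitive implementation: the left-hand side of the first rule is taken to be L, the function names are made up here, Combining is approximated by "general category starts with M", and LV is told apart from LVT using the standard arithmetic of the precomposed syllable block U+AC00..U+D7A3 (28 syllables per <L,V> pair, the first of which has no trailing consonant).

```python
import unicodedata


def hangul_class(ch):
    """Classify a character using the jamo ranges given in the definitions."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A2:
        return "V"
    if 0x11A8 <= cp <= 0x11F9:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        # A syllable index divisible by 28 has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    if unicodedata.category(ch).startswith("M"):
        return "Combining"  # approximation of the Combining class
    return "Other"


# Left-hand class -> set of right-hand classes with no break in between.
NO_BREAK = {
    "L": {"L", "V", "T", "LV", "LVT", "Combining"},
    "V": {"V", "T", "Combining"},
    "LV": {"V", "T", "Combining"},
    "T": {"T", "Combining"},
    "LVT": {"T", "Combining"},
}


def hangul_no_break(left, right):
    """True if the Hangul rules forbid a break between left and right."""
    return hangul_class(right) in NO_BREAK.get(hangul_class(left), set())


print(hangul_no_break("\u1100", "\u1161"))  # L followed by V
print(hangul_no_break("\u1100", "\u11A8"))  # L followed by T
print(hangul_no_break("\uAC01", "\u1161"))  # LVT followed by V
```

            Only the Hangul rules are covered; a pair for which this test is false may still be kept together by one of the other rules.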


Do not break between a base character and a combining mark, or within a sequence of combining marks.

(SymbolBase | LetterBase | Combining) × Combining




Do not break (by default) between non-combining preextenders (these have the property Logical_Order_Exception) and a letter, nor between a letter combining sequence and a non-combining postextender (letter modifiers, and some Thai and Lao vowels that would have been combining if the Brahmic script model had been followed fully).





LogicalOrderException × LetterBase

(LetterBase  Combining*) × NonCombiningExtender




The following two rules apply if and only if in addition the match of NonEnclosingCombining+ contains at least one Link character (such a character is combining).

(LetterBase  NonEnclosingCombining+) × LetterBase




(LetterBase  NonEnclosingCombining+  JoinControl) × LetterBase




If none of the above is true, break after any character.

Any ÷ Any







Note that a Link in a letter/digit based combining sequence makes it (the combining sequence) “conjoin” with the next letter/digit combining sequence, but that an EnclosingCombining in the combining sequence makes it non-conjoining and overrides any Link (before or after); this prevents nesting.  Note also that an EnclosingCombining character and any follow-on combining characters apply to (the preceding part of) the cluster, not just the last base in it.
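
The conjoining condition described in this note can be sketched as a position-independent test on the combining marks that follow a LetterBase. The function name and the use of plain class-name strings are assumptions made for illustration; a real implementation would derive the classes from character properties.

```python
def conjoins_with_next(mark_classes):
    """Hypothetical helper: decide whether a letter-based combining sequence
    conjoins with the next one.

    mark_classes lists the class of each combining mark after the base, each
    being "Link", "NonEnclosingCombining", or "EnclosingCombining".
    """
    has_link = "Link" in mark_classes
    has_enclosing = "EnclosingCombining" in mark_classes
    # The positions of the marks are deliberately ignored: a Link anywhere
    # conjoins, and an EnclosingCombining anywhere overrides it.
    return has_link and not has_enclosing


print(conjoins_with_next(["NonEnclosingCombining", "Link"]))
print(conjoins_with_next(["Link", "NonEnclosingCombining"]))
print(conjoins_with_next(["Link", "EnclosingCombining"]))
print(conjoins_with_next(["EnclosingCombining", "Link"]))
```

Because only membership is tested, the result is the same for any canonical reordering of the marks.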