L2/01-322R

Default Grapheme Clusters

M. Davis, 2001-08-17

This is an update to the document L2/01-086.

Several organizations have asked the Consortium for a definitive definition of default grapheme cluster. We currently have two definitions of default grapheme clusters, one in Chapter 5 and one in UTR #24.

Note: default grapheme clusters were previously referred to as "locale-independent graphemes" in these documents. The term cluster has been added to emphasize that the term grapheme is used differently in linguistics. For simplicity and to align with UTS #10 Unicode Collation Algorithm, the terms "locale-independent" and "locale-dependent" have been changed to "default" and "tailored" respectively.

However, the data supporting a definition of default grapheme clusters is not present in the UCD, and there are slight variations between these two formulations. I propose that in the next version of Unicode, we

The following presents a proposed version:

Definition

A default grapheme cluster is defined by the following regular expression. Within a string, the bounds of a default grapheme cluster are determined by the longest string of characters that match this regular expression.

GraphemeCluster ::= GraphemeBase? ( GraphemeExtend | GraphemeLink Join_Control? GraphemeBase? )*

That is, a default grapheme cluster is formed from a base (if there is one), followed by zero or more continuations, where a continuation either is an extend or is a link plus optional base. A Join_Control (zero width joiner or zero width non-joiner) can also occur after a GraphemeLink. The definition captures all:

It also includes some cases where characters should have been characterized as combining, but for historical reasons are not, such as U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK. The definition is designed to be stable across canonical equivalence normalization (NFC and NFD).
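The longest-match rule in the definition can be illustrated with a short Python sketch. It is not a normative implementation. The function name, the category labels, and the tiny SAMPLE table are hypothetical and exist only for illustration; a real implementation would classify characters from the property data. The sketch also assumes that a character which cannot start or continue a cluster (for example a control character, or a Join_Control that does not follow a GraphemeLink) is returned as a one-character segment of its own, a case the regular expression itself leaves open.

```python
# Hypothetical category labels for the classes used in the regular expression.
BASE, EXTEND, LINK, JOIN, OTHER = "base", "extend", "link", "join", "other"

# Toy classification table (an assumption for this sketch only).
# Any character not listed here is treated as a GraphemeBase.
SAMPLE = {
    "\u0301": EXTEND,  # COMBINING ACUTE ACCENT
    "\u094D": LINK,    # DEVANAGARI SIGN VIRAMA
    "\u200C": JOIN,    # ZERO WIDTH NON-JOINER
    "\u200D": JOIN,    # ZERO WIDTH JOINER
    "\n": OTHER,       # control character
}


def classify(ch):
    return SAMPLE.get(ch, BASE)


def default_grapheme_clusters(text, classify):
    """Split text by greedily matching
    GraphemeBase? ( GraphemeExtend | GraphemeLink Join_Control? GraphemeBase? )*
    """
    clusters = []
    i, n = 0, len(text)
    while i < n:
        start = i
        # Optional leading base.
        if classify(text[i]) == BASE:
            i += 1
        # Zero or more continuations.
        while i < n:
            kind = classify(text[i])
            if kind == EXTEND:
                i += 1
            elif kind == LINK:
                i += 1
                if i < n and classify(text[i]) == JOIN:
                    i += 1
                if i < n and classify(text[i]) == BASE:
                    i += 1
            else:
                break
        # Assumption: a character matching nothing forms its own segment.
        if i == start:
            i += 1
        clusters.append(text[start:i])
    return clusters


# <e, COMBINING ACUTE, a> yields two clusters: base plus extend, then base.
print(default_grapheme_clusters("e\u0301a", classify))
# <KA, VIRAMA, ZWJ, SSA> stays together as one cluster.
print(len(default_grapheme_clusters("\u0915\u094D\u200D\u0937", classify)))
```

Because the character classes are disjoint, taking each optional element whenever it is present gives the longest match without backtracking.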

As with other definitions in Chapter 5 and elsewhere, such definitions are designed to be simple to implement. They need to provide an algorithmic determination of the valid, default grapheme clusters, and exclude sequences that are normally not considered default grapheme clusters. However, they do not have to catch edge cases that will not occur in practice. Mismatched sequences such as <DEVANAGARI KA, HANGUL JONGSEONG YEORINHIEUH, COMBINING ACUTE> may end up being characterized as a single default grapheme cluster, but it is not worth the extra complications in the definition that would be required to catch all of these cases, since they will not occur in practice.

As discussed in UTR #24 and elsewhere, the definition of default grapheme clusters is not meant to exclude the use of more sophisticated definitions of tailored grapheme clusters where appropriate: definitions that match more precisely the user expectations within individual languages. (For example, "ch" may be considered a grapheme cluster in Slovak.) It is, however, designed to provide a much more accurate match to overall user expectations for "characters" than is provided by individual Unicode code points.

Display of Grapheme Clusters. Default grapheme clusters do not necessarily reflect text display. For example, the sequence <f, i> may be displayed as a single glyph on the screen, but would still be two grapheme clusters.

Properties

# ================================================

# Binary Property

1160..11A2    ; Other_GraphemeExtend # Lo  [67] HANGUL JUNGSEONG FILLER..HANGUL JUNGSEONG SSANGARAEA
11A8..11F9    ; Other_GraphemeExtend # Lo  [82] HANGUL JONGSEONG KIYEOK..HANGUL JONGSEONG YEORINHIEUH
FF9E..FF9F    ; Other_GraphemeExtend # Lm   [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

# Total code points: 151

# ================================================

# Binary Property

094D          ; GraphemeLink # Mn       DEVANAGARI SIGN VIRAMA
09CD          ; GraphemeLink # Mn       BENGALI SIGN VIRAMA
0A4D          ; GraphemeLink # Mn       GURMUKHI SIGN VIRAMA
0ACD          ; GraphemeLink # Mn       GUJARATI SIGN VIRAMA
0B4D          ; GraphemeLink # Mn       ORIYA SIGN VIRAMA
0BCD          ; GraphemeLink # Mn       TAMIL SIGN VIRAMA
0C4D          ; GraphemeLink # Mn       TELUGU SIGN VIRAMA
0CCD          ; GraphemeLink # Mn       KANNADA SIGN VIRAMA
0D4D          ; GraphemeLink # Mn       MALAYALAM SIGN VIRAMA
0DCA          ; GraphemeLink # Mn       SINHALA SIGN AL-LAKUNA
0E3A          ; GraphemeLink # Mn       THAI CHARACTER PHINTHU
1039          ; GraphemeLink # Mn       MYANMAR SIGN VIRAMA
17D2          ; GraphemeLink # Mn       KHMER SIGN COENG

# Total code points: 13
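As an illustration of how the listings above could be consumed, the following Python sketch reads lines in this layout (a code point or range, a semicolon, a property name, then a "#" comment) into sets of code points. The function name parse_property_lines and the returned dictionary shape are assumptions made for this example, not part of the proposal.

```python
SAMPLE_DATA = """\
1160..11A2    ; Other_GraphemeExtend # Lo  [67] HANGUL JUNGSEONG FILLER..HANGUL JUNGSEONG SSANGARAEA
11A8..11F9    ; Other_GraphemeExtend # Lo  [82] HANGUL JONGSEONG KIYEOK..HANGUL JONGSEONG YEORINHIEUH
FF9E..FF9F    ; Other_GraphemeExtend # Lm   [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

# Total code points: 151

094D          ; GraphemeLink # Mn       DEVANAGARI SIGN VIRAMA
09CD          ; GraphemeLink # Mn       BENGALI SIGN VIRAMA
17D2          ; GraphemeLink # Mn       KHMER SIGN COENG
"""


def parse_property_lines(lines):
    """Map each property name to the set of code points listed for it."""
    props = {}
    for line in lines:
        # Everything after '#' is a comment; skip blank and comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        range_field, name = (field.strip() for field in data.split(";", 1))
        # A range is written FIRST..LAST; a single code point has no '..'.
        first, _, last = range_field.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        props.setdefault(name, set()).update(range(lo, hi + 1))
    return props


props = parse_property_lines(SAMPLE_DATA.splitlines())
# The count agrees with the "Total code points: 151" line in the listing.
print(len(props["Other_GraphemeExtend"]))
```

Only three of the thirteen GraphemeLink lines are included in the sample data, to keep the example short.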

Derived Properties

The following derived properties will be defined in terms of the general category property values and the above properties.

# GraphemeExtend := Me + Mn + Mc + Other_GraphemeExtend - GraphemeLink
# GraphemeBase := [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp - GraphemeLink - GraphemeExtend
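The two derivations can be sketched as Python predicates. This is only an illustration under stated assumptions: the formulas are read left to right, so GraphemeLink is subtracted from the whole union in the first line; the general category comes from Python's standard unicodedata module, which reflects whatever Unicode version ships with the interpreter rather than the version this proposal targets; and the function names are hypothetical.

```python
import unicodedata

# Explicit properties, copied from the listings in this document.
OTHER_GRAPHEME_EXTEND = (
    set(range(0x1160, 0x11A2 + 1))
    | set(range(0x11A8, 0x11F9 + 1))
    | {0xFF9E, 0xFF9F}
)
GRAPHEME_LINK = {
    0x094D, 0x09CD, 0x0A4D, 0x0ACD, 0x0B4D, 0x0BCD, 0x0C4D,
    0x0CCD, 0x0D4D, 0x0DCA, 0x0E3A, 0x1039, 0x17D2,
}

EXTEND_CATEGORIES = {"Me", "Mn", "Mc"}
NON_BASE_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}


def is_grapheme_link(cp):
    return cp in GRAPHEME_LINK


def is_grapheme_extend(cp):
    # GraphemeExtend := Me + Mn + Mc + Other_GraphemeExtend - GraphemeLink
    if cp in GRAPHEME_LINK:
        return False
    category = unicodedata.category(chr(cp))
    return category in EXTEND_CATEGORIES or cp in OTHER_GRAPHEME_EXTEND


def is_grapheme_base(cp):
    # GraphemeBase := [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp
    #                 - GraphemeLink - GraphemeExtend
    if unicodedata.category(chr(cp)) in NON_BASE_CATEGORIES:
        return False
    return not is_grapheme_link(cp) and not is_grapheme_extend(cp)


# U+1160 HANGUL JUNGSEONG FILLER has general category Lo, but it is an
# extend because it is listed in Other_GraphemeExtend.
print(is_grapheme_extend(0x1160), is_grapheme_base(0x1160))
```

Every GraphemeLink character listed above has general category Mn, so the subtraction in the first formula is what keeps the link and extend classes disjoint.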

Note: the general category abbreviations used above have the following meanings.

Zl Separator, Line
Zp Separator, Paragraph
Cc Other, Control
Cf Other, Format
Cs Other, Surrogate
Co Other, Private Use
Cn Other, Not Assigned (no characters in the file have this property)