M. Davis, 2000-08-15
This is an update to the document L2/01-086.
Several organizations have asked the consortium for an authoritative definition of a locale-independent grapheme cluster. We have two definitions of locale-independent grapheme clusters, one in Chapter 5 and one in UTR #24 (grapheme clusters are abbreviated as simply "graphemes" in these documents). However, the data supporting a definition of graphemes is not present in the UCD, and there are slight variations between these two formulations. I propose that in the next version of Unicode, we
The following presents a proposed version:
A locale-independent grapheme is defined by the following regular expression. Within a string, the bounds of a grapheme are determined by the longest string of characters that match this regular expression.
GraphemeCluster ::= GraphemeBase? ( GraphemeExtend | GraphemeLink GraphemeBase? )*
That is, a grapheme cluster is formed from a base (if there is one), followed by zero or more continuations, where a continuation either is an extend or is a link plus optional base. The definition captures all:
It also includes some cases where characters should have been characterized
as combining, but for historical reasons are not, such as U+FF9E HALFWIDTH
KATAKANA VOICED SOUND MARK. The definition is designed to be stable
across canonical equivalence normalization (NFC and NFD).
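The longest-match rule can be illustrated with a minimal sketch. It assumes each character has already been classified, and it reads the expression as the prose describes it: an optional base followed by zero or more continuations. The one-letter class codes and the name split_clusters are illustrative only and are not part of the proposal.

```python
import re

# One-letter class codes (illustrative, not part of the proposal):
#   B = GraphemeBase, E = GraphemeExtend, L = GraphemeLink, O = anything else
CLUSTER = re.compile(r"B?(?:E|LB?)*")

def split_clusters(classes):
    """Split a string of class codes into longest-match clusters."""
    out, i = [], 0
    while i < len(classes):
        m = CLUSTER.match(classes, i)
        # An empty match means the character is neither base, extend, nor
        # link; emit it as a one-character unit so the scan always advances.
        end = m.end() if m.end() > i else i + 1
        out.append(classes[i:end])
        i = end
    return out

if __name__ == "__main__":
    # base+extend, then base+link+base+extend, then a lone base
    print(split_clusters("BEBLBEB"))
```

Two adjacent bases always fall into separate clusters, because a cluster can take a second base only immediately after a link.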
As with other definitions in Chapter 5 and elsewhere, such definitions are
designed to be simple to implement. They need to provide an algorithmic
determination of the valid, locale-independent grapheme clusters, and exclude
sequences that are normally not considered grapheme clusters. However, they do not
have to catch edge cases that will not occur in practice. Mismatched sequences
such as <DEVANAGARI KA, HANGUL JONGSEONG YEORINHIEUH, COMBINING ACUTE>
may end up being characterized as a single grapheme, but it is
not worth the extra complications in the definition that would be required to
catch all of these cases, since they will not occur in practice.
As discussed in UTR #24 and elsewhere, the definition of locale-independent grapheme clusters is not meant to exclude the use of more sophisticated definitions of locale-dependent grapheme clusters where appropriate, that is, definitions that match user expectations within individual languages more precisely. It is, however, designed to match overall user expectations for "characters" much more accurately than individual Unicode code points do.
# ================================================
# Binary Property
1160..11A2 ; Other_GraphemeExtend # Lo [67] HANGUL JUNGSEONG FILLER..HANGUL JUNGSEONG SSANGARAEA
11A8..11F9 ; Other_GraphemeExtend # Lo [82] HANGUL JONGSEONG KIYEOK..HANGUL JONGSEONG YEORINHIEUH
FF9E..FF9F ; Other_GraphemeExtend # Lm [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
# Total code points: 151
# ================================================
# Binary Property
094D ; GraphemeLink # Mn DEVANAGARI SIGN VIRAMA
09CD ; GraphemeLink # Mn BENGALI SIGN VIRAMA
0A4D ; GraphemeLink # Mn GURMUKHI SIGN VIRAMA
0ACD ; GraphemeLink # Mn GUJARATI SIGN VIRAMA
0B4D ; GraphemeLink # Mn ORIYA SIGN VIRAMA
0BCD ; GraphemeLink # Mn TAMIL SIGN VIRAMA
0C4D ; GraphemeLink # Mn TELUGU SIGN VIRAMA
0CCD ; GraphemeLink # Mn KANNADA SIGN VIRAMA
0D4D ; GraphemeLink # Mn MALAYALAM SIGN VIRAMA
0DCA ; GraphemeLink # Mn SINHALA SIGN AL-LAKUNA
0E3A ; GraphemeLink # Mn THAI CHARACTER PHINTHU
1039 ; GraphemeLink # Mn MYANMAR SIGN VIRAMA
17D2 ; GraphemeLink # Mn KHMER SIGN COENG
# Total code points: 13
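A listing in this form is easy to load mechanically. The following is a rough sketch of a reader, assuming one "range ; property # comment" entry per line. The function name parse_properties and the SAMPLE text (a few entries copied from the listing) are illustrative only.

```python
def parse_properties(text):
    """Return {property name: set of code points} from
    'XXXX[..YYYY] ; Name # comment' lines."""
    props = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue                          # blank or comment-only line
        span, name = (field.strip() for field in data.split(";"))
        first, _, last = span.partition("..")
        lo, hi = int(first, 16), int(last or first, 16)
        props.setdefault(name, set()).update(range(lo, hi + 1))
    return props

# A few entries copied from the listing, for illustration.
SAMPLE = """\
# Binary Property
1160..11A2 ; Other_GraphemeExtend # Lo [67] HANGUL JUNGSEONG FILLER..HANGUL JUNGSEONG SSANGARAEA
11A8..11F9 ; Other_GraphemeExtend # Lo [82] HANGUL JONGSEONG KIYEOK..HANGUL JONGSEONG YEORINHIEUH
FF9E..FF9F ; Other_GraphemeExtend # Lm [2] HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
094D ; GraphemeLink # Mn DEVANAGARI SIGN VIRAMA
17D2 ; GraphemeLink # Mn KHMER SIGN COENG
"""

if __name__ == "__main__":
    props = parse_properties(SAMPLE)
    for name, points in sorted(props.items()):
        print(name, len(points))
```

Counting the parsed code points gives a quick check against the "Total code points" comments in the listing.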
The following properties will be defined in terms of the general category property values and the above properties.
# GraphemeExtend := M* + Other_GraphemeExtend - GraphemeLink
# GraphemeBase := [0..10FFFF] - C* - Z* - GraphemeLink - GraphemeExtend
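As a rough illustration, the derived properties and the cluster scan can be sketched together. This assumes Python's unicodedata module as the source of general categories; it reflects a later Unicode version than this proposal, so individual characters may be classified differently. The function names are illustrative, and the scan follows the prose reading of the expression (an optional base followed by continuations).

```python
import unicodedata

# Code points copied from the property listing above.
OTHER_EXTEND = (set(range(0x1160, 0x11A3)) | set(range(0x11A8, 0x11FA))
                | {0xFF9E, 0xFF9F})
LINK = {0x094D, 0x09CD, 0x0A4D, 0x0ACD, 0x0B4D, 0x0BCD, 0x0C4D,
        0x0CCD, 0x0D4D, 0x0DCA, 0x0E3A, 0x1039, 0x17D2}

def is_link(ch):
    return ord(ch) in LINK

def is_extend(ch):
    # GraphemeExtend := M* + Other_GraphemeExtend - GraphemeLink
    is_mark = unicodedata.category(ch).startswith("M")
    return (is_mark or ord(ch) in OTHER_EXTEND) and not is_link(ch)

def is_base(ch):
    # GraphemeBase := [0..10FFFF] - C* - Z* - GraphemeLink - GraphemeExtend
    cat = unicodedata.category(ch)
    return cat[0] not in "CZ" and not is_link(ch) and not is_extend(ch)

def clusters(text):
    """Split text into longest-match grapheme clusters."""
    out, i, n = [], 0, len(text)
    while i < n:
        j = i
        if is_base(text[j]):              # optional base
            j += 1
        while j < n:                      # zero or more continuations
            if is_extend(text[j]):
                j += 1
            elif is_link(text[j]):
                j += 1
                if j < n and is_base(text[j]):
                    j += 1                # link plus optional base
            else:
                break
        if j == i:                        # control, separator, unassigned
            j = i + 1
        out.append(text[i:j])
        i = j
    return out

if __name__ == "__main__":
    # <DEVANAGARI KA, HANGUL JONGSEONG YEORINHIEUH, COMBINING ACUTE>
    print(len(clusters("\u0915\u11F9\u0301")))
    # The NFC and NFD forms of e-acute give the same cluster count.
    for form in ("NFC", "NFD"):
        print(form, len(clusters(unicodedata.normalize(form, "e\u0301"))))
```

Under these rules the mismatched Devanagari and Hangul sequence mentioned earlier comes out as a single cluster, which is the accepted cost of keeping the definition simple.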