L2/15-183R

Title: Candidate characters for Grapheme_Cluster_Break=Prepend
Author: Roozbeh Pournader (Google)
Date: July 27, 2015
Action: For consideration by the UTC

UAX #29, Unicode Text Segmentation, defines a Prepend class of characters
(Grapheme_Cluster_Break=Prepend). The class presently has no members, although
it formerly contained some Southeast Asian characters.

The author proposes that the following characters be added to the class:


Group A: Subtending marks

U+0600 ARABIC NUMBER SIGN
U+0601 ARABIC SIGN SANAH
U+0602 ARABIC FOOTNOTE MARKER
U+0603 ARABIC SIGN SAFHA
U+0604 ARABIC SIGN SAMVAT
U+0605 ARABIC NUMBER MARK ABOVE
U+06DD ARABIC END OF AYAH
U+070F SYRIAC ABBREVIATION MARK
U+110BD KAITHI NUMBER SIGN

(The ARABIC SIYAQ NUMBER MARK, proposed in L2/15-074, would also fall into this
group.)


Group B: Indic cluster-initial consonants

U+0D4E MALAYALAM LETTER DOT REPH
U+111C2 SHARADA SIGN JIHVAMULIYA
U+111C3 SHARADA SIGN UPADHMANIYA

(These are all the characters with InSC=Consonant_Prefixed or
InSC=Consonant_Preceding_Repha. The UTC-approved Soyombo characters
U+11A84..11A87 SOYOMBO CLUSTER-INITIAL LETTER LA..SOYOMBO CLUSTER-INITIAL
LETTER RA would also fall into this group.)


Rationale: All of the characters above attach inseparably to the character or
characters immediately following them (typically by subtending or enclosing
them), so there should be no grapheme cluster break between any of them and the
character immediately after it. In this respect they resemble combining marks
such as U+20DD COMBINING ENCLOSING CIRCLE or U+0332 COMBINING LOW LINE, which
form a single grapheme cluster with a base character. The difference is that
the base character follows the characters listed above instead of preceding
them.
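
The effect described in the rationale can be illustrated with a minimal Python
sketch. It is not the UAX #29 algorithm: it assumes only two simplified rules
(no break after a candidate Prepend character, and no break before a
nonspacing or enclosing mark) and ignores every other rule, including the
handling of controls, Hangul, and line terminators. The names
PREPEND_CANDIDATES, is_prepend, is_extend_like, and simple_clusters are
hypothetical and exist only for this illustration.

```python
import unicodedata

# Candidate Prepend characters, copied from Groups A and B above.
PREPEND_CANDIDATES = frozenset([
    0x0600, 0x0601, 0x0602, 0x0603, 0x0604, 0x0605,  # Arabic subtending marks
    0x06DD, 0x070F, 0x110BD,                         # end of ayah, Syriac, Kaithi
    0x0D4E, 0x111C2, 0x111C3,                        # Group B
])


def is_prepend(ch):
    """True if ch is one of the candidate Prepend characters."""
    return ord(ch) in PREPEND_CANDIDATES


def is_extend_like(ch):
    """Rough stand-in for Grapheme_Cluster_Break=Extend (assumption):
    treat nonspacing (Mn) and enclosing (Me) marks as extending."""
    return unicodedata.category(ch) in ("Mn", "Me")


def simple_clusters(text):
    """Split text into clusters using only the two simplified rules."""
    clusters = []
    for ch in text:
        # Attach ch to the previous cluster if ch is a combining mark,
        # or if the previous character is a candidate Prepend character.
        if clusters and (is_extend_like(ch) or is_prepend(clusters[-1][-1])):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

Under these assumptions, simple_clusters("\u0600\u0661\u0662") keeps U+0600
ARABIC NUMBER SIGN together with the digit that follows it and starts a new
cluster at the second digit, just as simple_clusters("a\u0332b") keeps
U+0332 COMBINING LOW LINE together with the preceding "a".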