L2/11-282

Title: Grapheme_Base and Grapheme_Extend Documentation in UAX #44
Source: Ken Whistler
Date: July 20, 2011
Action: For consideration by the UTC

Background

Recently a reviewer turned up an anomaly in the documentation of Grapheme_Base and Grapheme_Extend in the main property table in UAX #44. The description field for those two properties says:

   "For more information, see Unicode Standard Annex #29, Unicode Text Segmentation..."

However, neither of those properties is mentioned at all in UAX #29.

The anomaly is a longstanding one. The original intent of Grapheme_Base and Grapheme_Extend was to participate in the segmentation rules for grapheme clusters. They were mentioned in early drafts of what became UAX #29, but they were gone from the text by the time of the first approved version of UAX #29 for Unicode 4.0. Their function for UAX #29 was replaced by the Grapheme_Cluster_Break property.

One possible approach to fixing this anomaly would be to deprecate Grapheme_Base and Grapheme_Extend, and then work through all the textual changes that would be required in UAX #44 and elsewhere. The problem with that approach is that in the meantime, Grapheme_Base and Grapheme_Extend were picked up and used as part of the normative definitions for grapheme base and grapheme extender in Chapter 3. That means that the textual changes would be rather substantial, and would impact text that was carefully constructed as recently as Unicode 5.1 for Chapter 3.

I advocate a simpler and less disruptive approach to correcting the anomaly in the documentation of these two properties in UAX #44.

Proposal

1. In UAX #44, change the status of both Grapheme_Base and Grapheme_Extend from informative to normative. This change reflects the fact that both of these properties are currently used in formal definitions in Chapter 3.

2. Change the description in UAX #44 for Grapheme_Base to something like:

   Property used in the definition of "Grapheme base".
   See D58 in Chapter 3.

3. Change the description in UAX #44 for Grapheme_Extend to something like:

   Property used in the definition of "Grapheme extender". See D59 in Chapter 3. Note that the set of characters for which Grapheme_Extend=Yes is equivalent to the set of characters for which Grapheme_Cluster_Break=Extend.

   [And remove the Note about the treatment of gc=Co, which properly belongs in UAX #29, and not here.]

Additional Note

This proposal would only affect the two table entries for Grapheme_Base and Grapheme_Extend in UAX #44, and would not affect Chapter 3 or the UCD property files at all.

Mark has separately suggested that the set of characters defined as Grapheme_Cluster_Break=Control could be adjusted somewhat to deal with the aberrant edge cases, and to let Grapheme_Cluster_Break align better with the set of characters that are Grapheme_Base=Yes. That is a separate suggestion, which can be taken up independently of this proposed fix for the UAX #44 text.
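
The set equivalence mentioned in the proposed description for Grapheme_Extend can be checked mechanically against the UCD data files. The following is a minimal sketch, not part of the proposal. It assumes that the Grapheme_Extend=Yes code points are taken from DerivedCoreProperties.txt and that the Extend value of Grapheme_Cluster_Break is taken from GraphemeBreakProperty.txt. The function names code_points_with_value and compare_sets are hypothetical helpers introduced here for illustration. Whether the two sets coincide for a given version of the UCD is exactly what the check reports.

```python
import re

# Data lines in UCD property files look like
#   "0300..036F    ; Grapheme_Extend # Mn ..."   or   "200C ; Extend # Cf ..."
# Everything after '#' is a comment.
_DATA_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)"
)


def code_points_with_value(text, value):
    """Return the set of code points that a UCD-style file assigns to `value`.

    `text` is the full content of the file; `value` is the property name or
    property value that appears in the second field of each data line.
    """
    result = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        match = _DATA_LINE.match(line)
        if match is None or match.group(3) != value:
            continue
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        result.update(range(first, last + 1))
    return result


def compare_sets(core_props_text, break_props_text):
    """Compare Grapheme_Extend=Yes with Grapheme_Cluster_Break=Extend.

    Returns a pair of sets: code points found only in the first set, and
    code points found only in the second. Both are empty when the two
    sets are equivalent.
    """
    grapheme_extend = code_points_with_value(core_props_text, "Grapheme_Extend")
    gcb_extend = code_points_with_value(break_props_text, "Extend")
    return grapheme_extend - gcb_extend, gcb_extend - grapheme_extend
```

Given local copies of the two files, passing their contents to compare_sets yields the code points on which the two properties disagree, if any.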