L2/11-003

Title: An Alias for Mark
Source: Ken Whistler
Date: January 11, 2011
Action: For review by UTC

Background

There is a small inconsistency between the list of combined General Category aliases that we have formally defined in PropertyValueAliases.txt and their actual usage in other contexts.

The combined aliases do not represent primitive property values for General Category, but instead certain useful unions of those property values. Most are single-character aliases for the major character categories that General Category subdivides into further subtypes. The exact list of these currently defined in PropertyValueAliases.txt is:

gc ; C  ; Other               # Cc | Cf | Cn | Co | Cs
gc ; L  ; Letter              # Ll | Lm | Lo | Lt | Lu
gc ; LC ; Cased_Letter        # Ll | Lt | Lu
gc ; M  ; Mark                # Mc | Me | Mn
gc ; N  ; Number              # Nd | Nl | No
gc ; P  ; Punctuation ; punct # Pc | Pd | Pe | Pf | Pi | Po | Ps
gc ; S  ; Symbol              # Sc | Sk | Sm | So
gc ; Z  ; Separator           # Zl | Zp | Zs

The problem arises from the fact that for gc=M, our usual practice is to refer to "Combining Mark", rather than just "Mark". Ordinarily this would be just a curiosity, but there is a formal issue related to our statement of normative definitions in Chapter 3 of the standard. Section 3.6 Combination, D52 defines "combining character" as follows:

D52 Combining character: A character with the General Category of Combining Mark (M).

And D50 and D51 use similar wording.

The problem is that we don't actually define "Combining Mark" as a formal value for General Category anywhere -- all we define is the alias "Mark".

Rather than touch the normative definitions in Chapter 3 to fix this, I think the easiest and most obvious fix is to simply add "Combining Mark" as another alias for gc=M in PropertyValueAliases.txt.

There is an additional reason to do this now, because the IDNA documentation is somewhat sloppy about its use of terms "combining character" and "combining mark", creating some confusion for IDNA implementers.
The fact that our own formal definition of "combining character" uses "combining mark" as an undefined term hasn't been helping.

Proposal

Add "Combining Mark" as a second alias for gc=M in PropertyValueAliases.txt for Unicode 6.1.

Additional Issue for UAX #44

The set of combined values for General Category is not adequately documented anywhere outside of PropertyValueAliases.txt itself. Nor is it clear to everyone that their status is distinct from the basic General Category values, since they represent convenient aggregations of unions of values, rather than simple values on their own.

To address this I suggest that for Unicode 6.1 we add relevant documentation to UAX #44, in Section 5.7.1 General Category Values, explaining the distinction clearly, and then adding an explicit table of the combined values that are actually defined in PropertyValueAliases.txt. That will make it easier for implementers of regex engines and the like to understand these values.
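
For implementers, the distinction between combined and basic values can be illustrated with a minimal sketch in Python. It treats each combined value as a union of the primitive General Category values listed in the Background, and resolves long aliases to the short ones. The names COMBINED_GC, ALIASES, _loose, and has_combined_gc are hypothetical, introduced only for this illustration. The underscore spelling "Combining_Mark" is an assumption that follows the style of "Cased_Letter"; it stands for the alias proposed here, not one already defined. The alias matching is a simplification, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Combined General_Category values as unions of primitive values,
# transcribed from the PropertyValueAliases.txt listing in the Background.
COMBINED_GC = {
    "C": {"Cc", "Cf", "Cn", "Co", "Cs"},
    "L": {"Ll", "Lm", "Lo", "Lt", "Lu"},
    "LC": {"Ll", "Lt", "Lu"},
    "M": {"Mc", "Me", "Mn"},
    "N": {"Nd", "Nl", "No"},
    "P": {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"},
    "S": {"Sc", "Sk", "Sm", "So"},
    "Z": {"Zl", "Zp", "Zs"},
}


def _loose(name: str) -> str:
    """Simplified alias matching: ignore case, spaces, underscores, hyphens."""
    return "".join(ch for ch in name.lower() if ch not in " _-")


# Long aliases mapped to their short combined value.
_LONG = {
    "Other": "C",
    "Letter": "L",
    "Cased_Letter": "LC",
    "Mark": "M",
    "Combining_Mark": "M",  # assumption: the alias proposed in this document
    "Number": "N",
    "Punctuation": "P",
    "punct": "P",
    "Symbol": "S",
    "Separator": "Z",
}

# Lookup table keyed by loosely matched name; short names map to themselves.
ALIASES = {_loose(k): v for k, v in _LONG.items()}
ALIASES.update({_loose(short): short for short in COMBINED_GC})


def has_combined_gc(ch: str, alias: str) -> bool:
    """Return True if the character's primitive General_Category value
    belongs to the union named by the combined alias."""
    short = ALIASES[_loose(alias)]  # raises KeyError for an unknown alias
    return unicodedata.category(ch) in COMBINED_GC[short]
```

With this sketch, has_combined_gc("\u0301", "Mark") and has_combined_gc("\u0301", "Combining Mark") give the same answer, because both aliases resolve to the same union Mc | Me | Mn. No character ever has "M" itself as its General Category value, which is the distinction the proposed UAX #44 text would make explicit.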