L2/08-431

Title: General_Category Change for U+2071, U+207F
Date: December 17, 2008
Source: Ken Whistler
Action: For consideration by UTC/L2

Background

In recent unicore discussion about a potential proposal for encoding a modifier letter i and a modifier letter n, it became clear that the existing characters:

    U+2071 SUPERSCRIPT LATIN SMALL LETTER I    age=3.2, gc=Ll
    U+207F SUPERSCRIPT LATIN SMALL LETTER N    age=1.1, gc=Ll

were also intended as the modifier letters in question. In fact, this issue was raised and discussed by the UTC a number of years ago, both when U+2071 was encoded for Unicode 3.2 and when the phonetic extensions for UPA were encoded for Unicode 4.0. So it is clear to everyone that no separate encoding of a modifier letter i or a modifier letter n is warranted.

However, there is an inconsistency related to the General_Category for these two characters. These two Latin superscripted characters (used as modifier letters as well as compatibility characters for old character sets, where they had marginal use in math notation) have gc=Ll, whereas all *other* Latin superscripted characters used as modifier letters have gc=Lm.

Furthermore, the general position in the past has been to consider the term "modifier letter" as specifically applying to gc=Lm characters, regardless of their exact Unicode names. For example, there are modifier letters that don't have "MODIFIER LETTER" in their names, and indeed all the subscripted Latin modifier letters are named "LATIN SUBSCRIPT SMALL LETTER ...". The inconsistency in General_Category assignment, on the other hand, applies only to the two particular examples named "SUPERSCRIPT LATIN SMALL LETTER ...", instead of "MODIFIER LETTER ...".

As of Unicode 5.1, the text in the standard about modifier letters has been updated to make it clearer that a character doesn't have to be called "MODIFIER LETTER" to be a modifier letter, but the alternative is then basically to depend on the General_Category=Lm value to define them.
This doesn't work in the case of U+2071 and U+207F, which are thus the two exceptions that stick out. This is a very longstanding inconsistency, for U+207F, in particular, dating all the way back to the Unicode 2.0 data files.

Proposal

I suggest that for Unicode 5.2, the UTC update the General_Category values of U+2071 and U+207F:

    gc=Ll --> Lm

Analysis

This change affects a couple of very longstanding General_Category assignments. However, there don't seem to be any derived property values that would be disrupted by this change. Also, since neither U+2071 nor U+207F has ever had any case mappings, this does not impact casing, case mapping or case folding in any way. Both characters would continue to be Lowercase=True, regardless of the General_Category change. So it doesn't seem to violate any stability guarantee, while marginally improving the consistency of the modifier letter category.

Argument For

The primary reason for doing this would be to align the functional category of modifier letters better with the formal property value gc=Lm. This would remove any doubt or confusion about the status of U+2071 and U+207F as usable in phonetic orthographic contexts alongside other modifier letters, and would enable adding appropriate annotations in the text and names list to better identify these characters as to their intended use.

Argument Against

Let sleeping dogs lie. In the case of U+207F the General_Category assignment has been in place for 12 years now, without causing implementation problems. We could just annotate our way around the modifier letter inconsistency instead of actually changing any properties. This would be simpler to do, and keeping General_Category property values stable may be more important than continuing to jigger them to try to make them all consistent.
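
The property claims made in the Analysis can be spot-checked against whatever version of the Unicode Character Database an implementation ships. The following is a minimal sketch in Python, not part of the proposal itself: the names CODE_POINTS and describe are made up for illustration, str.islower() is used only as a rough stand-in for the Lowercase property, and the reported gc value depends on the Unicode version bundled with the interpreter (gc=Ll for data that predates the proposed change, gc=Lm for data that includes it).

```python
import unicodedata

# The two characters discussed in this document (illustrative constant name).
CODE_POINTS = (0x2071, 0x207F)


def describe(cp):
    """Collect the properties of one code point that the Analysis refers to.

    Hypothetical helper for illustration only.
    """
    ch = chr(cp)
    # Every case operation str exposes; if all of them return the character
    # unchanged, no case mapping or case folding applies to it.
    mapped = {ch.upper(), ch.lower(), ch.title(), ch.casefold()}
    return {
        "code_point": f"U+{cp:04X}",
        "name": unicodedata.name(ch),
        # General_Category as recorded in the bundled Unicode data.
        "gc": unicodedata.category(ch),
        "has_case_mapping": mapped != {ch},
        # Assumption: str.islower() approximates Lowercase=True; how closely
        # it tracks the property depends on the Python version.
        "islower": ch.islower(),
    }


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for cp in CODE_POINTS:
        row = describe(cp)
        print(row["code_point"], row["name"],
              "gc=" + row["gc"],
              "has_case_mapping=" + str(row["has_case_mapping"]),
              "islower=" + str(row["islower"]))
```

Run against two different Unicode data versions, a check of this kind should show the gc value as the only difference between them, with the case-mapping result unchanged, which is the behavior the Analysis predicts.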