L2/08-431
                                          

Title: General_Category Change for U+2071, U+207F
Date:  December 17, 2008
Source: Ken Whistler
Action: For consideration by UTC/L2


Background

In recent unicore discussion about a potential proposal for
encoding a modifier letter i and a modifier letter n,
it became clear that the existing characters:

U+2071 SUPERSCRIPT LATIN SMALL LETTER I

       age=3.2, gc=Ll
       
U+207F SUPERSCRIPT LATIN SMALL LETTER N

       age=1.1, gc=Ll
       
were also intended as the modifier letters in question.
In fact, this issue was raised and discussed by the UTC a
number of years ago, both when U+2071 was encoded for
Unicode 3.2 and when the phonetic extensions for UPA were
encoded for Unicode 4.0.

It is therefore clear that no separate encoding of a
modifier letter i or a modifier letter n is warranted.

However, there is an inconsistency related to the
General_Category for these two characters. These two
Latin superscripted characters (used as modifier letters
as well as compatibility characters for old character
sets, where they had marginal use in math notation)
have gc=Ll, whereas all *other* Latin superscripted characters
used as modifier letters have gc=Lm.
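
The inconsistency described above can be surveyed mechanically.
The following is a minimal sketch in Python, assuming the
standard unicodedata module. It approximates "Latin superscripted
characters" as those whose compatibility decomposition is <super>
plus a single ASCII lowercase letter; that filter is an assumption
made here for illustration, and it may also pick up characters
that are not modifier letters, such as the ordinal indicators.
The function name latin_superscript_letters is hypothetical. The
output reflects whatever Unicode version the Python build carries,
so it can differ from the Unicode 5.1 values discussed here.

```python
import sys
import unicodedata
from collections import defaultdict

def latin_superscript_letters():
    """Group characters that decompose to <super> + one ASCII
    lowercase letter by their General_Category value."""
    groups = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        # decomposition() returns e.g. "<super> 0069", or "" if none
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[0] == "<super>":
            base = chr(int(parts[1], 16))
            if "a" <= base <= "z":
                groups[unicodedata.category(chr(cp))].append(cp)
    return dict(groups)

# One line per General_Category value found by the filter
for gc, cps in sorted(latin_superscript_letters().items()):
    print(gc, " ".join(f"U+{cp:04X}" for cp in cps))
```

Any General_Category value other than Lm in the output marks a
character worth checking against the discussion above.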

Furthermore, the general position in the past has been
to consider the term "modifier letter" as specifically
applying to gc=Lm characters, regardless of their exact
Unicode names. For example, there are modifier letters
that don't have "MODIFIER LETTER" in their names, and
indeed all the subscripted Latin modifier letters are
named "LATIN SUBSCRIPT SMALL LETTER ...". The inconsistency
in General_Category assignment, on the other hand, applies
only to the two particular examples named "SUPERSCRIPT LATIN
SMALL LETTER ...", instead of "MODIFIER LETTER ...".

As of Unicode 5.1, the text in the standard about modifier
letters has been updated to make it clearer that
a character doesn't have to be named "MODIFIER LETTER" to
be a modifier letter. The alternative, then, is essentially
to rely on the General_Category=Lm value to define them.
That doesn't work in the case of U+2071 and U+207F,
which are thus the two exceptions that stick out.

This is a very longstanding inconsistency. For U+207F
in particular, it dates all the way back to the Unicode 2.0
data files.


Proposal

I suggest that for Unicode 5.2, the UTC update the General_Category
values of U+2071 and U+207F: gc=Ll --> Lm


Analysis

This change affects a couple of very longstanding General_Category
assignments. However, there don't seem to be any derived
property values that would be disrupted by this change.
Also, since neither U+2071 nor U+207F has ever had any
case mappings, this does not impact casing, case mapping or
case folding in any way. Both characters would continue to
be Lowercase=True, regardless of the General_Category change.
So it doesn't seem to violate any stability guarantee,
while marginally improving the consistency of the modifier
letter category.
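
The casing claim above can be spot-checked against an
implementation's character database. The following is a minimal
sketch in Python, assuming the standard unicodedata module and the
built-in str case operations; the helper name case_profile is
hypothetical. The reported General_Category depends on the Unicode
version the interpreter was built with, so it may show Ll or Lm.

```python
import unicodedata

def case_profile(ch):
    """Return the name and General_Category of ch, and whether
    any built-in case operation changes it."""
    case_ops = (str.upper, str.lower, str.title, str.casefold)
    return {
        "name": unicodedata.name(ch),
        "gc": unicodedata.category(ch),
        "has_case_mapping": any(op(ch) != ch for op in case_ops),
    }

for ch in ("\u2071", "\u207F"):
    print(f"U+{ord(ch):04X}", case_profile(ch))
```

If has_case_mapping is False for both characters, the case
operations are unaffected by which General_Category value the
database carries for them.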


Argument For

The primary reason for doing this would be to align the
functional category of modifier letters better with the
formal property value gc=Lm. This would remove any doubt
or confusion about the status of U+2071 and U+207F as
usable in phonetic orthographic contexts alongside other
modifier letters, and would enable adding appropriate
annotations in the text and names list to better
identify these characters as to their intended use.


Argument Against

Let sleeping dogs lie. In the case of U+207F the General_Category
assignment has been in place for 12 years now, without
causing implementation problems. We could just annotate
our way around the modifier letter inconsistency instead
of actually changing any properties. This would be
simpler to do, and keeping General_Category property
values stable may be more important than continuing to
jigger them to try to make them all consistent.