L2/08-157 - Additional Derived Properties

L2/08-157

From: Mark Davis
Date: Mon, Apr 14, 2008
Subject: Additional Derived Properties

This is a proposal for additional derived properties, as described below.

Background

A. We had a problem at the very last minute in the U5.1 release, one that happened to be caught by some tests in ICU. The case-ignorable property defined in D121 broke because it depends on the Word_Break property value MidLetter, which was split into two values (MidLetter and MidNumLet) in order to handle numbers. The definition was fixed to read:

D121 A character C is defined to be case-ignorable if C has the value MidLetter or the value MidNumLet for the Word_Break property or its General_Category is one of Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf), Modifier_Letter (Lm), or Modifier_Symbol (Sk).

This exposes a more general issue. When we break a property value into two, we have to check any derived properties to see whether the new value should be added in. Normally, this gets handled in the review of the properties -- since a change in derived properties shows up in the data tables -- but case-ignorable is not part of the UCD, and thus was not checked. Most other definitions in Chapter 3 that could correspond to a single formal UCD property or single property value are defined as such (eg LVT). The exceptions are:

D51 Base character: Any graphic character except for those with the General Category of
Combining Mark (M).

D53 Nonspacing mark: A combining character with the General Category of Nonspacing
Mark (Mn) or Enclosing Mark (Me).

D105 Fixed position class: A subset of the range of numeric values for combining classes --
specifically, any value in the range 10..199.

For the above, we don't need to do anything.

D120 A character C is defined to be cased if and only if C has the Lowercase or Uppercase
property or has a General_Category value of Titlecase_Letter.

D121 (above)

For these two, we do need a derived property.
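
As a rough illustration of D120 and D121, here is a minimal Python sketch. It assumes the caller supplies the Lowercase, Uppercase and Word_Break data as plain sets, since Python's unicodedata module exposes General_Category but not those properties. The function names and the one-character sample sets are hypothetical stand-ins for illustration, not UCD data.

```python
import unicodedata

# General_Category values listed in D121.
CASE_IGNORABLE_GC = {"Mn", "Me", "Cf", "Lm", "Sk"}

# Hypothetical one-character excerpts, assumed for illustration only.
# Real data would be read from WordBreakProperty.txt and
# DerivedCoreProperties.txt in the UCD.
MID_LETTER = {"\u00B7"}   # MIDDLE DOT
MID_NUM_LET = {"\u0027"}  # APOSTROPHE
LOWERCASE = {"a"}
UPPERCASE = {"A"}


def is_case_ignorable(ch, word_break_chars):
    """D121: Word_Break is MidLetter or MidNumLet, or gc is Mn, Me, Cf, Lm or Sk."""
    return ch in word_break_chars or unicodedata.category(ch) in CASE_IGNORABLE_GC


def is_cased(ch, lowercase, uppercase):
    """D120: has Lowercase or Uppercase, or gc is Titlecase_Letter (Lt)."""
    return ch in lowercase or ch in uppercase or unicodedata.category(ch) == "Lt"


# With both Word_Break values the apostrophe is case-ignorable ...
print(is_case_ignorable("\u0027", MID_LETTER | MID_NUM_LET))
# ... but a definition that names only MidLetter silently drops it.
print(is_case_ignorable("\u0027", MID_LETTER))
```

The last two calls mimic the failure described in A: a definition written against one property value stops matching characters once that value is split, and nothing flags it unless the derived set is itself checked.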

B. There is one related issue. We define various string functions that can also be meaningfully and usefully applied to single code points. These include isNFC (and the other normalization forms) and the casing forms below, in which Y denotes NFD(X):

D124 isLowercase(X): isLowercase(X) is true when toLowercase(Y) = Y.
• For example, isLowercase("combining mark") is true, and
isLowercase("Combining mark") is false.

D125 isUppercase(X): isUppercase(X) is true when toUppercase(Y) = Y.
• For example, isUppercase("COMBINING MARK") is true, and
isUppercase("Combining mark") is false.

D126 isTitlecase(X): isTitlecase(X) is true when toTitlecase(Y) = Y.
• For example, isTitlecase("Combining Mark") is true, and
isTitlecase("Combining mark") is false.

D127 isCasefolded(X): isCasefolded(X) is true when toCasefold(Y) = Y.
• For example, isCasefolded("heiss") is true, and isCasefolded("heiß") is false.

D128 isCased(X): isCased(X) is true when isLowercase(X) is false, or isUppercase(X) is false, or
isTitlecase(X) is false.
• Any string that is not isCased consists entirely of characters that case map
only to themselves.
• For example, isCased("abc") is true, and isCased("123") is false.
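
The definitions D124 through D128 can be approximated in Python as follows. This is only a sketch under stated assumptions: Y is taken to be NFD(X), Python's built-in str.lower, str.upper and str.casefold stand in for toLowercase, toUppercase and toCasefold, and str.title stands in for toTitlecase even though its word-boundary rules are simpler than the Unicode ones. The function names are chosen here for illustration.

```python
import unicodedata


def _nfd(x):
    # Assumption: Y in D124..D127 is the NFD form of X.
    return unicodedata.normalize("NFD", x)


def is_lowercase(x):
    y = _nfd(x)
    return y.lower() == y


def is_uppercase(x):
    y = _nfd(x)
    return y.upper() == y


def is_titlecase(x):
    # str.title() only approximates toTitlecase; it does not use
    # Unicode word boundaries.
    y = _nfd(x)
    return y.title() == y


def is_casefolded(x):
    y = _nfd(x)
    return y.casefold() == y


def is_cased(x):
    # D128: true when at least one of the three tests is false.
    return not (is_lowercase(x) and is_uppercase(x) and is_titlecase(x))


print(is_lowercase("combining mark"), is_lowercase("Combining mark"))
print(is_casefolded("heiss"), is_casefolded("hei\u00DF"))
print(is_cased("abc"), is_cased("123"))
```

Passing a one-character string gives the single code point reading that the rest of this section relies on.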

Various protocols (like IDNA) need to refer to the values of these, to the normalization properties (isNFC, isNFD, isNFKC, isNFKD), and to combinations of normalization with folding, like isNFKCAndCasefolded(x). The isNFx properties are already in the UCD in the form of the Quick_Check values, although their naming obscures that: for a single code point, isNFC is expressed as NFC_Quick_Check != No rather than as NFC_Quick_Check = Yes (the two differ because of the Maybe value).
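
To make the single code point reading of isNFx concrete, here is a small Python sketch. Python's unicodedata module does not expose the Quick_Check values, so the sketch normalizes the code point and compares; the helper name is hypothetical, and the results reflect whatever Unicode version the Python build ships with.

```python
import unicodedata


def is_normalized_code_point(form, cp):
    """True if the single code point cp is unchanged by the given form.

    form is one of "NFC", "NFD", "NFKC", "NFKD".
    """
    s = chr(cp)
    return unicodedata.normalize(form, s) == s


# U+00E9 (e with acute) is already composed: NFC but not NFD.
print(is_normalized_code_point("NFC", 0x00E9))
print(is_normalized_code_point("NFD", 0x00E9))

# U+0301 (combining acute) has NFC_Quick_Check = Maybe, yet taken
# on its own it is unchanged by NFC.
print(is_normalized_code_point("NFC", 0x0301))

# U+212B (angstrom sign) always normalizes to U+00C5.
print(is_normalized_code_point("NFC", 0x212B))
```

The U+0301 case is the one where testing for a Quick_Check value of Yes would give a different answer from testing for a value other than No.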

Note that we went to some effort in U5.1 to clarify that a Lowercase property value of true is not the same as isLowercase(x) being true (that is, toLowercase(x) = x). The closest approximation for Lowercase would be isLowercase & isCased, but that also differs by hundreds of characters. See "Clarification of Lowercase and Uppercase" in http://www.unicode.org/versions/Unicode5.1.0/

Proposal

My strawman proposal is to add derived properties for the following.

Cased (D120)
Case_Ignorable (D121)

Operationally_Lowercased (D124: cp = toLowercase(cp))
Operationally_Uppercased (D125...)
Operationally_Titlecased (D126...)
Operationally_Casefolded (D127...)
Operationally_Cased (D128...)
NFKC_And_Casefolded (cp = toNFKC(toCasefold(cp)))

The names are provisional - it might be better to give the second group names that indicate that a code point is unaffected by the operation, but I couldn't think of pithy names for that. Suggestions are welcome. Note also that these are formally code point properties -- they are true of some code points that are not characters. Their inverses (eg, /Not_Operationally_Lowercased/) are character properties: only true of encoded characters.

While it is possible to have different combinations of normalization with case folding, we only need the one above for now.
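
For reference, the second group of proposed properties might be computed per code point along the following lines. This is a sketch only: Python's built-in case operations stand in for toLowercase and toCasefold, the helper names are hypothetical, and the formulas follow the shorthand in the list above (cp compared against the result of the operation).

```python
import unicodedata


def unchanged_by(operation, cp):
    """True if applying operation to the code point cp returns it unchanged."""
    s = chr(cp)
    return operation(s) == s


def nfkc_casefold(s):
    # Order follows the proposal's formula: case fold first, then NFKC.
    return unicodedata.normalize("NFKC", s.casefold())


def operationally_lowercased(cp):
    return unchanged_by(str.lower, cp)


def operationally_casefolded(cp):
    return unchanged_by(str.casefold, cp)


def nfkc_and_casefolded(cp):
    return unchanged_by(nfkc_casefold, cp)


print(operationally_lowercased(ord("a")), operationally_lowercased(ord("A")))
print(nfkc_and_casefolded(ord("a")), nfkc_and_casefolded(0x00DF))

# A noncharacter such as U+FFFF is also unchanged, which is why these
# are code point properties rather than character properties.
print(operationally_lowercased(0xFFFF))
```

The remaining Operationally_* properties would follow the same pattern with the corresponding operation passed to unchanged_by.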