L2/09-248




Subject: Issues with the new casing-related properties in the beta
	 (DerivedCoreProperties.txt)
From: Asmus Freytag with input from Ken Whistler and Mark Davis
Date: July 13, 2009
Replaces: L2/09-238

For consideration by UTC

This is a proposal to rename several character properties introduced in
the 5.2.0 beta and to amend the names of string properties in TUS.
This is a refinement of the proposal in document L2/09-238.

1. Fix naming of several string properties

Instead of the existing names for string properties defined in chapter 3:
   isLowerCase()
   isUpperCase()
   isTitleCase()
   isCasefolded()

use these names:
   isLowerCaseString()
   isUpperCaseString()
   isTitleCaseString()
   isCaseFoldedString()

This change makes the nature of these properties as *string* properties obvious.
This is important because, outside the context of the specific section in
chapter 3 of the Standard, users are likely to be confused about whether strings
or characters are intended.

API names such as "isLowerCase" commonly exist in implementations, and they are
not limited to string properties.
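
As an illustration of how an existing API name can differ from the chapter 3
string property, the following Python sketch compares the built-in
str.islower() with a string test written in the chapter 3 style. The function
name is_lowercase_string is hypothetical, and Python's own case mappings
(which follow whatever Unicode version the interpreter ships) stand in for
the default case operations.

```python
import unicodedata

def is_lowercase_string(s: str) -> bool:
    """Hypothetical chapter 3 style test: lowercasing NFD(s) leaves it unchanged."""
    nfd = unicodedata.normalize("NFD", s)
    return nfd.lower() == nfd

# str.islower() also requires at least one cased character, so the two
# tests disagree on strings that contain no cased characters at all.
for s in ("mark davis?", "Mark", "2"):
    print(repr(s), is_lowercase_string(s), s.islower())
```

The two tests agree on "mark davis?" and "Mark" but not on "2", which is the
kind of mismatch an unqualified name like "isLowerCase" invites.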

2. Fix naming of several new character properties

Instead of the Boolean character properties in DerivedCoreProperties.txt

   IsLowerCase
   IsUpperCase
   IsTitleCase
   IsCasefolded

and listing only those values for which these properties are FALSE, as is done
in the BETA, use the corresponding properties

    Changes_When_Lowercased
    Changes_When_Uppercased
    Changes_When_Titlecased
    Changes_When_Casefolded

and list the values for which they are TRUE.

The advantage of using these names is that they capture the selection criteria
succinctly, making them readily and unambiguously understandable, even without
immediate access to the definitions in chapter 3. It is also directly apparent
how these properties work as *character* properties.

They are defined in the sense of "Changes_When" to keep the listing small while
sticking to the convention of listing only "true" values for Booleans in the data
files. (The actual ranges of listed values would not change.)
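
The convention of listing only "true" values can be sketched in Python as
follows. The sample lines are illustrative, written in the usual UCD data-file
layout rather than copied from the beta file, and the helper names
parse_true_ranges and has_property are hypothetical. Any code point not
covered by a listed range is treated as FALSE.

```python
SAMPLE = """\
# Illustrative lines in the UCD data-file layout (not an excerpt of the beta file)
0041..005A    ; Changes_When_Lowercased # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Changes_When_Uppercased # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00B5          ; Changes_When_Uppercased # L&       MICRO SIGN
"""

def parse_true_ranges(text):
    """Map each property name to the list of (first, last) ranges listed as TRUE."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        code_field, prop = (field.strip() for field in line.split(";"))
        first, _, last = code_field.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo     # single code point or range
        table.setdefault(prop, []).append((lo, hi))
    return table

def has_property(table, prop, ch):
    """TRUE only if the code point falls in a listed range; FALSE otherwise."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in table.get(prop, []))

table = parse_true_ranges(SAMPLE)
print(has_property(table, "Changes_When_Lowercased", "M"))
print(has_property(table, "Changes_When_Lowercased", "2"))
```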

If these changes are adopted, they work out as follows for these examples:

For "M", Lowercase=F, Uppercase=T,
         Changes_When_Lowercased=T, Changes_When_Uppercased=F,
         Changes_When_Titlecased=F

For "m", Lowercase=T, Uppercase=F,
         Changes_When_Lowercased=F, Changes_When_Uppercased=T,
         Changes_When_Titlecased=T

For "ǉ", Lowercase=T, Uppercase=F,
         Changes_When_Lowercased=F, Changes_When_Uppercased=T,
         Changes_When_Titlecased=T

For "2", Lowercase=F, Uppercase=F,
         Changes_When_Lowercased=F, Changes_When_Uppercased=F,
         Changes_When_Titlecased=F

For "ǈ", Lowercase=F, Uppercase=F, GC=Titlecase_Letter,
         Changes_When_Lowercased=T, Changes_When_Uppercased=T,
         Changes_When_Titlecased=F

and so on, where Lowercase and Uppercase are existing character properties.
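
The proposed character properties can be approximated with a short Python
sketch. It assumes that each property is TRUE when the corresponding case
mapping changes the NFD form of the character, and it uses Python's built-in
mappings (str.lower, str.upper, str.title, str.casefold) as stand-ins for the
default case operations. The function names are hypothetical, and results
reflect the Unicode version of the Python interpreter rather than the 5.2.0
beta data.

```python
import unicodedata

def _changes(ch, mapping):
    """True if applying the mapping to NFD(ch) yields a different string."""
    nfd = unicodedata.normalize("NFD", ch)
    return mapping(nfd) != nfd

def changes_when_lowercased(ch):
    return _changes(ch, str.lower)

def changes_when_uppercased(ch):
    return _changes(ch, str.upper)

def changes_when_titlecased(ch):
    # For a single character, str.title() applies its titlecase mapping.
    return _changes(ch, str.title)

def changes_when_casefolded(ch):
    return _changes(ch, str.casefold)

for ch in ("M", "m", "2", "\u01c9"):   # U+01C9 is the lowercase "lj" digraph
    print(repr(ch),
          changes_when_lowercased(ch),
          changes_when_uppercased(ch),
          changes_when_titlecased(ch))
```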

The relation to the string properties is straightforward and without surprises,
even if the naming doesn't correspond as closely as it did in the BETA.

These relations can be stated as follows:

Changes_When_Lowercased ==> this character cannot occur in lowercase strings
Changes_When_Uppercased ==> this character cannot occur in uppercase strings
Changes_When_Titlecased ==> this character cannot occur in titlecase strings (initial position)
Changes_When_Casefolded ==> this character cannot occur in casefolded strings

These relations should be added as comments to the data files and as annotations to chapter 3.

The following examples of strings and string properties in question make that clear:

  1. isUpperCaseString: "MARK DAVIS?", "ǇO LE", "2 BE OR NOT 2 BE"
  2. isTitleCaseString: "Mark Davis?", "ǈo Le", "2 Be Or Not 2 Be"
  3. isLowerCaseString: "mark davis?", "ǉo le", "2 be or not 2 be"

Note that, by the Unicode definition, an uppercase string consists of uppercase
characters and characters that don't change case, such as punctuation and
characters from non-cased scripts like Han, Hangul and Hieroglyphs. The "?" and
"2" are stand-ins for these characters in the examples.
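
The relation between the character properties and the string properties can
be checked on the uppercase and lowercase example strings with a small Python
sketch. It assumes Python's built-in str.upper and str.lower as stand-ins for
the default case operations, and the helper names are hypothetical. Titlecase
is left out because it depends on word boundaries, which the built-in
str.title does not determine in the same way as the Standard.

```python
import unicodedata

def is_stable_under(s, mapping):
    """String property: the mapping leaves NFD(s) unchanged."""
    nfd = unicodedata.normalize("NFD", s)
    return mapping(nfd) == nfd

def has_changing_char(s, mapping):
    """True if any single character of NFD(s) changes under the mapping."""
    nfd = unicodedata.normalize("NFD", s)
    return any(mapping(ch) != ch for ch in nfd)

uppercase_examples = ["MARK DAVIS?", "\u01c7O LE", "2 BE OR NOT 2 BE"]
lowercase_examples = ["mark davis?", "\u01c9o le", "2 be or not 2 be"]

for s in uppercase_examples:
    # No character changes when uppercased, and the string is uppercase.
    print(repr(s), is_stable_under(s, str.upper), has_changing_char(s, str.upper))

for s in lowercase_examples:
    # No character changes when lowercased, and the string is lowercase.
    print(repr(s), is_stable_under(s, str.lower), has_changing_char(s, str.lower))
```

A string that contains even one character that changes when lowercased, such
as "Mark", fails the lowercase string test, which is the direction of the
implication stated above.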

3. Places to which these changes need to be propagated

Note that in addition to changes to the DerivedCoreProperties.txt
file, this proposal also impacts the Chapter 3 text on default case operations,
the XML generation, as well as the text of UAX #44 and UAX #42.

[end]