L2/08-026

Date: Wed, 16 Jan 2008
From: Mark Davis
Subject: Case Gotchas

========

We really have three distinct notions of lowercase, and I think we ought to clarify them, since they are easy for people to stumble over.

A. Here are the three different notions:

1. Lowercase_Letter - the general category value
2. Lowercase - the binary property
3. isLowercase - the binary string function (from Section 3.13), which also applies to code points (as strings of length 1)

The fact that #1 differs from the others is not a big deal -- we all know the limitations of the general category. However, it's definitely a gotcha for the poor unsuspecting programmer who doesn't have our history.

The difference between #2 and #3 is also a bit tricky:

The Lowercase property is a question of form: does a character X have the form of a lowercase letter, even if it is operationally caseless (nothing lowercases to it)?
The isLowercase function is operational -- it describes how the character behaves in casing operations. It is defined on an argument X: if you apply the toLowercase function to X, does X change? If not, isLowercase(X) is true.
- In particular, uncased characters satisfy isLowercase because they don't change. That's what you want in a string function: "mark davis" is lowercase even though the space is uncased. But you have to remove the uncased characters to get closer to the code point property Lowercase.
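
As a rough illustration of how notions #1 and #3 diverge, here is a minimal Python sketch. The helper names are hypothetical, and it assumes that Python's str.lower() is an adequate stand-in for toLowercase; it also skips any normalization step that the formal definition in Section 3.13 may require. The unicodedata module has no lookup for the binary Lowercase property, so notion #2 is left out.

```python
import unicodedata


def is_lowercase_letter(ch: str) -> bool:
    """Notion #1: General_Category=Lowercase_Letter (Ll) for one code point."""
    return unicodedata.category(ch) == "Ll"


def is_lowercase(s: str) -> bool:
    """Notion #3 (approximation): lowercasing the string leaves it unchanged.

    Assumption: str.lower() stands in for toLowercase, and no
    normalization is applied first.
    """
    return s.lower() == s


if __name__ == "__main__":
    samples = ["mark davis", "Mark Davis", " ", "a", "\u1d2c"]
    for s in samples:
        cats = " ".join(unicodedata.category(c) for c in s)
        print(repr(s), "isLowercase:", is_lowercase(s), "categories:", cats)
```

The space and U+1D2C MODIFIER LETTER CAPITAL A both pass the operational test simply because lowercasing does not change them, while neither has the general category Ll.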

Here is the difference between #2 and #3 (after subtracting uncased):

http://unicode.org/cldr/utility/unicodeset.jsp?a=[:Lowercase:]&b=[[:isLowercase:]-[:^isCased:]]

One good thing is that the two sets don't merely overlap: Lowercase is a proper superset, containing 816 code points that are lowercase in form but functionally uncased.

For comparison, here is the difference between Lowercase and Lowercase_Letter.

http://unicode.org/cldr/utility/unicodeset.jsp?a=[:Lowercase:]&b=[:Lowercase_Letter:]

There are 157 code points with Lowercase=True that are not General_Category=Lowercase_Letter. These include modifier letters, Roman numerals, and circled letters.
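
A related check can be sketched in Python. Because the standard library does not expose the binary Lowercase property, this sketch compares the general category against an operational approximation instead: code points that lowercasing leaves unchanged and that are cased (some case operation changes them), yet are not Ll. The function names are hypothetical, str.lower(), str.upper(), and str.title() are assumed to approximate the case operations of Section 3.13, and the resulting set depends on the Unicode version bundled with the interpreter, so it will not reproduce the figure of 157 above.

```python
import unicodedata


def is_lowercase(s: str) -> bool:
    """Operational approximation: lowercasing leaves the string unchanged."""
    return s.lower() == s


def is_cased(s: str) -> bool:
    """Operational approximation: at least one case operation changes it."""
    return s.lower() != s or s.upper() != s or s.title() != s


def cased_lowercase_not_ll() -> list:
    """Collect code points that behave as lowercase but are not category Ll."""
    found = []
    for cp in range(0x110000):
        ch = chr(cp)
        cat = unicodedata.category(ch)
        if cat in ("Cn", "Cs", "Co"):  # skip unassigned, surrogate, private use
            continue
        if cat != "Ll" and is_lowercase(ch) and is_cased(ch):
            found.append(ch)
    return found


found = cased_lowercase_not_ll()

if __name__ == "__main__":
    for ch in found[:20]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "?"))
    print(len(found), "code points in this interpreter's Unicode version")
```

The small Roman numerals and circled small letters show up here because they have uppercase mappings; the modifier letters do not, since they are functionally uncased.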

B. The same applies to Uppercase, although the magnitudes are smaller.

http://unicode.org/cldr/utility/unicodeset.jsp?a=[:Uppercase:]&b=[[:isUppercase:]-[:^isCased:]]

http://unicode.org/cldr/utility/unicodeset.jsp?a=[:Uppercase:]&b=[:Uppercase_Letter:]

C. Titlecase is slightly different. There is no titlecase property, just the general category value. And because a single uppercase letter is also titlecase (operationally), the appropriate comparison is to (isTitlecase - isUppercase). Those two sets match exactly, as seen here:

http://unicode.org/cldr/utility/unicodeset.jsp?a=[:Titlecase_Letter:]&b=[[:istitlecase:]-[:isUppercase:]]
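
The (isTitlecase - isUppercase) comparison can be approximated for single code points in Python. This is only a sketch: the function names are hypothetical, and str.title() and str.upper() are assumed to stand in for toTitlecase and toUppercase, which is reasonable for one code point at a time but not for arbitrary strings.

```python
import unicodedata


def is_titlecase(ch: str) -> bool:
    """Operational approximation: titlecasing leaves the code point unchanged."""
    return ch.title() == ch


def is_uppercase(ch: str) -> bool:
    """Operational approximation: uppercasing leaves the code point unchanged."""
    return ch.upper() == ch


def titlecase_only(ch: str) -> bool:
    """True for code points that are titlecase but not also uppercase."""
    return is_titlecase(ch) and not is_uppercase(ch)


if __name__ == "__main__":
    # U+01C5 is LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON.
    for ch in ["\u01c5", "A", "a"]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              "isTitlecase:", is_titlecase(ch),
              "titlecase only:", titlecase_only(ch))
```

A plain "A" is operationally both titlecase and uppercase, which is why uppercase has to be subtracted before comparing against Titlecase_Letter.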

D. Proposal.

1. We should at least have some FAQs on this issue; I'm thinking of adding them to http://unicode.org/faq/casemap_charprop.html .

2. There is one other possible change we might consider for Unicode 5.1. Since the isLowercase function is defined by the result of applying toLowercase, we might make a slight change to the names in Section 3.13 to emphasize that each is the result of an operation:

D124 isLowercase => isLowercased
D125 isUppercase => isUppercased
D126 isTitlecase => isTitlecased

D127 is ok as it is, since we already use the past participle: isCaseFolded.

On the other hand, that might be too subtle to be worth the effort!

====

Another quirk worth documenting is in the names: some letters are called CAPITAL, but are Lowercase.

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:lowercase:]%26[:name=/CAPITAL/:]]

Even taking out the "SMALL CAPITAL"s, we are left with 23 such cases that are Lowercase=True. (Luckily, the General Category is not Lowercase_Letter.)


	
U+1D2C ( ᴬ ) MODIFIER LETTER CAPITAL A
U+1D2D ( ᴭ ) MODIFIER LETTER CAPITAL AE
U+1D2E ( ᴮ ) MODIFIER LETTER CAPITAL B
U+1D2F ( ᴯ ) MODIFIER LETTER CAPITAL BARRED B
U+1D30 ( ᴰ ) MODIFIER LETTER CAPITAL D
U+1D31 ( ᴱ ) MODIFIER LETTER CAPITAL E
U+1D32 ( ᴲ ) MODIFIER LETTER CAPITAL REVERSED E
U+1D33 ( ᴳ ) MODIFIER LETTER CAPITAL G
U+1D34 ( ᴴ ) MODIFIER LETTER CAPITAL H
U+1D35 ( ᴵ ) MODIFIER LETTER CAPITAL I
U+1D36 ( ᴶ ) MODIFIER LETTER CAPITAL J
U+1D37 ( ᴷ ) MODIFIER LETTER CAPITAL K
U+1D38 ( ᴸ ) MODIFIER LETTER CAPITAL L
U+1D39 ( ᴹ ) MODIFIER LETTER CAPITAL M
U+1D3A ( ᴺ ) MODIFIER LETTER CAPITAL N
U+1D3B ( ᴻ ) MODIFIER LETTER CAPITAL REVERSED N
U+1D3C ( ᴼ ) MODIFIER LETTER CAPITAL O
U+1D3D ( ᴽ ) MODIFIER LETTER CAPITAL OU
U+1D3E ( ᴾ ) MODIFIER LETTER CAPITAL P
U+1D3F ( ᴿ ) MODIFIER LETTER CAPITAL R
U+1D40 ( ᵀ ) MODIFIER LETTER CAPITAL T
U+1D41 ( ᵁ ) MODIFIER LETTER CAPITAL U
U+1D42 ( ᵂ ) MODIFIER LETTER CAPITAL W
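
The names and general categories in this range can be confirmed with a short Python sketch using the standard unicodedata module. It only reports the name, the general category, and whether the case operations change the character; it does not query the Lowercase property itself, which the standard library does not expose directly.

```python
import unicodedata

# U+1D2C..U+1D42: the 23 MODIFIER LETTER CAPITAL code points listed above.
rows = []
for cp in range(0x1D2C, 0x1D43):
    ch = chr(cp)
    unchanged = ch.lower() == ch and ch.upper() == ch  # no case mapping applies
    rows.append((cp, unicodedata.category(ch), unicodedata.name(ch), unchanged))

if __name__ == "__main__":
    for cp, cat, name, unchanged in rows:
        print(f"U+{cp:04X} {cat} {name} (unchanged by case operations: {unchanged})")
```

Each row should report the general category Lm (Modifier_Letter) rather than Ll or Lu, despite the CAPITAL in the name.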

--
Mark