Date: Wed, 16 Jan 2008
From: Mark Davis
Subject: Case Gotchas
We really have three distinct notions of lowercase, and I think we
ought to clarify them, since they are easy for people to stumble over.
A. Here are the three different notions:
1. Lowercase_Letter - the general category
2. Lowercase - the binary property
3. isLowercase - the binary string function (from Section 3.13) -- but
also applies to code points (as strings of length 1)
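To make the three notions concrete for a single code point, here is a minimal Python sketch. It rests on two assumptions that are not part of this message: that CPython's str.islower() on a one-character string reflects the Lowercase binary property (an implementation detail), and that isLowercase can be approximated as "toLowercase leaves NFD(X) unchanged". The helper name lowercase_notions is hypothetical.

```python
import unicodedata

def lowercase_notions(ch):
    """Return (gc_is_Ll, has_Lowercase_property, isLowercase) for one code point."""
    # Notion 1: General_Category = Lowercase_Letter (Ll).
    gc_is_ll = unicodedata.category(ch) == "Ll"
    # Notion 2: the Lowercase binary property.
    # ASSUMPTION: CPython's one-character str.islower() tracks this property.
    has_property = ch.islower()
    # Notion 3: operational check, lowercasing NFD(ch) changes nothing.
    y = unicodedata.normalize("NFD", ch)
    is_lowercased = y.lower() == y
    return gc_is_ll, has_property, is_lowercased

for ch in ("a", "A", " ", "\u1d2c"):  # U+1D2C is MODIFIER LETTER CAPITAL A
    print(f"U+{ord(ch):04X}", lowercase_notions(ch))
```

Under these assumptions, a space should come out lowercase only in the third sense, and U+1D2C only in the second and third.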
The fact that #1 differs from the others is not a big deal -- we all
know the limitations of the general category. However, it's definitely a
Gotcha for the poor unsuspecting programmer who doesn't have our history.
The difference between #2 and #3 is also a bit tricky:
- The Lowercase property is a question of form: does a
character X have the form of a lowercase letter (even if it is
operationally caseless: nothing lowercases to it)?
- The isLowercase function is an operational function -- how
the character behaves in casing operations. It is defined operationally
on an argument X: if you apply the toLowercase function to X, does it
leave X unchanged?
- In particular, uncased characters count as lowercase under
isLowercase because they don't change. That's what you want in a string
function: "mark davis" is lowercase even though the space is uncased.
But you have to remove uncased characters to be closer to the code
point property.
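The operational function and the "subtract uncased" step can be sketched in Python as follows. This is only an illustration: it assumes that "uncased" means "no case operation changes the code point", which is one plausible reading and not necessarily the definition behind the figures in this message. All helper names are hypothetical.

```python
import unicodedata

def is_lowercased(s):
    """Operational check: toLowercase leaves NFD(s) unchanged."""
    y = unicodedata.normalize("NFD", s)
    return y.lower() == y

def is_operationally_uncased(ch):
    """ASSUMPTION: 'uncased' means lower, upper and title all leave ch alone."""
    return ch.lower() == ch and ch.upper() == ch and ch.title() == ch

def is_functionally_lowercase(ch):
    """isLowercase for one code point, with uncased code points subtracted."""
    return is_lowercased(ch) and not is_operationally_uncased(ch)

print(is_lowercased("mark davis"))     # the uncased space does not get in the way
print(is_functionally_lowercase(" "))  # the space on its own is subtracted
print(is_functionally_lowercase("a"))
```

Treating the string check and the per-code-point check as separate functions keeps the string behaviour intact while allowing a closer comparison with the code point property.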
Here is the difference between #2 and #3 (after subtracting uncased):
One good thing is that they aren't just disjoint: Lowercase is a superset,
containing 816 Code Points that are lowercase in form, but functionally
For comparison, here is the difference between Lowercase and
General_Category=Lowercase_Letter:
There are 157 Code Points in Lowercase=True that are not
General_Category=Lowercase_Letter. These include Modifier letters, Roman
Numerals, and Circled letters.
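A rough way to reproduce this kind of comparison locally is sketched below in Python. It assumes CPython's one-character str.islower() approximates Lowercase=True (an implementation detail, not something from this message); the function name lowercase_not_ll is hypothetical. The count it prints depends on the Unicode version bundled with the interpreter, so it need not equal the figure quoted above.

```python
import unicodedata
from collections import Counter

def lowercase_not_ll():
    """Code points with Lowercase=True (approximated) whose gc is not Ll."""
    found = []
    for cp in range(0x110000):
        ch = chr(cp)
        # ASSUMPTION: one-character str.islower() approximates Lowercase=True.
        if ch.islower() and unicodedata.category(ch) != "Ll":
            found.append(cp)
    return found

extra = lowercase_not_ll()
# Group the extra code points by general category (Lm, Nl, So, ...).
by_category = Counter(unicodedata.category(chr(cp)) for cp in extra)
print(len(extra), dict(by_category))
```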
B. The same applies to Uppercase, although the magnitudes are smaller.
C. Titlecase is slightly different. There is no titlecase property, just the
general category value. And because a single uppercase letter is also
titlecase (operationally), the appropriate comparison is to (isTitlecase -
isUppercase). Those two sets match exactly, as seen here:
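For readers who want to repeat this comparison locally, a rough Python sketch follows; it is not the tool used for the comparison above. It works on single code points, skips the NFD step for simplicity, and uses str.title() and str.upper() as stand-ins for toTitlecase and toUppercase. The function name titlecase_sets is hypothetical, and the result depends on the Unicode version bundled with the interpreter, so the two sets need not match exactly there.

```python
import unicodedata

def titlecase_sets():
    """Return (gc_lt, operational) as sets of code points."""
    gc_lt, operational = set(), set()
    for cp in range(0x110000):
        ch = chr(cp)
        # General category Titlecase_Letter.
        if unicodedata.category(ch) == "Lt":
            gc_lt.add(cp)
        # isTitlecase minus isUppercase: unchanged by titlecasing,
        # but changed by uppercasing.
        if ch.title() == ch and ch.upper() != ch:
            operational.add(cp)
    return gc_lt, operational

gc_lt, operational = titlecase_sets()
print(len(gc_lt), len(operational), len(gc_lt ^ operational))
```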
1. My thinking is that we should at least have some FAQs on this issue; I'm
2. There is one other possible change we might consider, in Unicode 5.1.
Since the isLowercase function is the result of applying toLowercase, we
might make a slight change to Section 3.13 to emphasize that it is the
result of an operation:
D124 isLowercase => isLowercased
D125 isUppercase => isUppercased
D126 isTitlecase => isTitlecased
D127 is ok as it is, since we already use the past participle:
On the other hand, that might be too subtle to be worth the effort!
Another quirk worth documenting in the names is that some letters are called
CAPITAL, but are Lowercase.
Even taking out the "SMALL CAPITAL"s, we are left with 23 such cases that
are Lowercase=True. (Luckily, the General Category is not Lowercase_Letter.)
( ᴬ ) MODIFIER LETTER CAPITAL A
( ᴭ ) MODIFIER LETTER CAPITAL AE
( ᴮ ) MODIFIER LETTER CAPITAL B
( ᴯ ) MODIFIER LETTER CAPITAL BARRED B
( ᴰ ) MODIFIER LETTER CAPITAL D
( ᴱ ) MODIFIER LETTER CAPITAL E
( ᴲ ) MODIFIER LETTER CAPITAL REVERSED E
( ᴳ ) MODIFIER LETTER CAPITAL G
( ᴴ ) MODIFIER LETTER CAPITAL H
( ᴵ ) MODIFIER LETTER CAPITAL I
( ᴶ ) MODIFIER LETTER CAPITAL J
( ᴷ ) MODIFIER LETTER CAPITAL K
( ᴸ ) MODIFIER LETTER CAPITAL L
( ᴹ ) MODIFIER LETTER CAPITAL M
( ᴺ ) MODIFIER LETTER CAPITAL N
( ᴻ ) MODIFIER LETTER CAPITAL REVERSED N
( ᴼ ) MODIFIER LETTER CAPITAL O
( ᴽ ) MODIFIER LETTER CAPITAL OU
( ᴾ ) MODIFIER LETTER CAPITAL P
( ᴿ ) MODIFIER LETTER CAPITAL R
( ᵀ ) MODIFIER LETTER CAPITAL T
( ᵁ ) MODIFIER LETTER CAPITAL U
( ᵂ ) MODIFIER LETTER CAPITAL W
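A list like the one above could be regenerated with a short Python sketch along these lines. It assumes CPython's one-character str.islower() approximates Lowercase=True (an implementation detail), and the function name capital_named_lowercase is hypothetical. With newer Unicode data the result may contain more entries than the 23 listed here.

```python
import unicodedata

def capital_named_lowercase():
    """Code points named CAPITAL (not SMALL CAPITAL) that test as Lowercase."""
    hits = []
    for cp in range(0x110000):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unnamed code points
        if "CAPITAL" not in name or "SMALL CAPITAL" in name:
            continue
        # ASSUMPTION: one-character str.islower() approximates Lowercase=True.
        if ch.islower():
            hits.append((cp, name))
    return hits

for cp, name in capital_named_lowercase():
    print(f"U+{cp:04X} ( {chr(cp)} ) {name}")
```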