DRAFT Unicode Technical Report #21
Case Mappings

Revision 2
Authors Mark Davis
Date 1999-06-23
This Version http://www.unicode.org/unicode/reports/tr21/tr21-2.html
Previous Version n/a
Latest Version http://www.unicode.org/unicode/reports/tr21
Unicode Technical Reports http://www.unicode.org/unicode/techreports.html

Summary

This document presents guidelines for case operations: case conversion, case detection, and caseless matching.

Status of this document

This document has been considered and approved by the Unicode Technical Committee for publication as a Draft Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative, or as normative specification. Please mail corrigenda and other comments to Unicore@Unicode.org.


Introduction

The case mappings in the Unicode Character Database are informative, default mappings. Case itself, on the other hand, has normative status. Thus, for example, 0041 "A" is normatively uppercase, but its lowercase mapping to 0061 "a" is informative. The reason for this is that case can be considered to be an inherent property of a particular character (and is usually, but not always, derivable from the presence of the terms "CAPITAL" or "SMALL" in the character name), but case mappings between characters are occasionally influenced by local conventions.

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

For compatibility with existing parsers, UnicodeData.txt contains only those case mappings that are one-to-one; it also omits information about locale-specific case mappings. Information about these special cases can be found in a separate data file, SpecialCasing.txt, which has been added starting with the 2.1.8 update to the Unicode data files. SpecialCasing.txt contains additional informative case mappings that are either not one-to-one or context-sensitive. This data should be included when implementing case operations.
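As an illustration of why the one-to-one mappings alone are insufficient, the following Python sketch shows a mapping that changes the length of the string and a mapping that depends on context. It relies on Python's built-in str.upper() and str.lower(), which apply full case mappings from the Unicode data bundled with the interpreter (a later version than the 2.1.8 files discussed here). The helper name expands_on_uppercase is hypothetical and used only for this example.

```python
def expands_on_uppercase(ch):
    """Return True if uppercasing ch yields more characters than it started with."""
    return len(ch.upper()) > len(ch)


# Not one-to-one: 00DF "ß" uppercases to the two characters "SS".
sharp_s_upper = "\u00df".upper()

# Context-sensitive: 03A3 capital sigma lowercases to the final form 03C2
# at the end of a word, and to 03C3 elsewhere.
final_sigma = "\u039f\u0394\u039f\u03a3".lower()   # ends in 03C2
medial_sigma = "\u03a3\u0391".lower()              # starts with 03C3
```

An implementation that reads only UnicodeData.txt would leave "ß" unchanged on uppercasing and would never produce the final sigma.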

Guidelines

There are a number of fine points in case operations that programmers need to be aware of, especially with regard to titlecase. In all of the guidelines given below:

Case Conversion of Strings

toUppercase

Map each character to its uppercase.

toLowercase

Map each character to its lowercase.

toTitlecase

Map each character to its titlecase or lowercase. If the preceding letter is cased, choose the lowercase mapping; otherwise choose the titlecase mapping (in most cases, this will be the same as the uppercase, but not always).
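The toTitlecase rule can be sketched in Python as follows. This is a minimal sketch under two assumptions that are not part of this report: "cased" is approximated by the general categories Lu, Ll, and Lt, and the per-character titlecase mapping is obtained by calling str.title() on a single character. The names is_cased and to_titlecase are hypothetical.

```python
import unicodedata


def is_cased(ch):
    """Approximate 'cased' as general category Lu, Ll, or Lt (an assumption)."""
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def to_titlecase(s):
    """Lowercase a character that follows a cased character; titlecase it otherwise."""
    out = []
    prev_cased = False
    for ch in s:
        # str.title() on a single character yields its titlecase mapping.
        out.append(ch.lower() if prev_cased else ch.title())
        prev_cased = is_cased(ch)
    return "".join(out)
```

For example, to_titlecase("hello wORLD") gives "Hello World". The digraph 01C6 "ǆ" shows where titlecase and uppercase differ: its titlecase mapping is 01C5 "ǅ", while its uppercase mapping is 01C4 "Ǆ". Because the sketch follows the rule literally, a character after an apostrophe is titlecased, since the apostrophe is not cased.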

Case Detection for Strings

isLowercase

is true if both the following conditions are true:

isUppercase

is true if both the following conditions are true:

isTitlecase

is true if all four of the following conditions are true:

Caseless Matching

Caseless matching itself is only an approximation to the language-specific rules governing the strength of comparisons. Generally, where case distinctions are not important, other distinctions between Unicode characters are ignored as well. This can be derived from the collation data for the language, where only the first and second level differences are used. For more information, see UTR #10: Unicode Collation Algorithm.

However, there are many circumstances where a locale-insensitive loose match is required. The best way to do this is by using the following test:

toUppercase(canonicalDecomposition(a)) == toUppercase(canonicalDecomposition(b))

However, if the strings are in a normalized form (see UTR #15: Unicode Normalization Forms), a simpler test can be used:

toUppercase(a) == toUppercase(b)
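Both tests can be sketched in Python, using unicodedata.normalize("NFD", ...) for the canonical decomposition and str.upper() for toUppercase. The function names caseless_match and caseless_match_normalized are hypothetical, and the sketch reflects the Unicode data shipped with the interpreter rather than the data files current for this report.

```python
import unicodedata


def caseless_match(a, b):
    """Compare the uppercased canonical decompositions of a and b."""
    da = unicodedata.normalize("NFD", a)
    db = unicodedata.normalize("NFD", b)
    return da.upper() == db.upper()


def caseless_match_normalized(a, b):
    """Simpler test; assumes a and b are already in the same normalized form."""
    return a.upper() == b.upper()
```

The decomposition step matters when the inputs are not normalized: caseless_match("\u00e9", "E\u0301") is true, whereas the simpler test reports a mismatch for the same pair because the precomposed and decomposed forms differ after uppercasing.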

Neither of these tests takes account of locale variations (e.g. Turkish). If a locale-insensitive match is required, then the recommendation is to use a slightly looser match by also mapping 0130 "İ" (capital I with dot above) to 0049 "I" (capital I). In this case, the distinction between the dotted and undotted I's is lost.
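A minimal Python sketch of the looser match follows. The name loose_key is hypothetical. One design choice here is an assumption rather than something this report specifies: the mapping of 0130 to 0049 is applied before decomposition, because 0130 canonically decomposes to 0049 followed by 0307 (combining dot above) and would no longer be present afterwards. Input that already contains the decomposed sequence is not handled by this sketch.

```python
import unicodedata


def loose_key(s):
    """Build a comparison key that folds 0130 to 0049 before the usual test."""
    # Map precomposed capital I with dot above to plain capital I first.
    folded = s.replace("\u0130", "I")
    return unicodedata.normalize("NFD", folded).upper()


def loose_match(a, b):
    """Locale-insensitive loose match that also ignores dotted/undotted I."""
    return loose_key(a) == loose_key(b)
```

With this key, 0049 "I", 0069 "i", 0130 "İ", and 0131 "ı" all compare equal, which is the loss of distinction described above.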


Copyright

Copyright © 1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.