DRAFT Unicode Technical Report #21
Case Mappings

Revision	2
Authors	Mark Davis
Date	1999-06-23
This Version	http://www.unicode.org/unicode/reports/tr21/tr21-2.html
Previous Version	n/a
Latest Version	http://www.unicode.org/unicode/reports/tr21
Unicode Technical Reports	http://www.unicode.org/unicode/techreports.html

Summary

This document presents guidelines for case operations: case conversion, case detection, and caseless matching.

Status of this document

This document has been considered and approved by the Unicode Technical Committee for publication as a Draft Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative, or as normative specification. Please mail corrigenda and other comments to Unicore@Unicode.org.

Introduction

The case mappings in the Unicode Character Database are informative, default mappings. Case itself, on the other hand, has normative status. Thus, for example, 0041 "A" is normatively uppercase, but its lowercase mapping to 0061 "a" is informative. The reason for this is that case can be considered to be an inherent property of a particular character (and is usually, but not always, derivable from the presence of the terms "CAPITAL" or "SMALL" in the character name), but case mappings between characters are occasionally influenced by local conventions.

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

Because of the inclusion of certain composite characters for compatibility, such as 01F1 "DZ" capital dz, there is a third case, called titlecase, which is used where the first letter of a word is to be capitalized (e.g. Titlecase, vs. UPPERCASE, or lowercase).
- For example, the titlecase of the example character is 01F2 "Dz" capital d with small z.
Case mappings may produce strings of different length than the original.
- For example, the German character 00DF "ß" small letter sharp s expands when uppercased to the sequence of two characters "SS". This also occurs where there is no precomposed character corresponding to a case mapping.
Characters may also have different case mappings, depending on the context.
- For example, 03A3 "Σ" capital sigma lowercases to 03C3 "σ" small sigma if it is followed by another letter, but lowercases to 03C2 "ς" small final sigma if it is not.
Characters may have case mappings that depend on the locale.
- For example, in Turkish the letter 0049 "I" capital letter i lowercases to 0131 "ı" small dotless i.
Case mappings are not, in general, reversible.
- For example, once the string "McGowan" has been uppercased, lowercased or titlecased, the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation.
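
The complications above can be observed with a short Python sketch. It relies on Python's built-in string methods, which follow a much later version of the Unicode Character Database than this draft, so it illustrates the behaviour rather than reproducing the 2.1.8 data. The variable names are chosen for illustration only. Python's built-in methods take no locale argument, so the Turkish example is not shown here.

```python
import unicodedata

# Titlecase is a third case: U+01F1 (DZ) has a distinct titlecase form.
dz_title = "\u01F1".title()
dz_title_category = unicodedata.category(dz_title)

# Case mapping can change string length: sharp s uppercases to two letters.
sharp_s_upper = "\u00DF".upper()

# Context sensitivity: capital sigma at the end of a word lowercases to
# final sigma, while capital sigma before another letter does not.
word_final = "\u039F\u0394\u039F\u03A3".lower()
word_initial = "\u03A3\u039F\u03A6".lower()

# Case mapping is not reversible: the original mixed case is lost.
round_trip = "McGowan".upper().lower()

print(repr(dz_title), dz_title_category)
print(sharp_s_upper, len("\u00DF"), len(sharp_s_upper))
print(word_final, word_initial)
print(round_trip, round_trip == "McGowan")
```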

For compatibility with existing parsers, UnicodeData.txt only contains case mappings for characters where they are one-to-one mappings; it also omits information about locale-specific case mappings. Information about these special cases can be found in a separate data file, SpecialCasing.txt, which has been added starting with the 2.1.8 update to the Unicode data files. SpecialCasing.txt contains additional informative case mappings that are either not one-to-one or context-sensitive. This data should be included when implementing case operations.
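
As a rough illustration of consuming this data, the following Python sketch parses a single entry. It assumes a semicolon-separated layout of the form `code; lower; title; upper; optional conditions; # comment`; this layout is an assumption here and should be checked against the header of the actual file. The function name and the shape of the returned dictionary are hypothetical.

```python
def parse_special_casing_line(line):
    """Parse one data line; return None for blank or comment-only lines."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if fields and fields[-1] == "":        # a trailing ';' leaves an empty field
        fields.pop()
    code, lower, title, upper = fields[:4]
    conditions = fields[4].split() if len(fields) > 4 else []

    def to_string(hex_codes):
        # "0053 0073" -> "Ss"
        return "".join(chr(int(h, 16)) for h in hex_codes.split())

    return {
        "char": to_string(code),
        "lower": to_string(lower),
        "title": to_string(title),
        "upper": to_string(upper),
        "conditions": conditions,          # e.g. context or locale tags
    }


entry = parse_special_casing_line(
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
)
print(entry)
```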

Guidelines

There are a number of fine points in case operations that programmers need to be aware of, especially with regard to titlecase. In all of the guidelines given below:

Start with the case mappings found in both the UnicodeData.txt and SpecialCasing.txt files.
Treat 0345 iota subscript as a lowercase letter.
For pre-Unicode 3.0 case mappings, treat all letters whose canonical decomposition ends with iota subscript as titlecase letters (these are fixed in Unicode 3.0).
A character is cased if it is marked as uppercase, lowercase, or titlecase (Lu, Ll, Lt).
For any character that is Lu, but whose uppercase mapping is not the same as its titlecase mapping, treat it as a new subtype Lud (distinct-uppercase letter) for the purposes of this document. This can be algorithmically determined from the data in UnicodeData.txt.
When looking at preceding or following letters, disregard any intervening non-spacing marks.
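
Some of these definitions can be sketched in Python as follows. The sketch uses the `unicodedata` module and the built-in string methods as a stand-in for lookups in UnicodeData.txt, so its results reflect the Unicode version bundled with Python rather than the data files discussed here. The function names are hypothetical, and the special treatment of 0345 iota subscript is not modeled.

```python
import unicodedata


def is_cased(ch):
    """A character is cased if its general category is Lu, Ll, or Lt."""
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def is_lud(ch):
    """Distinct-uppercase: Lu, but uppercase and titlecase mappings differ."""
    return unicodedata.category(ch) == "Lu" and ch.upper() != ch.title()


def preceding_char(s, i):
    """Character before index i, skipping non-spacing marks (Mn), or None."""
    j = i - 1
    while j >= 0 and unicodedata.category(s[j]) == "Mn":
        j -= 1
    return s[j] if j >= 0 else None


print(is_lud("\u01F1"), is_lud("A"))
print(preceding_char("a\u0301b", 2))
```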

Case Conversion of Strings

`toUppercase`

Map each character to its uppercase.

Remember to use special mappings (e.g. Turkish) for some locales.
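
A minimal Python sketch of this operation is shown below. The `locale` parameter and the function name are hypothetical, and only one locale-specific rule is modeled: in Turkish, small i uppercases to 0130 capital i with dot. Everything else is delegated to Python's built-in `str.upper`, which applies the default mappings of the Unicode version bundled with Python.

```python
def to_uppercase(s, locale=""):
    """Uppercase a string, applying a Turkish special mapping if requested."""
    if locale == "tr":
        # Turkish: small dotted i maps to capital I with dot above (U+0130).
        s = s.replace("i", "\u0130")
    # Default mappings, including one-to-many cases such as sharp s -> SS.
    return s.upper()


print(to_uppercase("stra\u00DFe"))
print(to_uppercase("istanbul", locale="tr"))
```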

`toLowercase`

Map each character to its lowercase.

Remember to handle FINAL and NON_FINAL mappings correctly (e.g. Greek CAPITAL SIGMA). If the letter is followed by a cased letter, choose the NON_FINAL form, otherwise choose the FINAL form.
Remember to use special mappings (e.g. Turkish) for some locales.
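
The rules above might be sketched in Python as follows. The function names and the `locale` parameter are hypothetical. The sigma rule is implemented exactly as stated here (FINAL unless a cased letter follows, disregarding non-spacing marks), which differs in edge cases from what Python's own `str.lower` does; only the Turkish mappings for 0049 and 0130 are modeled.

```python
import unicodedata


def is_cased(ch):
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def followed_by_cased(s, i):
    """True if the next character after index i, skipping Mn, is cased."""
    j = i + 1
    while j < len(s) and unicodedata.category(s[j]) == "Mn":
        j += 1
    return j < len(s) and is_cased(s[j])


def to_lowercase(s, locale=""):
    out = []
    for i, ch in enumerate(s):
        if ch == "\u03A3":                       # GREEK CAPITAL LETTER SIGMA
            # NON_FINAL (U+03C3) before a cased letter, else FINAL (U+03C2).
            out.append("\u03C3" if followed_by_cased(s, i) else "\u03C2")
        elif locale == "tr" and ch == "I":
            out.append("\u0131")                 # Turkish: I -> dotless i
        elif locale == "tr" and ch == "\u0130":
            out.append("i")                      # Turkish: dotted I -> i
        else:
            out.append(ch.lower())
    return "".join(out)


print(to_lowercase("\u039F\u0394\u039F\u03A3"))
print(to_lowercase("ISPARTA", locale="tr"))
```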

`toTitlecase`

Map each character to its titlecase or lowercase. If the preceding letter is cased, choose the lowercase mapping; otherwise choose the titlecase mapping (in most cases, this will be the same as the uppercase, but not always).

Remember to handle FINAL and NON_FINAL mappings correctly (e.g. Greek CAPITAL SIGMA). If the letter is followed by a cased letter, choose the NON_FINAL form, otherwise choose the FINAL form.
Remember to use special mappings (e.g. Turkish) for some locales.
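
The core of this operation, choosing between the titlecase and lowercase mapping based on the preceding letter, can be sketched in Python as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, non-spacing marks are passed through unchanged and disregarded when looking at the preceding letter, and the sigma and locale-specific rules are left out.

```python
import unicodedata


def is_cased(ch):
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def to_titlecase(s):
    out = []
    prev_cased = False                 # start of string: nothing precedes
    for ch in s:
        if unicodedata.category(ch) == "Mn":
            out.append(ch)             # marks do not affect the context
            continue
        # Lowercase after a cased letter, titlecase otherwise.
        out.append(ch.lower() if prev_cased else ch.title())
        prev_cased = is_cased(ch)
    return "".join(out)


print(to_titlecase("hello world"))
print(to_titlecase("McGOWAN"))
```

Note that the titlecase mapping is not always the uppercase mapping: with this sketch, 01F1 at the start of a word becomes 01F2 rather than staying unchanged.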

Case Detection for Strings

`isLowercase`

is true if both the following conditions are true:

there is at least one cased character in the string
there are no titlecase, uppercase, or distinct-uppercase characters

`isUppercase`

is true if both the following conditions are true:

there is at least one cased character in the string
there are no titlecase or lowercase characters

`isTitlecase`

is true if all four of the following conditions are true:

there is at least one cased character in the string
there are no distinct-uppercase (Lud) characters
any lowercase letters must follow cased characters
there are no titlecase or uppercase letters, except following uncased characters
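
The three detection tests can be sketched in Python as follows. The function names are hypothetical, general categories come from Python's `unicodedata` module rather than from the data files discussed here, and two simplifying assumptions are made: the start of the string is treated as an uncased context, and the special treatment of 0345 iota subscript is not modeled.

```python
import unicodedata

CASED = ("Lu", "Ll", "Lt")


def is_lud(ch):
    return unicodedata.category(ch) == "Lu" and ch.upper() != ch.title()


def is_lowercase(s):
    cats = [unicodedata.category(c) for c in s]
    # Lud is a subtype of Lu, so excluding Lu excludes it as well.
    return any(c in CASED for c in cats) and not any(c in ("Lu", "Lt") for c in cats)


def is_uppercase(s):
    cats = [unicodedata.category(c) for c in s]
    return any(c in CASED for c in cats) and not any(c in ("Ll", "Lt") for c in cats)


def is_titlecase(s):
    seen_cased = False
    prev_cased = False
    for ch in s:
        cat = unicodedata.category(ch)
        if cat == "Mn":
            continue                           # disregard non-spacing marks
        if is_lud(ch):
            return False                       # no distinct-uppercase letters
        if cat == "Ll" and not prev_cased:
            return False                       # lowercase must follow cased
        if cat in ("Lu", "Lt") and prev_cased:
            return False                       # upper/title only after uncased
        prev_cased = cat in CASED
        seen_cased = seen_cased or prev_cased
    return seen_cased


print(is_lowercase("hello world"), is_uppercase("HELLO 42"), is_titlecase("Hello World"))
```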

Caseless Matching

Caseless matching itself is only an approximation to the language-specific rules governing the strength of comparisons. Generally, where case distinctions are not important, other distinctions between Unicode characters are ignored as well. This can be derived from the collation data for the language, where only the first and second level differences are used. For more information, see UTR #10: Unicode Collation Algorithm.

However, there are many circumstances where a locale-insensitive loose match is required. The best way to do this is by using the following test:

toUppercase(canonicalDecomposition(a)) == toUppercase(canonicalDecomposition(b))

However, if the strings are in a normalized form (see UTR #15: Unicode Normalization Forms), a simpler test can be used:

toUppercase(a) == toUppercase(b)

Neither of these takes account of locale variations (e.g. Turkish). If a locale-insensitive match is required, then the recommendation is to use a slightly looser match by also mapping 0130 "İ" capital i with dot to 0049 "I" capital i. In this case, the distinction between the dotted and undotted I's is lost.
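
A minimal Python sketch of these tests is given below. It uses `unicodedata.normalize("NFD", ...)` as the canonical decomposition and `str.upper` as the uppercase operation; both follow the Unicode version bundled with Python. The function name and the `fold_dotted_i` flag are hypothetical. The flag applies the looser match by mapping 0130 to 0049 before decomposition, and does not handle input in which that character is already decomposed.

```python
import unicodedata


def caseless_equal(a, b, fold_dotted_i=False):
    """Compare two strings using toUppercase(canonicalDecomposition(x))."""
    def key(s):
        if fold_dotted_i:
            # Looser match: capital I with dot (U+0130) is treated as "I".
            s = s.replace("\u0130", "I")
        return unicodedata.normalize("NFD", s).upper()
    return key(a) == key(b)


print(caseless_equal("Stra\u00DFe", "STRASSE"))
print(caseless_equal("\u00E9", "E\u0301"))
print(caseless_equal("\u0130", "i"), caseless_equal("\u0130", "i", fold_dotted_i=True))
```

If both strings are known to be in the same normalized form already, the call to `normalize` can be dropped, which corresponds to the simpler test.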

Copyright

Copyright © 1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
