Unicode Technical Report #21

Case Mappings

Revision 3.0
Authors Mark Davis (mark.davis@us.ibm.com)
Date 1999-11-03
This Version http://www.unicode.org/unicode/reports/tr21/tr21-3
Previous Version http://www.unicode.org/unicode/reports/tr21/tr21-2
Latest Version http://www.unicode.org/unicode/reports/tr21

Summary

This document presents implementation guidelines for case operations: case conversion, case detection, and caseless matching.

Status

This document contains informative material which has been considered and approved by the Unicode Technical Committee for publication as a Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the Standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative or as normative specification. Please mail corrigenda and other comments to the author.

The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions for more information.

1 Introduction

Case is a normative property of characters in specific alphabets (Latin, Greek, Cyrillic, Armenian, and archaic Georgian) whereby characters are considered to be variants of a single letter. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lowercase letter (also known as small or minuscule). The uppercase letter is generally larger than the lowercase letter. Alphabets with case differences are called bicameral; those without are called unicameral.

Because of the inclusion of certain composite characters for compatibility, such as U+01F1 "DZ" LATIN CAPITAL LETTER DZ, there is a third case, called titlecase, which is used where the first character of a word is to be capitalized. An example of such a character is U+01F2 "Dz" LATIN CAPITAL LETTER D WITH SMALL LETTER Z. The three case forms are UPPERCASE, Titlecase, lowercase. Note that while the archaic Georgian script contained upper- and lowercase pairs, they are rarely used in modern Georgian.

The case mappings in the Unicode Character Database are informative, default mappings. Case itself, on the other hand, has normative status. Thus, for example, U+0041 "A" is normatively uppercase, but its lowercase mapping to U+0061 "a" is informative. The reason for this is that case can be considered to be an inherent property of a particular character (and is usually, but not always, derivable from the presence of the terms "CAPITAL" or "SMALL" in the character name), but case mappings between characters are occasionally influenced by local conventions.

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

The Unicode Character Database contains two files with case mapping information. For compatibility with existing parsers, UnicodeData.txt only contains case mappings for characters where they are one-to-one mappings; it also omits information about locale-specific case mappings. Information about these special cases can be found in a separate data file, SpecialCasing.txt, which has been added to the Unicode Character Database starting with the 2.1.8 update. SpecialCasing.txt contains additional informative case mappings that either are not one-to-one or are context-sensitive. This data should be included when implementing case operations.
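As a rough illustration of how the extra mappings might be loaded, the following Python sketch parses a few lines written in the semicolon-separated layout used by SpecialCasing.txt (code; lower; title; upper; optional condition list; comment). The sample lines, the function names `decode` and `parse_special_casing`, and the dictionary layout are assumptions made for this example; a real implementation would read the data file that matches its version of the Unicode Character Database.

```python
# Illustrative lines in the layout of SpecialCasing.txt:
#   code; lower; title; upper; (condition_list;)? # comment
# These three lines are a hand-picked sample, not the whole file.
SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
FB01; FB01; 0046 0069; 0046 0049; # LATIN SMALL LIGATURE FI
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
"""


def decode(field):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing(text):
    """Return {character: [mapping, ...]} for each data line in text."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        code, lower, title, upper = fields[:4]
        # A fifth field, when non-empty, lists conditions such as a locale.
        conditions = fields[4].split() if len(fields) > 4 else []
        table.setdefault(decode(code), []).append({
            "lower": decode(lower),
            "title": decode(title),
            "upper": decode(upper),
            "conditions": conditions,
        })
    return table


special = parse_special_casing(SAMPLE)
print(special["\u00df"][0]["upper"])  # SS
```

Entries with a non-empty condition list (here the hypothetical sample line tagged `tr`) would only be applied when the stated condition holds; the sketch records the conditions but does not evaluate them.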

1.2 Reversibility

It is important to note that casing operations do not always provide a round-trip mapping. For example,

upper(lower(“John Brown”)) → “JOHN BROWN”

lower(upper(“John Brown”)) → “john brown”.

There are even single words like vederLa in Italian or the name McGowan in English, which are neither upper, lower, nor titlecase. This format is sometimes called innerCaps, and is often used in programming and in Web names. There are also single characters that do not have reversible mappings, such as the sigmas above. Since many characters are really caseless (most of the IPA block, for example) and have no matching uppercase, uppercasing a string does not mean that it will no longer contain any lowercase letters.
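The loss of information can be observed with any library that implements the default case mappings. The sketch below uses Python's built-in `str.upper` and `str.lower`, which is an assumption about the environment rather than part of this report; the variable names are illustrative only.

```python
name = "John Brown"
print(name.lower().upper())  # JOHN BROWN
print(name.upper().lower())  # john brown

# U+03C2 GREEK SMALL LETTER FINAL SIGMA uppercases to U+03A3, which
# lowercases to U+03C3 when it stands alone, so the original is lost.
final_sigma = "\u03c2"
restored = final_sigma.upper().lower()
print(restored == final_sigma)  # False

# U+0138 LATIN SMALL LETTER KRA is lowercase but has no uppercase mapping,
# so an "uppercased" string can still contain lowercase letters.
kra = "\u0138"
print(kra.upper() == kra and kra.islower())  # True
```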

For word processors that use a single command-key sequence to toggle the selection through different casings, it is recommended to save the original string and to return to it as one step of the toggle sequence. The user interface would produce the following results in response to a series of command-keys. Notice that the original string is restored on every fourth keypress.

  1. The quick brown

  2. THE QUICK BROWN

  3. the quick brown

  4. The Quick Brown

  5. The quick brown (repeating from here on)
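The toggle sequence listed above can be sketched as a small state machine that keeps the original string and derives every other form from it. The class name `CaseToggler` and its `press` method are hypothetical, and Python's `str.title` stands in for a real titlecasing routine (it treats any non-letter as a word boundary, which may not suit all text).

```python
class CaseToggler:
    """Cycle a selection through original, upper, lower, and title forms."""

    def __init__(self, original):
        self.original = original  # saved so it can be restored exactly
        self.step = 0             # 0 means the original is displayed

    def press(self):
        """Advance one step and return the string to display."""
        self.step = (self.step + 1) % 4
        forms = (
            self.original,
            self.original.upper(),
            self.original.lower(),
            self.original.title(),
        )
        return forms[self.step]


toggler = CaseToggler("The quick brown")
for _ in range(4):
    print(toggler.press())
# THE QUICK BROWN
# the quick brown
# The Quick Brown
# The quick brown
```

Because each form is computed from the saved original rather than from the previously displayed string, the fourth keypress restores the text exactly, even when the conversions themselves are not reversible.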

Uppercase, titlecase, and lowercase can be represented in a word processor by using a character style. Removing the character style restores the text to its original state. However, if this approach is taken, any spell-checking software needs to be aware of the case style so that it can check the spelling according to the actual appearance.

2 Guidelines

There are a number of fine points in case operations that programmers need to be aware of in doing case conversion, case detection, and caseless matching. In all of the guidelines given below:

2.1 Case Conversion of Strings

Converting to Uppercase

Map each character to its uppercase.

Converting to Lowercase

Map each character to its lowercase.

Converting to Titlecase

Map each character to its titlecase or lowercase. If the preceding letter is cased, choose the lowercase mapping; otherwise choose the titlecase mapping (in most cases, this will be the same as the uppercase, but not always).
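A minimal sketch of this rule in Python follows. It assumes that calling `str.title` on a single character yields that character's titlecase mapping and that a character is cased when it is uppercase, lowercase, or titlecase; the helper names `is_cased` and `to_titlecase` are invented for the example, and context-sensitive mappings are not handled.

```python
def is_cased(ch):
    """True if the single character ch is upper-, lower-, or titlecase."""
    return ch.isupper() or ch.islower() or ch.istitle()


def to_titlecase(text):
    """Titlecase a character unless the preceding character is cased."""
    out = []
    prev_cased = False
    for ch in text:
        if prev_cased:
            out.append(ch.lower())   # inside a word: lowercase mapping
        else:
            out.append(ch.title())   # start of a word: titlecase mapping
        prev_cased = is_cased(ch)
    return "".join(out)


print(to_titlecase("the QUICK brown"))  # The Quick Brown

# U+01C6 (dz with caron, small) titlecases to U+01C5, not to the
# uppercase form U+01C4.
print(to_titlecase("\u01c6ungla") == "\u01c5ungla")  # True
```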

2.2 Case Detection for Strings

Detecting Uppercase

A string is uppercase if both the following conditions are true:

Detecting Lowercase

A string is lowercase if both the following conditions are true:

Detecting Titlecase

A string is titlecase if all four of the following conditions are true:

2.3 Caseless Matching

Caseless matching is commonly implemented using case-folding. The latter is the process of mapping strings to a normalized form where case differences are erased. Case-folding allows for fast caseless matches in lookups. Caseless matching itself is only an approximation to the language-specific rules governing the strength of comparisons. Where locale-sensitive case matching is used, this information can be derived from the collation data for the language, where only the first and second level differences are used. For more information, see UTR #10: Unicode Collation Algorithm.

However, in most environments, such as in file systems, text is not tagged with locale information. In such cases, the locale-specific mappings should not be used. Otherwise data structures, such as B-trees, might be built based on one set of case foldings, and used based on a different set, which will cause the B-trees to become corrupt. For those environments, a constant, locale-independent case folding should be used.

The CaseFolding.txt file can be used for doing such a locale-independent case folding. This file was generated from the Unicode Character Database, using both the one-to-one mappings and the one-to-many mappings. It folds all characters having different case forms together into a common form. To compare two strings for caseless matching, you can fold each string using this data, and then use a binary comparison.
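The fold-then-compare approach might look like the following in Python. The sketch relies on the built-in `str.casefold`, which applies the case folding shipped with the interpreter's Unicode data instead of reading CaseFolding.txt directly; that substitution, and the function name `caseless_equal`, are assumptions of this example.

```python
def caseless_equal(a, b):
    """Compare two strings after locale-independent case folding."""
    return a.casefold() == b.casefold()


print(caseless_equal("McGowan", "MCGOWAN"))      # True
print(caseless_equal("Stra\u00dfe", "STRASSE"))  # True (one-to-many fold)

# Final and non-final sigma fold to the same character.
upper_word = "\u038c\u03a3\u039f\u03a3"
lower_word = "\u03cc\u03c3\u03bf\u03c2"
print(caseless_equal(upper_word, lower_word))    # True
```

For lookups, the folded string can be stored as the key so that folding happens once per string and later comparisons are plain binary comparisons.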

Generally, where case distinctions are not important, other distinctions between Unicode characters (in particular, compatibility distinctions) are ignored as well. In such circumstances, text can be normalized to Normalization Form KC or KD after case-folding, to produce a normalized form that erases both compatibility distinctions and case distinctions. (See UTR #15: Unicode Normalization Forms for more information.)
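A possible rendering of this two-step process in Python is shown below, using `str.casefold` followed by `unicodedata.normalize` with form NFKC. The function name `fold_compat` is hypothetical, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def fold_compat(text):
    """Case-fold, then erase compatibility distinctions with NFKC."""
    return unicodedata.normalize("NFKC", text.casefold())


# Fullwidth capitals U+FF26 U+FF29 U+FF2C U+FF25 fold to fullwidth small
# letters, which NFKC then maps to their ASCII counterparts.
print(fold_compat("\uff26\uff29\uff2c\uff25"))  # file

# The ligature U+FB01 is expanded by the case folding itself.
print(fold_compat("\ufb01le") == fold_compat("FILE"))  # True
```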


Copyright © 1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.