Unicode Technical Report #21

Case Mappings

Revision	3.0
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	1999-11-03
This Version	http://www.unicode.org/unicode/reports/tr21/tr21-3
Previous Version	http://www.unicode.org/unicode/reports/tr21/tr21-2
Latest Version	http://www.unicode.org/unicode/reports/tr21

Summary

This document presents implementation guidelines for case operations: case conversion, case detection, and caseless matching.

Status

This document contains informative material which has been considered and approved by the Unicode Technical Committee for publication as a Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the Standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative or as normative specification. Please mail corrigenda and other comments to the author.

The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions for more information.

1 Introduction
- 1.2 Reversibility
2 Guidelines

1 Introduction

Case is a normative property of characters in specific alphabets (Latin, Greek, Cyrillic, Armenian, and archaic Georgian) whereby characters are considered to be variants of a single letter. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lowercase letter (also known as small or minuscule). The uppercase letter is generally larger than the lowercase letter. Alphabets with case differences are called bicameral; those without are called unicameral.

Because of the inclusion of certain composite characters for compatibility, such as U+01F1 "DZ" LATIN CAPITAL LETTER DZ, there is a third case, called titlecase, which is used where the first character of a word is to be capitalized. An example of such a character is U+01F2 "Dz" LATIN CAPITAL LETTER D WITH SMALL LETTER Z. The three case forms are UPPERCASE, Titlecase, lowercase. Note that while the archaic Georgian script contained upper- and lowercase pairs, they are rarely used in modern Georgian.

The case mappings in the Unicode Character Database are informative, default mappings. Case itself, on the other hand, has normative status. Thus, for example, 0041 "A" is normatively uppercase, but its lowercase mapping to 0061 "a" is informative. The reason for this is that case can be considered to be an inherent property of a particular character (and is usually, but not always, derivable from the presence of the terms "CAPITAL" or "SMALL" in the character name), but case mappings between characters are occasionally influenced by local conventions.

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

Because of the inclusion of certain composite characters for compatibility, such as 01F1 "DZ" capital dz, there is a third case, called titlecase, which is used where the first letter of a word is to be capitalized (e.g. Titlecase vs. UPPERCASE or lowercase).
- For example, the titlecase of the example character is 01F2 "Dz" capital d with small z.
Case mappings may produce strings of different length than the original.
- For example, the German character 00DF "ß" small letter sharp s expands when uppercased to the sequence of two characters "SS". This also occurs where there is no precomposed character corresponding to a case mapping, such as with 0149 "ŉ" latin small letter n preceded by apostrophe.
Characters may also have different case mappings, depending on the context.
- For example, 03A3 "Σ" capital sigma lowercases to 03C3 "σ" small sigma if it is followed by another letter, but lowercases to 03C2 "ς" small final sigma if it is not.
Characters may have case mappings that depend on the locale.
- For example, in Turkish the letter 0049 "I" capital letter i lowercases to 0131 "ı" small dotless i.
Case mappings are not, in general, reversible.
- For example, once the string "McGowan" has been uppercased, lowercased or titlecased, the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation.
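
Several of these complications can be observed directly in a language whose string methods apply the full Unicode mappings. The following is a minimal sketch, assuming Python 3.3 or later, whose built-in `str.upper`, `str.lower`, and `str.title` are locale-independent. The variable names are illustrative only, and the results reflect the Unicode version bundled with the interpreter rather than Unicode 3.0.

```python
# Length change: one character uppercases to a two-character sequence.
sharp_s_upper = "\u00DF".upper()        # U+00DF sharp s

# No precomposed uppercase exists, so the mapping yields a sequence.
n_apostrophe_upper = "\u0149".upper()   # U+0149 n preceded by apostrophe

# Context sensitivity: a capital sigma at the end of a word
# lowercases to the final form U+03C2.
word_lower = "\u039F\u0394\u039F\u03A3".lower()

# Titlecase: U+01F1 has a titlecase form distinct from its uppercase.
dz_title = "\u01F1".title()

# Locale: these methods never apply the Turkish mapping,
# so "I" lowercases to U+0069 rather than U+0131.
i_lower = "I".lower()

print(sharp_s_upper, len(sharp_s_upper))
print([f"U+{ord(c):04X}" for c in n_apostrophe_upper])
print(f"U+{ord(word_lower[-1]):04X}", f"U+{ord(dz_title):04X}", i_lower)
```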

The Unicode Character Database contains two files with case mapping information. For compatibility with existing parsers, UnicodeData.txt only contains case mappings for characters where they are one-to-one mappings; it also omits information about locale-specific case mappings. Information about these special cases can be found in a separate data file, SpecialCasing.txt, which has been added to the Unicode Character Database starting with the 2.1.8 update. SpecialCasing.txt contains additional informative case mappings that either are not one-to-one or are context-sensitive. This data should be included when implementing case operations.
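
As a rough illustration of table-driven use of this data, the sketch below parses SpecialCasing.txt-style entries. It assumes a semicolon-separated layout of code; lower; title; upper; optional conditions; followed by a `#` comment. The `SAMPLE` lines, the `SpecialCase` type, and the function names are hypothetical, and the spelling of condition tags varies between versions of the file, so check the header of the actual data file before relying on this.

```python
from typing import NamedTuple


class SpecialCase(NamedTuple):
    code: str          # the source character
    lower: str         # lowercase mapping (may be several characters)
    title: str         # titlecase mapping
    upper: str         # uppercase mapping
    conditions: tuple  # e.g. a locale tag or a context tag; empty if none


def _to_text(field):
    """Convert space-separated hex code points, e.g. '0053 0053', to a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing(text):
    """Parse entries in the assumed layout, skipping comments and blank lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        code, lower, title, upper = (_to_text(f) for f in fields[:4])
        conditions = tuple(fields[4].split()) if len(fields) > 4 else ()
        entries.append(SpecialCase(code, lower, title, upper, conditions))
    return entries


# Illustrative sample lines in the assumed layout (not copied from a release).
SAMPLE = """\
# code; lower; title; upper; (conditions;) # comment
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
"""

entries = parse_special_casing(SAMPLE)
```

An implementation would typically consult these entries first and fall back to the one-to-one mappings from UnicodeData.txt when no special entry applies.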

1.2 Reversibility

It is important to note that casing operations do not always provide a round-trip mapping. For example,

upper(lower(“John Brown”)) → “JOHN BROWN”

lower(upper(“John Brown”)) → “john brown”.

There are even single words like vederLa in Italian or the name McGowan in English, which are neither upper, lower, nor titlecase. This format is sometimes called innerCaps, and is often used in programming and in Web names. There are also single characters that do not have reversible mappings, such as the sigmas above. Since many characters are really caseless (most of the IPA block, for example) and have no matching uppercase, uppercasing a string does not mean that it will no longer contain any lowercase letters.
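
The loss of information can be checked mechanically. The sketch below, assuming Python 3's built-in `str.upper`, `str.lower`, and `str.title`, tests whether any single case operation applied to a case-converted form restores the original string. The helper names `case_variants` and `recoverable` are hypothetical.

```python
def case_variants(text):
    """Return the upper-, lower-, and titlecased forms of text."""
    return {"upper": text.upper(), "lower": text.lower(), "title": text.title()}


def recoverable(text):
    """True if some case operation applied to a cased form yields text again."""
    forms = case_variants(text).values()
    return any(text in case_variants(form).values() for form in forms)


print("John Brown".lower().upper())   # not the original string
print(recoverable("John Brown"))      # titlecasing happens to restore it
print(recoverable("McGowan"))         # innerCaps cannot be restored
print(recoverable("vederLa"))
```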

For word processors that use a single command-key sequence to toggle the selection through different casings, it is recommended to save the original string and to return to it as the last step of the toggle cycle. The user interface would produce the following results in response to a series of command-keys. Notice that the original string is restored on every fourth keypress.

The quick brown

THE QUICK BROWN

the quick brown

The Quick Brown

The quick brown (repeating from here on)
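
One way to get this behavior is to keep the original string and always derive the displayed text from it, rather than from the previously displayed text. The following is a minimal sketch under that assumption; the `CaseToggler` class is hypothetical, and Python's `str.title` stands in for whatever titlecasing the application uses.

```python
class CaseToggler:
    """Cycle a selection through upper, lower, title, and the saved original."""

    # The identity function comes first so that every fourth toggle restores
    # the original text exactly.
    _OPERATIONS = (lambda s: s, str.upper, str.lower, str.title)

    def __init__(self, text):
        self._original = text  # saved once, never overwritten
        self._step = 0

    def toggle(self):
        """Advance one step and return the text to display."""
        self._step = (self._step + 1) % len(self._OPERATIONS)
        return self._OPERATIONS[self._step](self._original)


toggler = CaseToggler("The quick brown")
for _ in range(4):
    print(toggler.toggle())
```

Because each result is computed from the saved original, innerCaps strings such as "McGowan" also come back intact on the fourth keypress.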

Uppercase, titlecase, and lowercase can be represented in a word processor by using a character style. Removing the character style restores the text to its original state. However, if this approach is taken, any spell-checking software needs to be aware of the case style so that it can check the spelling according to the actual appearance.

2 Guidelines

There are a number of fine points in case operations that programmers need to be aware of in doing case conversion, case detection, and caseless matching. In all of the guidelines given below:

The case mappings specified by the Unicode Character Database are in the union of the UnicodeData.txt and SpecialCasing.txt files.
Treat 0345 combining iota subscript as a lowercase letter.
For pre-Unicode 3.0 case mappings, treat all letters whose canonical decomposition ends with iota subscript as titlecase letters (these are fixed in Unicode 3.0).
A character is cased if it is marked as uppercase, lowercase, or titlecase (Lu, Ll, Lt).
For any character that is Lu, but whose uppercase mapping is not the same as its titlecase mapping, treat it as a new subtype Lud (distinct-uppercase letter) for the purposes of this document. This can be algorithmically determined from the data in the Unicode Character Database.
When looking at preceding or following letters, disregard any intervening non-spacing marks.
There are a very small number of locale-specific case mappings: at this point essentially only ones for Turkish-language locales. Implementations may choose to hard-code the locale-specific treatment of these characters rather than use table-driven code based on the SpecialCasing.txt file. The same is true of the FINAL and NON_FINAL tags.
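
The distinct-uppercase (Lud) subtype can be derived from the data, as noted above. The sketch below is one way to do so, assuming Python's `unicodedata` module for the general category and single-character `str.upper` and `str.title` calls as stand-ins for the uppercase and titlecase mappings. The function names are hypothetical, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_distinct_uppercase(ch):
    """True if ch is Lu and its uppercase mapping differs from its titlecase mapping."""
    return unicodedata.category(ch) == "Lu" and ch.upper() != ch.title()


def find_distinct_uppercase(limit=0x10000):
    """Scan code points below limit (by default the BMP) for Lud characters."""
    return [chr(cp) for cp in range(limit) if is_distinct_uppercase(chr(cp))]


lud = find_distinct_uppercase()
print([f"U+{ord(c):04X}" for c in lud])
```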

2.1 Case Conversion of Strings

Converting to Uppercase

Map each character to its uppercase.

Remember to use special mappings (e.g. Turkish) for some locales.

Converting to Lowercase

Map each character to its lowercase.

Remember to handle FINAL and NON_FINAL mappings correctly (e.g. Greek CAPITAL SIGMA). If the letter is followed by a cased letter, choose the NON_FINAL form, otherwise choose the FINAL form.
Remember to use special mappings (e.g. Turkish) for some locales.
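
The lowercase rules above can be sketched as follows. This is an illustration only: it hard-codes the sigma and Turkish cases, as the guidelines permit, and relies on Python's single-character `str.lower` for every other mapping. The function names and the `turkish` flag are hypothetical. Note that it follows the rule as stated here, so it may differ from a library's built-in lowercasing for inputs such as an isolated capital sigma.

```python
import unicodedata

CAPITAL_SIGMA = "\u03A3"
SMALL_SIGMA = "\u03C3"        # NON_FINAL form
FINAL_SIGMA = "\u03C2"        # FINAL form
IOTA_SUBSCRIPT = "\u0345"     # treated as a lowercase letter per the guidelines


def is_cased(ch):
    """A character is cased if it is Lu, Ll, or Lt (or the iota subscript)."""
    return ch == IOTA_SUBSCRIPT or unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def followed_by_cased(s, i):
    """True if the next character after index i, ignoring non-spacing marks, is cased."""
    for ch in s[i + 1:]:
        if unicodedata.category(ch) == "Mn" and ch != IOTA_SUBSCRIPT:
            continue
        return is_cased(ch)
    return False


def to_lower(s, turkish=False):
    out = []
    for i, ch in enumerate(s):
        if ch == CAPITAL_SIGMA:
            out.append(SMALL_SIGMA if followed_by_cased(s, i) else FINAL_SIGMA)
        elif turkish and ch == "I":
            out.append("\u0131")   # dotless i
        elif turkish and ch == "\u0130":
            out.append("i")        # capital I with dot above
        else:
            out.append(ch.lower())
    return "".join(out)


print(to_lower("\u039F\u0394\u039F\u03A3"))   # ends in final sigma
print(to_lower("ISIK", turkish=True))
```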

Converting to Titlecase

Map each character to its titlecase or lowercase. If the preceding letter is cased, choose the lowercase mapping; otherwise choose the titlecase mapping (in most cases, this will be the same as the uppercase, but not always).

Remember to handle FINAL and NON_FINAL mappings correctly (e.g. Greek CAPITAL SIGMA). If the letter is followed by a cased letter, choose the NON_FINAL form, otherwise choose the FINAL form.
Remember to use special mappings (e.g. Turkish) for some locales.
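
A minimal sketch of the titlecase rule follows. It assumes Python's single-character `str.title` and `str.lower` as the titlecase and lowercase mappings, and it deliberately omits the FINAL/NON_FINAL and locale-specific handling to stay short. The function name `to_title` is hypothetical.

```python
import unicodedata

IOTA_SUBSCRIPT = "\u0345"  # treated as a lowercase letter per the guidelines


def is_cased(ch):
    return ch == IOTA_SUBSCRIPT or unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def to_title(s):
    out = []
    prev_cased = False
    for ch in s:
        # Non-spacing marks are copied through and do not change the context.
        if unicodedata.category(ch) == "Mn" and ch != IOTA_SUBSCRIPT:
            out.append(ch)
            continue
        # Lowercase after a cased letter; titlecase otherwise.
        out.append(ch.lower() if prev_cased else ch.title())
        prev_cased = is_cased(ch)
    return "".join(out)


print(to_title("the quick brown"))
print(to_title("McGOWAN"))
```

The U+01C6 test case in particular shows why the titlecase mapping, not the uppercase mapping, is used for the first letter: the result is U+01C5, not U+01C4.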

2.2 Case Detection for Strings

Detecting Uppercase

A string is uppercase if both the following conditions are true:

there is at least one cased character in the string
there are no titlecase or lowercase characters

Detecting Lowercase

A string is lowercase if both the following conditions are true:

there is at least one cased character in the string
there are no titlecase, uppercase, or distinct-uppercase characters

Detecting Titlecase

A string is titlecase if all four of the following conditions are true:

there is at least one cased character in the string
there are no distinct-uppercase (Lud) characters
any lowercase letters must follow cased characters
there are no titlecase or uppercase letters, except following uncased characters
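
The three detection tests can be sketched as below. This is an illustration under stated assumptions: the general category comes from Python's `unicodedata`, the Lud subtype is derived by comparing single-character `str.upper` and `str.title` results, and the start of the string is treated as if it followed an uncased character. All function names are hypothetical.

```python
import unicodedata

IOTA_SUBSCRIPT = "\u0345"  # treated as a lowercase letter per the guidelines


def kind(ch):
    """Return 'Lu', 'Ll', 'Lt', 'Lud', or None for uncased characters."""
    if ch == IOTA_SUBSCRIPT:
        return "Ll"
    cat = unicodedata.category(ch)
    if cat == "Lu" and ch.upper() != ch.title():
        return "Lud"
    return cat if cat in ("Lu", "Ll", "Lt") else None


def is_uppercase(s):
    kinds = [kind(c) for c in s]
    return any(kinds) and not any(k in ("Lt", "Ll") for k in kinds)


def is_lowercase(s):
    kinds = [kind(c) for c in s]
    return any(kinds) and not any(k in ("Lt", "Lu", "Lud") for k in kinds)


def is_titlecase(s):
    seen_cased = False
    prev_cased = False  # assumption: string start counts as an uncased context
    for ch in s:
        # Disregard intervening non-spacing marks.
        if unicodedata.category(ch) == "Mn" and ch != IOTA_SUBSCRIPT:
            continue
        k = kind(ch)
        if k == "Lud":
            return False
        if k == "Ll" and not prev_cased:
            return False
        if k in ("Lu", "Lt") and prev_cased:
            return False
        prev_cased = k is not None
        seen_cased = seen_cased or prev_cased
    return seen_cased


print(is_uppercase("JOHN BROWN"), is_lowercase("john brown"), is_titlecase("John Brown"))
print(is_titlecase("McGowan"))
```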

2.3 Caseless Matching

Caseless matching is commonly implemented using case-folding. The latter is the process of mapping strings to a normalized form where case differences are erased. Case-folding allows for fast caseless matches in lookups. Caseless matching itself is only an approximation to the language-specific rules governing the strength of comparisons. Where locale-sensitive case matching is used, this information can be derived from the collation data for the language, where only the first and second level differences are used. For more information, see UTR #10: Unicode Collation Algorithm.

However, in most environments, such as in file systems, text is not tagged with locale information. In such cases, the locale-specific mappings should not be used. Otherwise data structures, such as B-trees, might be built based on one set of case foldings, and used based on a different set, which will cause the B-trees to become corrupt. For those environments, a constant, locale-independent case folding should be used.

The CaseFolding.txt file can be used for doing such a locale-independent case folding. This file was generated from the Unicode Character Database, using both the one-to-one mappings and the one-to-many mappings. It folds all characters having different case forms together into a common form. To compare two strings for caseless matching, you can fold each string using this data, and then use a binary comparison.
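
In Python 3.3 and later, `str.casefold` provides a locale-independent folding of this kind, so the comparison just described can be sketched in a few lines. The function name `caseless_equal` is hypothetical, and the folding data is that of the Unicode version bundled with the interpreter.

```python
def caseless_equal(a, b):
    """Fold both strings, then compare the results code point by code point."""
    return a.casefold() == b.casefold()


print(caseless_equal("Stra\u00DFe", "STRASSE"))
print(caseless_equal("\u039F\u0394\u039F\u03A3", "\u03BF\u03B4\u03BF\u03C2"))
# The folding is locale-independent, so the Turkish dotless i is not
# matched with "I".
print(caseless_equal("I", "\u0131"))
```

For lookups, the folded string can be stored as the key, so that each string is folded once rather than on every comparison.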

Generally, where case distinctions are not important, other distinctions between Unicode characters (in particular, compatibility distinctions) are ignored as well. In such circumstances, text can be normalized to Normalization Form KC or KD after case-folding, to produce a normalized form that erases both compatibility distinctions and case distinctions. (See UTR #15: Unicode Normalization Forms for more information.)
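
A minimal sketch of this combination, assuming Python's `str.casefold` for the folding step and `unicodedata.normalize` for Normalization Form KC, is shown below. The function name `fold_compat` is hypothetical. Note that a compatibility decomposition can itself introduce cased letters, so a careful implementation may need to fold again after normalizing; this sketch does not.

```python
import unicodedata


def fold_compat(s):
    """Case-fold first, then normalize to NFKC, in the order described above."""
    return unicodedata.normalize("NFKC", s.casefold())


# Fullwidth Latin capitals compare equal to plain ASCII lowercase.
print(fold_compat("\uFF21\uFF22\uFF23") == fold_compat("abc"))
# The Roman numeral one (U+2160) compares equal to the letter "I".
print(fold_compat("\u2160") == fold_compat("I"))
```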

Copyright © 1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.