Unicode Technical Report #21

Case Mappings

Version	4.3
Authors	Mark Davis
Date
This Version	http://www.unicode.org/unicode/reports/tr21/tr21-4.3
Previous Version	http://www.unicode.org/unicode/reports/tr21/tr21-4.2
Latest Version	http://www.unicode.org/unicode/reports/tr21

Summary

This document presents implementation guidelines for case operations: case conversion, case detection, and caseless matching.

Status

This document contains informative material which has been considered and approved by the Unicode Technical Committee for publication as a Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the Standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative or as normative specification. Please mail corrigenda and other comments to the author.

The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions for more information.

1 Introduction
- 1.1 Reversibility
- 1.2 Data
  - 1.2.1 Context-Dependent Mappings
2 Guidelines
Modifications

1 Introduction

Case is a normative property of characters in specific alphabets (Latin, Greek, Cyrillic, Armenian, and archaic Georgian) whereby characters are considered to be variants of a single letter. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lowercase letter (also known as small or minuscule). The uppercase letter is generally larger than the lowercase letter. Alphabets with case differences are called bicameral; those without are called unicameral.

Because of the inclusion of certain composite characters for compatibility, such as U+01F1 "DZ" LATIN CAPITAL LETTER DZ, there is a third case, called titlecase, which is used where the first character of a word is to be capitalized. An example of such a character is: U+01F2 "Dz" LATIN CAPITAL LETTER D WITH SMALL LETTER Z.

Thus the three case forms are UPPERCASE, Titlecase, and lowercase.

Note: The term titlecase can also be used to refer to words where the first letter is an uppercase or titlecase letter, and the rest of the letters are lowercase. However, not all words in the title of a document or first words in a sentence will be titlecase.

The choice of which words to titlecase is language-dependent. For example, "Taming of the Shrew" would be the appropriate capitalization in English, not "Taming Of The Shrew". Moreover, the determination of what actually constitutes a word is also language-dependent. For example, l'arbre might be considered two words in French, while can't is considered one word in English.

Note that while the archaic Georgian script contained upper- and lowercase pairs, they are rarely used in modern Georgian.

The case mappings in the Unicode Character Database (UCD) are informative, default mappings. Case itself, on the other hand, has normative status. Thus, for example, U+0041 "A" is normatively uppercase, but its lowercase mapping to U+0061 "a" is informative. The reason for this is that case can be considered to be an inherent property of a particular character, but case mappings between characters are occasionally influenced by local conventions.

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

- In most cases, the titlecase is the same as the uppercase, but not always. For example, the titlecase of U+01F1 "DZ" capital dz is U+01F2 "Dz" capital d with small z.
- Case mappings may produce strings of different length than the original.
  - For example, the German character U+00DF "ß" small letter sharp s expands when uppercased to the sequence of two characters "SS". This also occurs where there is no precomposed character corresponding to a case mapping, such as with U+0149 "ŉ" latin small letter n preceded by apostrophe.
- There are some characters that require special handling, such as U+0345 combining iota subscript.
- Characters may also have different case mappings, depending on the context.
  - For example, U+03A3 "Σ" capital sigma lowercases to U+03C3 "σ" small sigma if it is followed by another letter, but lowercases to U+03C2 "ς" small final sigma if it is not.
- Characters may have case mappings that depend on the locale.
  - For example, in Turkish the letter U+0049 "I" capital letter i lowercases to U+0131 "ı" small dotless i.
- Since many characters are really caseless (most of the IPA block, for example) and have no matching uppercase, the process of uppercasing a string does not mean that it will no longer contain any lowercase letters.
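
Several of these complications can be observed directly with a string library that applies full case mappings. The following is a minimal sketch in Python, assuming an interpreter (CPython 3.3 or later) whose built-in str methods use the full mappings of a Unicode version much later than 3.0. The variable names are illustrative only. U+0138 (kra) is used here as a lowercase letter that is assumed to have no uppercase mapping in the interpreter's data.

```python
# Observing the complications above with Python's built-in str methods.
# Assumption: the interpreter applies full, locale-independent case mappings.

sharp_s = "\u00df"                    # ß small letter sharp s
n_apostrophe = "\u0149"               # ŉ small letter n preceded by apostrophe
dz_upper, dz_title = "\u01f1", "\u01f2"   # Ǳ capital dz, ǲ capital d with small z
odos = "\u039f\u0394\u039f\u03a3"     # ΟΔΟΣ, ends in capital sigma
kra = "\u0138"                        # ĸ lowercase letter without an uppercase form

# Case mappings may change the length of a string.
print(len(sharp_s), len(sharp_s.upper()))
print(n_apostrophe.upper() == "\u02bcN")

# Titlecase is not always the same as uppercase.
print(dz_upper.title() == dz_title, dz_upper.upper() == dz_upper)

# Context: a capital sigma at the end of a word lowercases to final sigma.
print(odos.lower() == "\u03bf\u03b4\u03bf\u03c2")

# Locale: str.lower() is locale-independent, so the Turkish mapping
# of "I" to dotless i is NOT applied here.
print("I".lower() == "i")

# Uppercasing does not guarantee that no lowercase letters remain.
print(kra.upper() == kra, kra.islower())
```

Locale-specific behavior such as the Turkish mapping has to come from a locale-aware library; the built-in methods used in this sketch never apply it.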

1.1 Reversibility

It is important to note that no casing operations are reversible. For example,

upper(lower(“John Brown”)) → “JOHN BROWN”

lower(upper(“John Brown”)) → “john brown”.

There are even single words like vederLa in Italian or the name McGowan in English, which are neither upper, lower, nor titlecase. This format is sometimes called innerCaps, and is often used in programming and in Web names. Once the string "McGowan" has been uppercased, lowercased or titlecased, the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation. There are also single characters that do not have reversible mappings, such as the Greek sigmas above.
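
As a small illustration of this point, the sketch below checks whether any pair of successive casing operations returns the original string. It uses Python's built-in str.upper, str.lower, and str.title as stand-ins for the operations discussed here; the helper name recoverable is hypothetical.

```python
# A minimal sketch: can a string be recovered by applying a second casing
# operation after a first one? Python's built-in methods stand in for the
# uppercase, lowercase, and titlecase operations.

def recoverable(original):
    """Return True if some second casing operation undoes some first one."""
    ops = (str.upper, str.lower, str.title)
    return any(second(first(original)) == original
               for first in ops for second in ops)

print(recoverable("McGowan"))      # innerCaps string
print(recoverable("john brown"))   # already all lowercase

# A single character whose mapping is not reversible: final sigma.
sigma_round_trip = "\u03c2".upper().lower()
print(sigma_round_trip == "\u03c2")
```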

For word processors that use a single command-key sequence to toggle the selection through different casings, it is recommended to save the original string, and return to it in the sequence of keys. The user interface would produce the following results in response to a series of command-keys. Notice that the original string is restored every fourth time.

The quick brown

THE QUICK BROWN

the quick brown

The Quick Brown

The quick brown (repeating from here on)
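
The toggle behavior shown above can be sketched as follows. This is only an illustration under stated assumptions: the class name CaseToggler and its method are hypothetical, and Python's built-in casing methods stand in for the editor's casing operations. The essential point is that every result is computed from the saved original rather than from the previous result.

```python
class CaseToggler:
    """Cycle a selection through UPPER, lower, Title, and back to the original."""

    def __init__(self, original):
        self.original = original   # saved so that it can be restored exactly
        self.step = 0              # 0 means the original text is showing

    def toggle(self):
        """Advance one step and return the text to display."""
        self.step = (self.step + 1) % 4
        if self.step == 0:
            return self.original
        operation = (str.upper, str.lower, str.title)[self.step - 1]
        # Always convert the saved original, never the previous result.
        return operation(self.original)


toggler = CaseToggler("The quick brown")
for _ in range(4):
    print(toggler.toggle())
```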

Uppercase, titlecase, and lowercase can be represented in a word processor by using a character style. Removing the character style restores the text to its original state. However, if this approach is taken, any spell-checking software needs to be aware of the case style so that it can check the spelling according to the actual appearance.

1.2 Data

The Unicode Character Database contains three files with case mapping information.

UnicodeData.txt	Contains the case mappings that map to a single character. These do not increase the length of strings, and do not contain context-dependent mappings. Only legacy implementations that cannot handle case mappings that increase string lengths use UnicodeData case mappings alone. The single-character mappings are insufficient for languages such as German.
SpecialCasing.txt	Contains additional case mappings that map to more than one character, such as "ß" to "SS". It also contains context-dependent mappings, with flags to distinguish them from the normal mappings. There are some characters that have a "best" single-character mapping in UnicodeData and also have a full mapping in SpecialCasing.
CaseFolding.txt	Contains data for performing locale-independent case-folding, as described in 2.3 Caseless Matching.

A set of charts that show the Unicode 3.0 case mappings is also available online. The index page is ordered by general category and script. The codepoints are sorted by lowercased NFKC, to place related characters next to one another.

The full case mappings for Unicode characters are obtained by using the mappings from SpecialCasing plus the mappings from UnicodeData, excluding any latter mappings that would conflict. Any character that does not have a mapping in these files is considered to map to itself. In this document, the full case mappings are referred to as UCD_lower(x), UCD_title(x), and UCD_upper(x).
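
The combination rule can be sketched with two small lookup tables. The tables below are tiny hand-copied excerpts for illustration, taken from a recent version of the UCD rather than from Unicode 3.0, and the function names are hypothetical. A real implementation would parse UnicodeData.txt and SpecialCasing.txt and would also handle the conditional entries.

```python
# Illustrative excerpts only (assumed from a recent UCD, unconditional entries).
SIMPLE_UPPER = {"a": "A", "\u01c6": "\u01c4"}            # UnicodeData.txt
SPECIAL_UPPER = {"\u00df": "SS", "\u0149": "\u02bcN"}    # SpecialCasing.txt

SIMPLE_LOWER = {"A": "a", "\u0130": "i"}                 # UnicodeData.txt
SPECIAL_LOWER = {"\u0130": "i\u0307"}                    # SpecialCasing.txt


def full_mapping(ch, special, simple):
    """SpecialCasing wins on conflict; unmapped characters map to themselves."""
    if ch in special:
        return special[ch]
    return simple.get(ch, ch)


def ucd_upper(ch):
    return full_mapping(ch, SPECIAL_UPPER, SIMPLE_UPPER)


def ucd_lower(ch):
    return full_mapping(ch, SPECIAL_LOWER, SIMPLE_LOWER)


print(ucd_upper("\u00df"))          # full mapping from SpecialCasing
print(ucd_lower("\u0130") == "i")   # the simple mapping is overridden
print(ucd_upper("?"))               # no mapping: maps to itself
```

U+0130 shows the conflict case: it has a single-character lowercase mapping in UnicodeData and a longer full mapping in SpecialCasing, and the latter is the one used.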

1.2.1 Context-Dependent Mappings

The context-dependent case mappings are used in all of these functions, although they affect very few characters:

If a mapping is locale-specific (such as Turkish), use it for that locale (unless the process is locale-independent).
If a mapping is marked by FINAL, use it when the character is not followed by a cased character.
If a mapping is marked by NON_FINAL, use it when the character is followed by a cased character.
If a mapping is marked by AFTER_i, use it when the previous base character is a lowercase i.
Otherwise, use the normal mapping.
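
As an example of hard-coding one of these conditions, the sketch below lowercases a string while applying the FINAL and NON_FINAL rules to U+03A3 capital sigma exactly as stated above, skipping intervening non-spacing marks. It is a simplified illustration: the helper names are hypothetical, is_cased is only a rough approximation of the cased test, and Python's built-in per-character lower() supplies the normal mappings.

```python
import unicodedata

CAPITAL_SIGMA = "\u03a3"
SMALL_SIGMA = "\u03c3"
FINAL_SIGMA = "\u03c2"


def is_cased(ch):
    """Rough stand-in for 'cased': a cased letter category or any case mapping."""
    return (unicodedata.category(ch) in ("Lu", "Ll", "Lt")
            or ch.lower() != ch or ch.upper() != ch)


def lower_with_sigma(text):
    """Lowercase text, choosing the sigma form from the following context."""
    out = []
    for i, ch in enumerate(text):
        if ch != CAPITAL_SIGMA:
            out.append(ch.lower())          # normal mapping
            continue
        j = i + 1
        # Disregard intervening non-spacing marks when looking ahead.
        while j < len(text) and unicodedata.category(text[j]) == "Mn":
            j += 1
        followed_by_cased = j < len(text) and is_cased(text[j])
        out.append(SMALL_SIGMA if followed_by_cased else FINAL_SIGMA)
    return "".join(out)


print(lower_with_sigma("\u039f\u0394\u039f\u03a3"))   # ΟΔΟΣ
print(lower_with_sigma("\u03a3\u039f\u03a3"))         # ΣΟΣ
```

Code of this kind is exactly what has to be rechecked against the data files whenever the implementation moves to a new version of Unicode.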

Because there are very few context-dependent case mappings, implementations may choose to hard-code the treatment of these characters rather than use data-driven code based on the UCD. When this is done, every time the implementation is upgraded to a new version of Unicode, the code must be checked for consistency with the updated data.

2 Guidelines

There are a number of fine points in case operations that programmers need to be aware of in doing case conversion, case detection, and caseless matching. In all of the guidelines given below, when looking at preceding or following letters, disregard any intervening non-spacing marks.

Detection of case requires more than just the general category values (Lu, Lt, Ll). Each Unicode character x is assigned to one of the following sets, based on the general category and case mappings.

lower:
- x is Ll, or
- x = combining iota subscript, or
- there is some y such that UCD_lower(y) = x
title:
- x is Lt, or
- there is some y such that UCD_title(y) = x
upper:
- UCD_title(x) = UCD_upper(x), and
  - x is Lu, or
  - there is some y such that UCD_upper(y) = x
uniqueUpper:
- UCD_title(x) ≠ UCD_upper(x), and
  - x is Lu, or
  - there is some y such that UCD_upper(y) = x
- Example: "DZ" is uniqueUpper, while "D" is upper
uncased: every other character

A character is called cased if it is lower, title, upper, or uniqueUpper (that is, not uncased).
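
A rough sketch of this classification is given below. It makes several assumptions that are not part of this report: Python's built-in lower(), title(), and upper() stand in for UCD_lower, UCD_title, and UCD_upper (using the interpreter's Unicode version, not 3.0); only single-character images of a different character y are considered; and, because the sets as literally stated can overlap (for example, "D" is the titlecase of "d"), overlaps are resolved by testing lower first, then upper or uniqueUpper, then title. The upper and uniqueUpper tests follow the example above, in which "DZ" is uniqueUpper and "D" is upper. The function names are hypothetical.

```python
import sys
import unicodedata


def _single_char_images():
    """Collect characters that are the one-character image of a different character."""
    lower_img, title_img, upper_img = set(), set(), set()
    for cp in range(sys.maxunicode + 1):
        y = chr(cp)
        for mapped, image in ((y.lower(), lower_img),
                              (y.title(), title_img),
                              (y.upper(), upper_img)):
            if mapped != y and len(mapped) == 1:
                image.add(mapped)
    return lower_img, title_img, upper_img


# Built once; scanning every code point takes a moment.
LOWER_IMG, TITLE_IMG, UPPER_IMG = _single_char_images()


def case_class(x):
    """Assign x to lower, upper, uniqueUpper, title, or uncased (order is an assumption)."""
    cat = unicodedata.category(x)
    if cat == "Ll" or x == "\u0345" or x in LOWER_IMG:
        return "lower"
    if cat == "Lu" or x in UPPER_IMG:
        return "uniqueUpper" if x.title() != x.upper() else "upper"
    if cat == "Lt" or x in TITLE_IMG:
        return "title"
    return "uncased"


for sample in ("d", "D", "\u01f1", "\u01f2", "\u2160", "1"):
    print(f"U+{ord(sample):04X}", case_class(sample))
```

U+2160 roman numeral one shows why the general category alone is not enough: it is not a letter, yet it is the uppercase of U+2170 and so counts as cased.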

2.1 Case Conversion of Strings

Converting to Uppercase

Map each character x to UCD_upper(x).

Remember to use the context-dependent mappings.

Converting to Lowercase

Map each character x to UCD_lower(x).

Remember to use the context-dependent mappings above.

Converting to Titlecase

Map each character x based on the preceding character. If that character is cased, use UCD_lower(x), otherwise UCD_title(x).

Remember to use the context-dependent mappings above, and consider the titlecase caveats.
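
The titlecase rule can be sketched as follows. This is a simplified illustration, not a complete implementation: Python's per-character title() and lower() stand in for UCD_title and UCD_lower, the context-dependent mappings are not applied, is_cased is a rough approximation of the cased test, and the function names are hypothetical. Non-spacing marks are copied through without changing the notion of the preceding character.

```python
import unicodedata


def is_cased(ch):
    """Rough stand-in for 'cased': a cased letter category or any case mapping."""
    return (unicodedata.category(ch) in ("Lu", "Ll", "Lt")
            or ch.lower() != ch or ch.upper() != ch)


def to_titlecase(text):
    """Titlecase after an uncased character, lowercase after a cased one."""
    out = []
    prev_cased = False   # the start of the string counts as an uncased context
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            out.append(ch)           # marks do not change the context
            continue
        out.append(ch.lower() if prev_cased else ch.title())
        prev_cased = is_cased(ch)
    return "".join(out)


print(to_titlecase("the quick brown"))
print(to_titlecase("JOHN BROWN"))
print(to_titlecase("can't"))
```

The last call produces "Can'T", because the apostrophe is uncased and so the following letter is treated as the start of a word. This is one of the reasons the choice of word boundaries is language-dependent.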

2.2 Case Detection for Strings

Detecting Uppercase

A string is uppercase if both the following conditions are true:

there is at least one cased character in the string
there are no title or lower characters

Detecting Lowercase

A string is lowercase if both the following conditions are true:

there is at least one cased character in the string
there are no title, upper, or uniqueUpper characters

Detecting Titlecase

A string is titlecase if all of the following conditions are true:

there is at least one cased character in the string
there are no uniqueUpper characters
each character is
- title or upper, if the preceding character is uncased.
- lower, if the preceding character is cased.

See the titlecase caveats for more information.
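
The three detection tests can be sketched as below. To keep the example short, the classifier here is a deliberate simplification that uses only the general category plus a titlecase/uppercase comparison, so it omits the mapping-based clauses of the set definitions; the function names are hypothetical. The titlecase test is applied to cased characters only, which is this sketch's reading of the conditions above, and non-spacing marks are skipped.

```python
import unicodedata


def case_class(ch):
    """Simplified classifier (assumption): general category only."""
    cat = unicodedata.category(ch)
    if cat == "Ll":
        return "lower"
    if cat == "Lt":
        return "title"
    if cat == "Lu":
        # Characters such as capital DZ have a distinct titlecase form.
        return "uniqueUpper" if ch.title() != ch.upper() else "upper"
    return "uncased"


def is_uppercase(text):
    classes = [case_class(ch) for ch in text]
    return (any(c != "uncased" for c in classes)
            and not any(c in ("title", "lower") for c in classes))


def is_lowercase(text):
    classes = [case_class(ch) for ch in text]
    return (any(c != "uncased" for c in classes)
            and not any(c in ("title", "upper", "uniqueUpper") for c in classes))


def is_titlecase(text):
    seen_cased = prev_cased = False
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            continue                      # disregard non-spacing marks
        c = case_class(ch)
        if c == "uniqueUpper":
            return False
        if c != "uncased":
            seen_cased = True
            if prev_cased and c != "lower":
                return False
            if not prev_cased and c not in ("title", "upper"):
                return False
        prev_cased = c != "uncased"
    return seen_cased


print(is_uppercase("JOHN BROWN 42"), is_uppercase("42"))
print(is_lowercase("john brown"))
print(is_titlecase("John Brown"), is_titlecase("McGowan"))
```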

2.3 Caseless Matching

Caseless matching is commonly implemented using case-folding. The latter is the process of mapping strings to a canonical form where case differences are erased. Case-folding allows for fast caseless matches in lookups, since only binary comparison is required. Case-folding is more than just conversion to lowercase. For example, it handles cases such as the Greek sigma, so that "Μάϊος" and "ΜΆΪΟΣ" will match correctly.

Note: normally the original source string is not replaced by the folded string, since that may erase important information. For example, the name "Marco di Silva" would be folded to "marco di silva", losing the information as to which letters are capitalized. What is typically done is that the original string is stored along with a case-folded version for fast comparisons.

The CaseFolding.txt file in the Unicode Character Database is used for performing locale-independent case-folding. This file is generated from the case mappings in the Unicode Character Database, using both the single-character mappings and the multi-character mappings. It folds all characters having different case forms together into a common form. To compare two strings for caseless matching, you can fold each string using this data, and then use a binary comparison.
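
A minimal sketch of this approach in Python follows. It assumes that the built-in str.casefold() method, which applies locale-independent full case folding from the interpreter's version of the Unicode data, is an acceptable stand-in for folding with CaseFolding.txt. The class CaselessIndex is hypothetical and stores each original string next to its folded key, so that the original capitalization is not lost.

```python
class CaselessIndex:
    """Look up strings caselessly while keeping the original spelling."""

    def __init__(self):
        self._by_folded = {}

    def add(self, name):
        # Store the original under its folded form; the first spelling wins.
        self._by_folded.setdefault(name.casefold(), name)

    def find(self, query):
        # Fold the query the same way, then do a plain (binary) lookup.
        return self._by_folded.get(query.casefold())


index = CaselessIndex()
index.add("Marco di Silva")
print(index.find("MARCO DI SILVA"))

# Folding is more than lowercasing.
maios_lower = "\u039c\u03ac\u03ca\u03bf\u03c2"    # Μάϊος
maios_upper = "\u039c\u0386\u03aa\u039f\u03a3"    # ΜΆΪΟΣ
print(maios_lower.casefold() == maios_upper.casefold())
print("Stra\u00dfe".lower() == "STRASSE".lower())
print("Stra\u00dfe".casefold() == "STRASSE".casefold())
```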

For those concerned with the details, case-folding logically involves a set of equivalence classes, constructed from the Unicode Character Database case mappings as follows.

For each character X in Unicode:

1. If X is already in an equivalence class, continue to next character.
2. Otherwise, form a new equivalence class, and add X.
3. Then add whatever upper-, lower- or titlecases to anything in the set.
4. Then add whatever anything in the set upper-, lower- or titlecases to.
5. Repeat #3 and #4 until nothing further is added.
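
The construction can be sketched over a small repertoire as follows. This is an illustration under stated assumptions, not the procedure used to generate CaseFolding.txt: it uses only the single-character results of Python's built-in lower(), upper(), and title(), it searches for "whatever maps to the set" only within the repertoire passed in, and it does not choose a representative for each class. The function names are hypothetical.

```python
def single_maps(ch):
    """Single-character results of lowercasing, uppercasing, and titlecasing ch."""
    return {m for m in (ch.lower(), ch.upper(), ch.title()) if len(m) == 1}


def equivalence_classes(repertoire):
    """Build case equivalence classes for the characters in repertoire."""
    chars = list(dict.fromkeys(repertoire))    # de-duplicate, keep order
    assigned = set()
    classes = []
    for x in chars:
        if x in assigned:
            continue                            # already in a class
        cls = {x}                               # form a new class and add X
        changed = True
        while changed:                          # repeat until nothing is added
            changed = False
            for y in chars:                     # whatever maps into the set
                if y not in cls and single_maps(y) & cls:
                    cls.add(y)
                    changed = True
            for member in list(cls):            # whatever the set maps to
                for target in single_maps(member):
                    if target not in cls:
                        cls.add(target)
                        changed = True
        assigned |= cls
        classes.append(cls)
    return classes


# K, k, KELVIN SIGN, the three sigmas, and A.
for cls in equivalence_classes("Kk\u212a\u03a3\u03c3\u03c2A"):
    print(sorted(f"U+{ord(c):04X}" for c in cls))
```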

Each equivalence class is completely disjoint from all the others, and together they form a partition of the entire Unicode code space. From each class, one representative element (a single lowercase letter where possible) is chosen to be the common form. CaseFolding.txt thus contains the mappings from other characters in the equivalence classes to their common forms.

Generally, where case distinctions are not important, other distinctions between Unicode characters (in particular, compatibility distinctions) are ignored as well. In such circumstances, text can be normalized to Normalization Form KC or KD after case-folding, to produce a normalized form that erases both compatibility distinctions and case distinctions. (See UTR #15: Unicode Normalization Forms for more information.) However, such normalization should generally only be done on a restricted repertoire, such as identifiers (alphanumerics).
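
For identifier-like text, the combination described above can be sketched as below. The function name is hypothetical, and the sketch assumes that Python's str.casefold() and unicodedata.normalize() are acceptable stand-ins for the folding and normalization data. It applies the two steps once, in the order given in the text, and makes no claim that the result is stable under repeated application.

```python
import unicodedata


def caseless_compat_key(identifier):
    """Fold case, then erase compatibility distinctions with NFKC."""
    return unicodedata.normalize("NFKC", identifier.casefold())


# FULLWIDTH LATIN CAPITAL LETTER A followed by ASCII letters.
print(caseless_compat_key("\uff21BC"))

# LATIN SMALL LIGATURE FI compared with plain uppercase letters.
print(caseless_compat_key("\ufb01le") == caseless_compat_key("FILE"))
```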

Caseless matching itself is only an approximation to the language-specific rules governing the strength of comparisons. Where locale-sensitive case matching is used, this information can be derived from the collation data for the language, where only the first and second level differences are used. For more information, see UTR #10: Unicode Collation Algorithm.

However, in most environments, such as in file systems, text is not and cannot be tagged with locale information. In such cases, the locale-specific mappings must not be used. Otherwise, data structures such as B-trees might be built based on one set of case-foldings and used based on a different set, which will cause those data structures to become corrupt. For such environments, a constant, locale-independent case-folding is required.

Modifications

The following summarizes modifications from the previous versions of this document.

4.3	Defined the sets lower, title, upper, and uniqueUpper instead of relying on the general category. Introduced UCD_title, UCD_upper, UCD_lower notation. Reordered sections of text for clarity. Minor editing.
4.2	Fixed pointer for CaseFolding.txt to point to the UCD. Added text to describe the CaseFolding.txt generation in terms of equivalence classes. Added Modification section. Minor editing.

Copyright © 1999-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.