L2/99-190

PROPOSED DRAFT Unicode Technical Report #21
Case Mappings

Revision	1
Authors	Mark Davis (mark@unicode.org)
Date	1999-04-30
This Version	n/a
Previous Version	n/a
Latest Version	n/a
Unicode Technical Reports	http://www.unicode.org/unicode/techreports.html

Summary

This document presents guidelines for case operations: case conversion, case detection, and caseless matching.

Status of this document
This document is an unpublished, preliminary working draft. It is posted for general review. At its next meeting, the Unicode Technical Committee (UTC) may reject this document, review it for suitability to progress to draft status, and/or amend it further. Please mail any comments to the authors.

Introduction

The case mappings in the Unicode Character Database are informative, default mappings. Case itself, on the other hand, has normative status. Thus, for example, 0041 "A" is normatively uppercase, but its lowercase mapping to 0061 "a" is informative. The reason for this is that case can be considered to be an inherent property of a particular character (and is usually, but not always, derivable from the presence of the terms "CAPITAL" or "SMALL" in the character name), but case mappings between characters are occasionally influenced by local conventions.

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

Because of the inclusion of certain composite characters for compatibility, such as 01F1 "DZ" capital dz, there is a third case, called titlecase, which is used where the first letter of a word is to be capitalized (e.g. Titlecase, vs. UPPERCASE, or lowercase). For example, the titlecase of 01F1 is 01F2 "Dz" capital d with small z.
Case mappings may produce strings of a different length than the original. For example, the German character 00DF "ß" small letter sharp s expands when uppercased to the sequence of two characters "SS". This also occurs where there is no precomposed character corresponding to a case mapping.
Characters may also have different case mappings, depending on the context.
For example, 03A3 "Σ" capital sigma lowercases to 03C3 "σ" small sigma if it is followed by another letter, but lowercases to 03C2 "ς" small final sigma if it is not.
Characters may have case mappings that depend on the locale.
For example, in Turkish the letter 0049 "I" capital letter i lowercases to 0131 "ı" small dotless i.
Case mappings are not, in general, reversible. For example, once the string "McGowan" has been upper, lower or titlecased, the original cannot be recovered by applying another upper, lower, or titlecase operation.
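
Several of these complications can be observed directly with Python's built-in string methods. The following is only an illustrative sketch: it assumes a Python 3 interpreter whose str.upper(), str.lower(), and str.title() apply the full default Unicode mappings, and all variable names are hypothetical.

```python
import unicodedata

# A third case: U+01F1 (DZ) has the titlecase form U+01F2 (Dz), category Lt.
dz_title = "\u01F1".title()
dz_title_category = unicodedata.category(dz_title)

# Length can change: sharp s uppercases to the two-character sequence "SS".
strasse_upper = "stra\u00DFe".upper()

# Context sensitivity: a capital sigma at the end of a word lowercases
# to the final form U+03C2 rather than U+03C3.
odos_lower = "\u039F\u0394\u039F\u03A3".lower()

# Locale sensitivity: the built-in mapping is locale-independent, so the
# Turkish dotless i (U+0131) is never produced from "I".
i_lower = "I".lower()

# Not reversible: the internal capital of "McGowan" is lost.
round_trip = "McGowan".upper().title()
```

Because the built-in methods know nothing about the locale, Turkish behavior has to be layered on top by the caller.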

For compatibility with existing parsers, UnicodeData.txt only contains case mappings for characters where they are one-to-one mappings; it also omits information about locale-specific case mappings. Information about these special cases can be found in a separate data file, SpecialCasing.txt, which has been added starting with the 2.1.8 update to the Unicode data files. SpecialCasing.txt contains additional informative case mappings that are either not one-to-one or context-sensitive. This data should be included when implementing case operations.
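
As a rough illustration of how such a file might be loaded, the sketch below parses semicolon-separated records into a lookup table. It assumes the field layout of later published versions of SpecialCasing.txt (code; lower; title; upper; optional conditions; then a # comment), which may differ from the 2.1.8 file. The names SpecialCase, parse_special_casing, and SAMPLE are hypothetical, and the sample lines stand in for the real file.

```python
from typing import NamedTuple


class SpecialCase(NamedTuple):
    lower: str
    title: str
    upper: str
    conditions: list


def hex_to_str(field):
    """Convert a field such as '0053 0073' to the string 'Ss'."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_special_casing(lines):
    """Build {character: [SpecialCase, ...]} from the assumed record layout."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        code, lower, title, upper = fields[:4]
        # The fifth field, if present and non-empty, lists conditions
        # such as a locale tag.
        conditions = fields[4].split() if len(fields) > 4 else []
        entry = SpecialCase(hex_to_str(lower), hex_to_str(title),
                            hex_to_str(upper), conditions)
        table.setdefault(hex_to_str(code), []).append(entry)
    return table


# Sample records in the assumed layout (not the complete file).
SAMPLE = [
    "# comment-only lines are skipped",
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S",
    "0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE",
]

table = parse_special_casing(SAMPLE)
```

A case conversion routine would consult this table first and fall back to the one-to-one mappings from UnicodeData.txt when a character has no entry.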

Guidelines

There are a number of fine points in case operations that programmers need to be aware of, especially with regard to titlecase. In all of the guidelines given below:

Start with the case mappings found in both the UnicodeData.txt and SpecialCasing.txt files.
Treat 0345 iota subscript as a lowercase letter.
Treat all letters whose canonical decomposition ends with iota subscript as titlecase letters.
For any character that is Lu, but whose uppercase mapping is not the same as its titlecase mapping, assign it a new type Ld (distinct-uppercase).
A character is cased if it is marked as uppercase, lowercase, titlecase, or distinct-uppercase (Lu, Ll, Lt, Ld).
When looking at preceding or following letters, disregard any intervening non-spacing marks.
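
The classification rules above can be sketched as follows. This is a minimal illustration, assuming Python's unicodedata module as the source of general categories and decompositions, and using str.upper() and str.title() on a single character as stand-ins for the uppercase and titlecase mappings. The names case_class, is_cased, and next_unmarked are hypothetical.

```python
import unicodedata

IOTA_SUBSCRIPT = "\u0345"


def case_class(ch):
    """Return 'Lu', 'Ll', 'Lt', 'Ld', or None for an uncased character."""
    if ch == IOTA_SUBSCRIPT:
        return "Ll"  # iota subscript is treated as a lowercase letter
    cat = unicodedata.category(ch)
    if not cat.startswith("L"):
        return None
    # Letters whose canonical decomposition ends with iota subscript
    # are treated as titlecase.
    if unicodedata.normalize("NFD", ch).endswith(IOTA_SUBSCRIPT):
        return "Lt"
    # Lu with an uppercase mapping distinct from its titlecase mapping
    # becomes the new type Ld (distinct-uppercase).
    if cat == "Lu" and ch.upper() != ch.title():
        return "Ld"
    return cat if cat in ("Lu", "Ll", "Lt") else None


def is_cased(ch):
    return case_class(ch) is not None


def next_unmarked(s, i):
    """Return the first character after index i that is not a
    non-spacing mark, or None if there is none."""
    for ch in s[i + 1:]:
        if unicodedata.category(ch) == "Mn" and ch != IOTA_SUBSCRIPT:
            continue
        return ch
    return None
```

For example, case_class("\u01F1") yields "Ld" because that character uppercases to itself but titlecases to 01F2.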

Case Conversion

toUppercase(String s)

Map each character to its uppercase.
Remember to use special mappings (e.g. Turkish) for some locales.
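
A minimal sketch of these two steps, assuming Python's str.upper() supplies the default (including one-to-many) uppercase mappings. The function name to_uppercase and the locale parameter with the value "tr" are hypothetical conventions chosen for this example.

```python
def to_uppercase(s, locale=None):
    """Uppercase s, applying the Turkish dotted-i mapping when requested."""
    if locale == "tr":
        # In Turkish, small i uppercases to U+0130 capital I with dot above.
        # Small dotless i (U+0131) already uppercases to "I" by default.
        s = s.replace("i", "\u0130")
    return s.upper()
```

For example, to_uppercase("istanbul", "tr") keeps the dot on the initial capital, while to_uppercase("istanbul") does not.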

toLowercase(String s)

Map each character to its lowercase.
Remember to handle FINAL and NON_FINAL mappings correctly (e.g. Greek CAPITAL SIGMA). If the letter is followed by a cased letter, choose the NON_FINAL form; otherwise choose the FINAL form.
Remember to use special mappings (e.g. Turkish) for some locales.
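
The lowercase rules above might be sketched as follows. This assumes Python's unicodedata categories for deciding what is cased, applies the FINAL / NON_FINAL choice exactly as stated here (looking only at the following letter), and uses a hypothetical locale parameter with the value "tr" for Turkish. All function names are illustrative.

```python
import unicodedata

IOTA_SUBSCRIPT = "\u0345"
CAPITAL_SIGMA, SMALL_SIGMA, FINAL_SIGMA = "\u03A3", "\u03C3", "\u03C2"


def is_cased(ch):
    return ch == IOTA_SUBSCRIPT or unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def followed_by_cased(s, i):
    """True if the next character after index i, ignoring non-spacing
    marks, is cased."""
    for ch in s[i + 1:]:
        if unicodedata.category(ch) == "Mn" and ch != IOTA_SUBSCRIPT:
            continue
        return is_cased(ch)
    return False


def to_lowercase(s, locale=None):
    out = []
    for i, ch in enumerate(s):
        if ch == CAPITAL_SIGMA:
            # NON_FINAL form before a cased letter, FINAL form otherwise.
            out.append(SMALL_SIGMA if followed_by_cased(s, i) else FINAL_SIGMA)
        elif locale == "tr" and ch == "I":
            out.append("\u0131")  # Turkish: capital I to small dotless i
        elif locale == "tr" and ch == "\u0130":
            out.append("i")       # Turkish: capital dotted I to small i
        else:
            out.append(ch.lower())
    return "".join(out)
```

Note that under this rule an isolated capital sigma lowercases to the final form, since nothing cased follows it.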

toTitlecase(String s)

Map each character to its titlecase or lowercase. If the preceding letter is cased, choose the lowercase mapping; otherwise choose the titlecase mapping (in most cases, this will be the same as the uppercase, but not always).
Remember to handle FINAL and NON_FINAL mappings correctly (e.g. Greek CAPITAL SIGMA). If the letter is followed by a cased letter, choose the NON_FINAL form; otherwise choose the FINAL form.
Remember to use special mappings (e.g. Turkish) for some locales.
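
A sketch of the titlecase procedure under the same assumptions: Python's str.title() and str.lower() applied to a single character stand in for the titlecase and lowercase mappings, non-spacing marks are skipped when tracking the preceding letter, and the locale value "tr" is a hypothetical convention. All function names are illustrative.

```python
import unicodedata

IOTA_SUBSCRIPT = "\u0345"


def is_mark(ch):
    return unicodedata.category(ch) == "Mn" and ch != IOTA_SUBSCRIPT


def is_cased(ch):
    return ch == IOTA_SUBSCRIPT or unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def followed_by_cased(s, i):
    for ch in s[i + 1:]:
        if not is_mark(ch):
            return is_cased(ch)
    return False


def lower_char(s, i, locale):
    ch = s[i]
    if ch == "\u03A3":  # capital sigma: NON_FINAL or FINAL form
        return "\u03C3" if followed_by_cased(s, i) else "\u03C2"
    if locale == "tr" and ch == "I":
        return "\u0131"
    if locale == "tr" and ch == "\u0130":
        return "i"
    return ch.lower()


def title_char(ch, locale):
    if locale == "tr" and ch == "i":
        return "\u0130"
    return ch.title()


def to_titlecase(s, locale=None):
    out = []
    prev_cased = False
    for i, ch in enumerate(s):
        if is_mark(ch):
            out.append(ch)  # marks pass through and do not affect context
            continue
        if prev_cased:
            out.append(lower_char(s, i, locale))
        else:
            out.append(title_char(ch, locale))
        prev_cased = is_cased(ch)
    return "".join(out)
```

Because the decision depends only on whether the preceding letter is cased, any uncased character (a space, a digit, punctuation) starts a new titlecased run.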

Case Detection

isLowercase(String s)

there is at least one cased character
there are no titlecase, uppercase, or distinct-uppercase characters

isUppercase(String s)

there is at least one cased character
there are no titlecase or lowercase characters

isTitlecase(String s)

there is at least one cased character
there are no distinct-uppercase (Ld) characters
any lowercase letters must follow cased characters
there are no titlecase or uppercase letters, except following uncased characters
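
The three detection tests can be sketched together. This is an illustration only, assuming Python's unicodedata module for categories and decompositions, treating the start of the string as an uncased context, and skipping non-spacing marks when looking at the preceding letter. The helper case_class and the other function names are hypothetical.

```python
import unicodedata

IOTA_SUBSCRIPT = "\u0345"


def is_mark(ch):
    return unicodedata.category(ch) == "Mn" and ch != IOTA_SUBSCRIPT


def case_class(ch):
    """Return 'Lu', 'Ll', 'Lt', 'Ld', or None, following the guidelines."""
    if ch == IOTA_SUBSCRIPT:
        return "Ll"
    cat = unicodedata.category(ch)
    if not cat.startswith("L"):
        return None
    if unicodedata.normalize("NFD", ch).endswith(IOTA_SUBSCRIPT):
        return "Lt"
    if cat == "Lu" and ch.upper() != ch.title():
        return "Ld"
    return cat if cat in ("Lu", "Ll", "Lt") else None


def is_lowercase(s):
    classes = [case_class(ch) for ch in s]
    return any(classes) and not any(c in ("Lt", "Lu", "Ld") for c in classes)


def is_uppercase(s):
    classes = [case_class(ch) for ch in s]
    return any(classes) and not any(c in ("Lt", "Ll") for c in classes)


def is_titlecase(s):
    seen_cased = False
    prev_cased = False
    for ch in s:
        if is_mark(ch):
            continue  # disregard intervening non-spacing marks
        cls = case_class(ch)
        if cls == "Ld":
            return False
        if cls == "Ll" and not prev_cased:
            return False  # lowercase must follow a cased character
        if cls in ("Lt", "Lu") and prev_cased:
            return False  # title/uppercase only after an uncased character
        prev_cased = cls is not None
        seen_cased = seen_cased or prev_cased
    return seen_cased
```

Under these rules "Hello World" is titlecase, but "Hello world" is not, because the "w" is a lowercase letter that follows an uncased character.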

Caseless Matching

Caseless matching itself is only an approximation to the language-specific rules governing the strength of comparisons. Generally, where case distinctions are not important, other distinctions between Unicode characters are ignored as well. This can be derived from the collation data for the language, where only the first and second level differences are used. For more information, see UTR #10: Unicode Collation Algorithm.

However, there are many circumstances where a locale-insensitive loose match is required. The best way to do this is by using the following test:

toUppercase(canonicalDecomposition(a)) == toUppercase(canonicalDecomposition(b))

However, if the strings are in normalized form (see UTR #15: Unicode Normalization Forms), a simpler test can be used:

toUppercase(a) == toUppercase(b)
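
Both tests can be sketched in a few lines, assuming Python's unicodedata.normalize("NFD", ...) as the canonical decomposition and str.upper() as the uppercase mapping. The function names are hypothetical.

```python
import unicodedata


def caseless_match(a, b):
    """General test: decompose canonically, then compare uppercased forms."""
    def key(s):
        return unicodedata.normalize("NFD", s).upper()
    return key(a) == key(b)


def caseless_match_normalized(a, b):
    """Simpler test, valid only if both inputs are already normalized."""
    return a.upper() == b.upper()
```

The difference shows up with inputs that are not normalized alike: caseless_match("\u00E9", "E\u0301") succeeds, while caseless_match_normalized on the same pair fails because one string is precomposed and the other is decomposed.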

Neither of these takes account of locale variations (e.g. Turkish). If a locale-insensitive match is required, then the recommendation is to use a slightly looser match by also mapping 0130 "İ" capital i with dot to 0049 "I" capital i. In this case, the distinction between the dotted and undotted I's is lost.
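
One possible sketch of this looser match, assuming Python's unicodedata module. The extra mapping is applied before decomposition, because canonical decomposition splits 0130 into 0049 followed by combining dot above 0307; input that is already decomposed is not covered by this sketch. The names loose_key and loose_match are hypothetical.

```python
import unicodedata


def loose_key(s):
    # Map capital I with dot (U+0130) to plain capital I first, then apply
    # the general test: canonical decomposition followed by uppercasing.
    s = s.replace("\u0130", "I")
    return unicodedata.normalize("NFD", s).upper()


def loose_match(a, b):
    return loose_key(a) == loose_key(b)
```

With this key, the four letters I, i, 0130, and 0131 all compare equal, which is the loss of distinction described above.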

Copyright

Copyright © 1998-1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.