L2/99-190

PROPOSED DRAFT Unicode Technical Report #21
Case Mappings

Revision

1

Authors

Mark Davis (mark@unicode.org)

Date

1999-04-30

This Version

n/a

Previous Version

n/a

Latest Version

n/a

Unicode Technical Reports

http://www.unicode.org/unicode/techreports.html

Summary

This document presents guidelines for case operations: case conversion, case detection, and caseless matching.

Status of this document

This document is an unpublished, preliminary working draft. It is posted for general review. At its next meeting, the Unicode Technical Committee (UTC) may reject this document, review it for suitability to progress to draft status, and/or further amend it. Please mail any comments to the authors.


Introduction

The case mappings in the Unicode Character Database are informative, default mappings. Case itself, on the other hand, has normative status. Thus, for example, 0041 "A" is normatively uppercase, but its lowercase mapping to 0061 "a" is informative. The reason for this is that case can be considered to be an inherent property of a particular character (and is usually, but not always, derivable from the presence of the terms "CAPITAL" or "SMALL" in the character name), but case mappings between characters are occasionally influenced by local conventions.

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

For compatibility with existing parsers, UnicodeData.txt only contains case mappings for characters where they are one-to-one mappings; it also omits information about locale-specific case mappings. Information about these special cases can be found in a separate data file, SpecialCasing.txt, which has been added starting with the 2.1.8 update to the Unicode data files. SpecialCasing.txt contains additional informative case mappings that are either not one-to-one or context-sensitive. This data should be included when implementing case operations.
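As a rough illustration of how the additional data might be consumed, the following Python sketch parses entries in a semicolon-separated layout of the form "code; lower; title; upper; optional condition list; # comment". That layout, the function name parse_special_casing, and the sample lines are assumptions made for this example; the header of the actual SpecialCasing.txt file is the authoritative description of its format.

```python
def parse_special_casing(lines):
    """Parse SpecialCasing.txt-style lines into a dict keyed by character.

    Assumed layout (hypothetical for this sketch):
        <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
    """
    def to_str(hex_seq):
        # "0053 0073" -> "Ss"
        return "".join(chr(int(h, 16)) for h in hex_seq.split())

    mappings = {}
    for raw in lines:
        # Drop the trailing comment, then skip blank or comment-only lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        # A trailing ';' leaves an empty final field; drop it.
        if fields[-1] == "":
            fields.pop()
        code, lower, title, upper = fields[:4]
        conditions = fields[4].split() if len(fields) > 4 else []
        mappings.setdefault(chr(int(code, 16)), []).append({
            "lower": to_str(lower),
            "title": to_str(title),
            "upper": to_str(upper),
            "conditions": conditions,
        })
    return mappings


# Sample entries written in the assumed layout.
SAMPLE = [
    "# Sample entries",
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S",
    "FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF",
]

table = parse_special_casing(SAMPLE)
print(table["\u00df"][0]["upper"])  # prints SS
```

Keeping a list of entries per character leaves room for several context-sensitive or locale-specific mappings of the same character, each with its own condition list.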

Guidelines

There are a number of fine points in case operations that programmers need to be aware of, especially with regard to titlecase. In all of the guidelines given below:

Case Conversion

toUppercase(String s)

toLowercase(String s)

toTitlecase(String s)

Case Detection

isLowercase(String s)

isUppercase(String s)

isTitlecase(String s)

Caseless Matching

Caseless matching itself is only an approximation to the language-specific rules governing the strength of comparisons. Generally, where case distinctions are not important, other distinctions between Unicode characters are ignored as well. This can be derived from the collation data for the language, where only the first and second level differences are used. For more information, see UTR #10: Unicode Collation Algorithm.

However, there are many circumstances where a locale-insensitive loose match is required. The best way to do this is by using the following test:

toUppercase(canonicalDecomposition(a)) == toUppercase(canonicalDecomposition(b))

However, if the strings are in normalized form (see UTR #15: Unicode Normalization Forms), a simpler test can be used:

toUppercase(a) == toUppercase(b)
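The two tests above can be sketched in Python as follows. This is only an approximation under stated assumptions: Python's built-in str.upper() stands in for toUppercase, and unicodedata.normalize("NFD", ...) stands in for canonicalDecomposition. The function names caseless_match and caseless_match_normalized are hypothetical.

```python
import unicodedata


def caseless_match(a, b):
    """Compare two strings after canonical decomposition and uppercasing."""
    key_a = unicodedata.normalize("NFD", a).upper()
    key_b = unicodedata.normalize("NFD", b).upper()
    return key_a == key_b


def caseless_match_normalized(a, b):
    """Simpler test; only reliable when both inputs share a normalized form."""
    return a.upper() == b.upper()


# Precomposed vs. decomposed spellings of the same text.
print(caseless_match("\u00c5ngstr\u00f6m", "a\u030angstro\u0308m"))  # prints True

# A mapping that is not one-to-one: sharp s uppercases to "SS".
print(caseless_match_normalized("stra\u00dfe", "STRASSE"))  # prints True
```

The simpler test gives a different answer when the inputs are not in the same normalized form: a precomposed U+00C5 and the sequence "a" followed by U+030A do not compare equal unless they are decomposed first.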

Neither of these tests takes locale variations (e.g. Turkish) into account. If a locale-insensitive match is required, then the recommendation is to use a slightly looser match by also mapping 0130 "İ" (capital I with dot) to 0049 "I" (capital I). In this case, the distinction between the dotted and undotted I's is lost.
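A minimal Python sketch of this looser match is shown below. It assumes that the input strings are in a normalized form in which 0130 appears as a single precomposed character, and it uses Python's str.upper() as a stand-in for toUppercase; the names loose_caseless_key and loose_caseless_match are hypothetical.

```python
def loose_caseless_key(s):
    """Build a comparison key that ignores the dotted/undotted I distinction."""
    # Map U+0130 (capital I with dot) to U+0049 (capital I) before uppercasing.
    return s.replace("\u0130", "\u0049").upper()


def loose_caseless_match(a, b):
    return loose_caseless_key(a) == loose_caseless_key(b)


# Capital I with dot now matches a plain small i.
print(loose_caseless_match("\u0130stanbul", "istanbul"))  # prints True

# Small dotless i (U+0131) and capital I with dot (U+0130) also match.
print(loose_caseless_match("\u0131", "\u0130"))  # prints True
```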


Copyright

Copyright © 1998-1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.