Unicode Standard Annex #44

Unicode Character Database

Version	Unicode 5.1
Authors	Mark Davis (markdavis@google.com) and Ken Whistler (ken@unicode.org)
Date	2008-03-18
This Version	http://www.unicode.org/reports/tr44/tr44-2.html
Previous Version	n/a
Latest Version	http://www.unicode.org/reports/tr44/
Revision	2

Summary

This annex consolidates information documenting the Unicode Character Database.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1. Introduction
2. Conformance
- 2.1 Simple and Derived Properties
3. Documentation Files
- 3.1 UCD.html
- 3.2 NamesList.html
- 3.3 Unihan.html and UAX #38
- 3.4 StandardizedVariants.html
- 3.5 Data File Comments
4. Test Files
- 4.1 NormalizationTest.txt
- 4.2 LineBreakTest.txt
- 4.3 Segmentation Test Files
5. UCD in XML
Acknowledgments
References
Modifications

Warning: the information in this file does not completely describe the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 5.0.0 of the standard unless otherwise indicated.

1 Introduction

The Unicode Character Database (UCD) is a collection of data files which contain the Unicode character code points and character names and which define the Unicode character properties and mappings between Unicode characters (such as case mappings).

This annex describes the UCD and provides a guide to the various documentation files associated with it.

The current version of the UCD is always located on the Unicode Web site at:

http://www.unicode.org/Public/UNIDATA/

The specific files for the UCD associated with this version of the Unicode Standard (5.1.0) are located at:

http://www.unicode.org/Public/5.1.0/

Stable, archived versions of the UCD associated with all earlier versions of the Unicode Standard can be accessed from:

http://www.unicode.org/ucd/

See Section 4.1, "Unicode Character Database", in [Unicode] for a general discussion of the UCD and its use in defining properties.

2 Conformance

The Unicode Character Database is an integral part of the Unicode Standard.

The UCD contains normative property and mapping information required for implementation of various Unicode algorithms such as the Unicode Bidirectional Algorithm, Unicode Normalization, and Unicode Casefolding. The data files also contain additional informative and provisional character property information.

Each specification of a Unicode algorithm, whether given in the text of [Unicode] or in one of the Unicode Standard Annexes, designates the data file(s) in the UCD that supply the normative property information the algorithm requires.

For information on the meaning and application of the terms normative, informative, and provisional, see Section 3.5, "Properties" in [Unicode].

2.1 Simple and Derived Properties

Some character properties in the UCD are simple properties. This status has no bearing on whether or not the properties are normative, but merely indicates that their values are not derived from some combination of other properties.

Other character properties are derived. This means that their values are derived by rule from some combination of other properties. Generally such rules are stated as set operations, and may or may not include explicit exception lists for individual characters.
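As an illustration of a rule stated as a set operation, the following minimal Python sketch applies the derivation Uppercase = Lu + Other_Uppercase, which is assumed here to match the derivation comment in DerivedCoreProperties.txt. The contents of OTHER_UPPERCASE_SAMPLE are a small hand-picked excerpt, not the full contributory property, and the General_Category values come from whatever Unicode version the Python runtime bundles. The function name derive_uppercase is hypothetical.

```python
import unicodedata

# Illustrative excerpt only (an assumption): a few ranges taken to carry the
# contributory property Other_Uppercase (Roman numerals, circled Latin capitals).
OTHER_UPPERCASE_SAMPLE = set(range(0x2160, 0x2170)) | set(range(0x24B6, 0x24D0))


def derive_uppercase(code_points):
    """Apply the rule Uppercase = Lu + Other_Uppercase to the given code points."""
    code_points = set(code_points)
    # Simple property: General_Category = Lu.
    lu = {cp for cp in code_points if unicodedata.category(chr(cp)) == "Lu"}
    # Set union with the contributory property, limited to the same universe.
    return lu | (OTHER_UPPERCASE_SAMPLE & code_points)


# Restrict the universe to a small range so the sketch stays fast.
uppercase = derive_uppercase(range(0x0000, 0x3000))
```

U+2160 ROMAN NUMERAL ONE has General_Category Nl, so it enters the derived set only through the contributory property. A sketch like this is useful for understanding a derivation; it is not a substitute for the listing in the data file.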

Sometimes simple properties are defined merely to make the statement of the rule defining a derived property more compact or general. Such properties are known as contributory properties and typically have their names prefixed by "Other_". Sometimes these contributory properties are defined to encapsulate the messiness inherent in exception lists. At other times, a contributory property may be defined to help stabilize the definition of an important derived property which is subject to stability guarantees.

Derived character properties are not considered second-class citizens among Unicode character properties. They are defined to make implementation of important algorithms easier to state. Included among the first-class derived properties important for such implementations are: Uppercase, Lowercase, XID_Start, XID_Continue, Math, and Default_Ignorable_Code_Point, all defined in DerivedCoreProperties.txt, and derived properties for optimization of normalization, defined in DerivedNormalizationProps.txt.

Implementations should simply use the derived properties, and should not try to rederive them from lists of simple properties and collections of rules, because of the chances for error and divergence when doing so.
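Using a derived property directly usually amounts to reading its listing from the data file. The sketch below parses lines of the form "code point or range ; property name # comment", which is assumed to be the layout of DerivedCoreProperties.txt. The sample lines are hand-written in that style rather than copied from the file, and parse_property_lines is a hypothetical helper name.

```python
def parse_property_lines(lines):
    """Collect code points per property from 'range ; Property # comment' lines."""
    props = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        fields = [field.strip() for field in data.split(";")]
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point or range
        props.setdefault(fields[1], set()).update(range(start, end + 1))
    return props


# Hand-written sample in the assumed style of DerivedCoreProperties.txt.
SAMPLE = """\
# Derived Property: Uppercase
0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z

0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00B5          ; Lowercase # L&       MICRO SIGN
""".splitlines()

props = parse_property_lines(SAMPLE)
```

With the real file, the same function could be fed an open file object, since it only iterates over lines.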

If there are any cases of mismatches between the definition of a derived property as listed in DerivedCoreProperties.txt or similar data files in the UCD, and the definition of a derived property as a set definition rule, the explicit listing in the data file should always be taken as the normative definition of the property.

Definitions of property derivations are provided for information only, typically in comment fields in the data files. Such definitions may be refactored, refined, or corrected over time. To ensure that there is never any ambiguity between versions of the standard, even if the definition of a derivation is changed at some point in time, the exact property listing in the data files for any given version of the standard is always authoritative for that property's values in that version, and that listing will itself never change for that version.

3 Documentation Files

The UCD also contains a number of documentation files, which provide information about the UCD as a whole, including file formats, status, and the derivation of derived properties.

3.1 UCD.html

UCD.html is the most important of the documentation files. It provides a complete listing of the UCD data files and character properties. It indicates which properties are normative and where they are defined. It provides further information required for the proper interpretation of some of the Unicode character properties.

UCD.html also records the modification history for the data files in the UCD, noting changes from version to version of the standard.

3.2 NamesList.html

NamesList.html formally describes (in BNF) the format of the NamesList.txt data file, the file which is used to drive the printing of the Unicode code charts and names list. See also Section 17.1, "Character Names List", in [Unicode] for a detailed discussion of the conventions used in the names list.

3.3 Unihan.html and UAX #38

Unihan.html describes the format and content of Unihan.txt, the data file which collects together all property information for CJK Unified Ideographs. As of Version 5.1.0 of the Unicode Standard, the content of Unihan.html has been incorporated into the new [UAX38], which is intended to supersede Unihan.html.

3.4 StandardizedVariants.html

StandardizedVariants.html documents standardized variants, showing a representative glyph for each. It is closely tied to the data file, StandardizedVariants.txt, which defines those sequences normatively.

3.5 Data File Comments

In addition to the specific documentation files for the UCD, individual data files often contain extensive header comments describing their content and any special conventions used in the data. In some instances, individual property definition sections are also commented with information about how the property may be derived.

4 Test Files

The UCD also contains a number of test data files, which specify, in standard formats, data which can be used to test implementation of Unicode algorithms.

4.1 NormalizationTest.txt

This file contains data which can be used to test an implementation of the Unicode Normalization Algorithm. (See [UAX15].)
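A minimal Python sketch of such a test follows. It assumes each data line holds five semicolon-separated fields of space-separated hexadecimal code points (source, NFC, NFD, NFKC, NFKD), and it checks only the NFC and NFD invariants assumed to be stated in the file header. The sample line is written by hand in that style, the helper names are hypothetical, and Python's unicodedata module is used as the implementation under test, so results reflect the Unicode version bundled with the runtime.

```python
import unicodedata


def decode(field):
    """Turn a field such as '0044 0307' into the string it denotes."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_line(line):
    """Check the NFC and NFD invariants for one test line."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    c1, c2, c3, c4, c5 = (decode(f) for f in data.split(";")[:5])

    def nfc(s):
        return unicodedata.normalize("NFC", s)

    def nfd(s):
        return unicodedata.normalize("NFD", s)

    return (
        c2 == nfc(c1) == nfc(c2) == nfc(c3)
        and c4 == nfc(c4) == nfc(c5)
        and c3 == nfd(c1) == nfd(c2) == nfd(c3)
        and c5 == nfd(c4) == nfd(c5)
    )


# Hand-written sample line in the assumed format.
SAMPLE_LINE = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"
ok = check_line(SAMPLE_LINE)
```

The NFKC and NFKD checks would follow the same pattern, comparing the fourth and fifth fields against the compatibility forms of all five fields.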

4.2 LineBreakTest.txt

This file, located in the auxiliary directory of the UCD, contains data which can be used to test an implementation of the Unicode Linebreaking Algorithm. (See [UAX14].)

There is an associated documentation file, LineBreakTest.html, which displays the results of the Linebreaking Algorithm in an interactive chart form, with a documented listing of the rules.

4.3 Segmentation Test Files

The following three data files are also located in the auxiliary directory of the UCD:

GraphemeBreakTest.txt
SentenceBreakTest.txt
WordBreakTest.txt

They contain data which can be used to test an implementation of the segmentation algorithms specified in [UAX29].

There are also associated documentation files, which display the results of the segmentation algorithms in an interactive chart form, with a documented listing of the rules:

GraphemeBreakTest.html
SentenceBreakTest.html
WordBreakTest.html
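The break test data files can be read with a few lines of code. The following Python sketch assumes the line format alternates a break marker (÷ for a break opportunity, × for no break) with a hexadecimal code point, followed by a # comment. The sample line is hand-written, and parse_break_test and expected_segments are hypothetical helper names. The sketch only extracts the expected segmentation; comparing it against an actual segmentation implementation is left to the caller.

```python
def parse_break_test(line):
    """Return (text, boundaries) for a line such as '÷ 0061 × 0308 ÷ 0062 ÷'."""
    tokens = line.split("#", 1)[0].split()  # drop the comment, split on whitespace
    marks, code_points = tokens[0::2], tokens[1::2]
    text = "".join(chr(int(cp, 16)) for cp in code_points)
    # boundaries[i] is True when a break is expected before position i;
    # the final entry describes the end of the text.
    boundaries = [mark == "÷" for mark in marks]
    return text, boundaries


def expected_segments(text, boundaries):
    """Split text at every position marked as a break."""
    cuts = {i for i, is_break in enumerate(boundaries) if is_break}
    cuts |= {0, len(text)}  # always cut at the start and end of the text
    cuts = sorted(cuts)
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


# Hand-written sample line in the assumed format.
text, boundaries = parse_break_test("÷ 0061 × 0308 ÷ 0062 ÷  # hand-written sample")
segments = expected_segments(text, boundaries)
```

Here the sample describes "a" followed by a combining diaeresis as one segment and "b" as a second.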

5 UCD in XML

[UAX42] defines an XML schema which is used to incorporate all of the Unicode character property information into an XML version of the UCD.

Starting with Version 5.1.0, a set of XML data files using that schema are also released with each version of the UCD. Those data files make it possible to import and process the UCD property data using standard XML parsing tools, instead of the specialized parsing required for the various individual data files of the UCD.
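For example, a standard XML parser can pull property values out of such a file with very little code. The Python sketch below assumes the namespace http://www.unicode.org/ns/2003/ucd/1.0 and the attribute names cp (code point), na (name), and gc (General_Category); these should be verified against [UAX42]. The XML fragment is a hand-written sample rather than an excerpt from the released files, and general_categories is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace and attribute names (cp, na, gc); check them against [UAX42].
UCD_NS = {"ucd": "http://www.unicode.org/ns/2003/ucd/1.0"}

# Hand-written sample in the assumed shape of the XML version of the UCD.
SAMPLE_XML = """\
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>
    <char cp="0061" na="LATIN SMALL LETTER A" gc="Ll"/>
  </repertoire>
</ucd>
"""


def general_categories(xml_text):
    """Map each code point in the document to its General_Category value."""
    root = ET.fromstring(xml_text)
    return {
        int(char.get("cp"), 16): char.get("gc")
        for char in root.iterfind(".//ucd:char", UCD_NS)
        if char.get("cp") is not None  # skip elements without a single code point
    }


categories = general_categories(SAMPLE_XML)
```

For the full data files, a streaming parser such as ET.iterparse may be preferable to loading the whole document at once.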

Acknowledgments

Mark Davis and Ken Whistler are the authors of the initial version and have added to and maintained the text of this annex.

References

For references for this annex, see Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.”

Modifications

For details of the change history, see the online copy of this annex at http://www.unicode.org/reports/tr44/.

The following summarizes modifications from previous revisions of this annex.

Revision 2

Initial approved version

Revision 1

Initial draft

Copyright © 2000-2008 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.