UNICODE CHARACTER DATABASE

Revision 5.1.0 (draft 5)
Authors Mark Davis and Ken Whistler
Date 2008-01-16
This Version http://www.unicode.org/Public/5.1.0/ucd/UCD.html
Previous Version http://www.unicode.org/Public/5.0.0/ucd/UCD.html
Latest Version http://www.unicode.org/Public/UNIDATA/UCD.html


Summary

This document describes the format and content of the Unicode Character Database (UCD).

Status

This file and the files described herein are part of the Unicode Character Database and are governed by the terms of use at http://www.unicode.org/terms_of_use.html.

The References provide related information that is useful in understanding this document.

Warning: the information in this file does not completely describe the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 5.0.0 of the standard unless otherwise indicated.

Introduction

The Unicode Character Database (UCD) is a set of files that define the Unicode character properties and internal mappings. This document describes the properties and files that are part of The Unicode Standard, Version 5.1.0 [U5.1.0]. For a description of the changes in this version, see Modification History.

The file structure for the UCD changed in Version 4.1.0. From that point on, the successive versions of the UCD are complete versions, so that users of the standard do not need to assemble the correct version of each file from different update directories for previous versions in order to have a complete set of files for a version. Each version is in a directory of the following form:

http://www.unicode.org/Public/5.1.0/ucd/

Within this directory the structure is the same as in versions prior to 4.1.0, with two changes:

Conformance

For information on the meaning and application of the terms normative, informative, and provisional, see Section 3.5, "Properties" in the Unicode Standard, Version 5.0.

UCD File Format

Files in the UCD use the following format, unless otherwise specified.

UCD Property Files

The following table describes the format and meaning of each property data file in the main directory of the UCD. (An index by property name, rather than file, is found at Properties.) The first column lists the files and the properties for which they contain data. The second column indicates the type of the property: String, Numeric, Enumeration (non-binary), Binary, Catalog, or Miscellaneous. Catalog properties have enumerated values which are expected to be regularly extended with successive versions of the Unicode Standard. This distinguishes them from Enumeration properties, whose enumerated values constitute a logical partition space, for which new values will generally not be added in successive versions of the standard. An example of a Catalog property is the Block property. Miscellaneous properties do not fit into the other property categories, and currently include character names, comments about characters, or the Unicode_Radical_Stroke property (a combination of numeric values). The third column indicates the status (Normative, Informative, or Provisional), and the fourth column provides a description of the data.

The files with a small number of properties are listed first, followed by the files with a large number of properties: DerivedCoreProperties.txt, DerivedNormalizationProps.txt, PropList.txt, and UnicodeData.txt. For UnicodeData.txt, the field numbers are supplied in the description. In a number of cases, a field in a data file supplies only part of the values of a UCD property; for example, the name field in UnicodeData.txt does not provide all the values for the Name property, and Jamo.txt must be used as well.

None of these properties should be used without consulting the relevant discussions in the Unicode Standard.

Where a data file does not explicitly list property values for all code points, the code points are given default property values. These default property values are documented in the data files, with the exception of UnicodeData.txt. For that case the default property values are listed below in parentheses after the property name, with (=) indicating the code point itself.  The default property values are also documented in any corresponding extracted data file.

ArabicShaping.txt
Joining_Type
Joining_Group
E N Basic Arabic and Syriac character shaping properties, such as initial, medial and final shapes. See Section 8.2 in [Unicode].


BidiMirroring.txt 
Bidi_Mirroring_Glyph S I Properties for substituting characters in an implementation of bidirectional mirroring. See UAX #9: The Bidirectional Algorithm [BIDI]. Do not confuse this with the Bidi_Mirrored property.
Blocks.txt 
Block C N List of block names, which are arbitrary names for ranges of code points. See Chapter 17 in [Unicode].
CompositionExclusions.txt 
Composition_Exclusion B N Properties for normalization. See UAX #15: Unicode Normalization Forms [Norm]. Unlike other files, CompositionExclusions.txt simply lists the relevant code points.
CaseFolding.txt 
Simple_Case_Folding
Case_Folding
S N Mapping from characters to their case-folded forms. This is an informative file containing normative derived properties.

Note: The value may be omitted in the data file if it is the same as the code point itself.

Derived from UnicodeData and SpecialCasing.

DerivedAge.txt 
Age C N/I This file shows when various code points were designated/assigned in successive versions of the Unicode standard.
EastAsianWidth.txt 
East_Asian_Width E I Properties for determining the choice of wide vs. narrow glyphs in East Asian contexts. Property values are described in UAX #11: East Asian Width [Width].

HangulSyllableType.txt

Hangul_Syllable_Type
 
E N The values L, V, T, LV, and LVT used in Chapter 3 in [Unicode].

Jamo.txt

Jamo_Short_Name
 
S N The Hangul Syllable names are derived from the Jamo Short Names, as described in Chapter 3 in [Unicode].
LineBreak.txt 
Line_Break E N Properties for line breaking. For more information, see UAX #14: Line Breaking Properties [Line].

NameAliases.txt

Name_Alias
 
M N Normative formal aliases for characters with erroneous names, as described in Chapter 4. These aliases match exactly the formal aliases published in the code charts of the Unicode Standard.

NormalizationCorrections.txt 

used in Decomposition Mappings S N NormalizationCorrections lists code point differences for Normalization Corrigenda. For more information, see UAX #15: Unicode Normalization Forms [Norm].
PropertyAliases.txt
n/a S N/I Property names and abbreviations. These names can be used for XML formats of UCD data, for regular-expression property tests, and other programmatic textual descriptions of Unicode data.
PropertyValueAliases.txt
n/a S N/I Property value names and abbreviations. These names can be used for XML formats of UCD data, for regular-expression property tests, and other programmatic textual descriptions of Unicode data.
Scripts.txt 
Script C I Default script values for use in regular expressions. For more information, see UAX #24: Script Names [Script].
SpecialCasing.txt
Uppercase_Mapping
Lowercase_Mapping
Titlecase_Mapping
Special_Case_Condition
S I Data for producing (in combination with UnicodeData.txt) the full case mappings.

Note: The value may be omitted in the data file if it is the same as the code point itself; in the case of Titlecase_Mapping, if it is the same as the uppercase.

Unihan.txt (for more information, see Unihan.html)
Numeric_Type
Numeric_Value
E I The characters tagged with kPrimaryNumeric, kAccountingNumeric, and kOtherNumeric are given the Numeric_Type numeric, and the values indicated.

Most characters have these properties based on values from the UnicodeData.txt data file. See Numeric_Type.

Unicode_Radical_Stroke

 

M I The Unicode radical stroke count, based on the tag kRSUnicode.
DerivedCoreProperties.txt 
Alphabetic B I Characters with the Alphabetic property. For more information, see Chapter 4 in [Unicode].

Generated from: Other_Alphabetic + Lu + Ll + Lt + Lm + Lo + Nl

Default_Ignorable_Code_Point B N For programmatic determination of default ignorable code points. New characters that should be ignored in rendering (unless explicitly supported) will be assigned in these ranges, permitting programs to correctly handle the default rendering of such characters when not otherwise supported. For more information, see the FAQ Display of Unsupported Characters, and Section 5.20 in [Unicode].

Generated from: Other_Default_Ignorable_Code_Point
+ Cf + Cc + Cs + Noncharacter_Code_Point
+ Variation_Selector
- White_Space
- FFF9..FFFB (annotation characters)
- 0600..0603, 06DD, 070F (special Arabic and Syriac formatting characters)
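
The derivation above is ordinary set algebra over code points. The following is a minimal sketch in Python, assuming the contributing sets have already been loaded from PropList.txt and UnicodeData.txt. The function name and the toy input sets are hypothetical, and the inputs cover only a handful of code points for illustration.

```python
def derive_default_ignorable(other_di, cf, cc, cs, nonchar, var_sel, white_space):
    """Apply the 'Generated from' formula to sets of integer code points."""
    included = other_di | cf | cc | cs | nonchar | var_sel
    excluded = (
        white_space
        | set(range(0xFFF9, 0xFFFB + 1))   # annotation characters
        | set(range(0x0600, 0x0603 + 1))   # special Arabic formatting characters
        | {0x06DD, 0x070F}                 # further Arabic and Syriac exceptions
    )
    return included - excluded


# Toy inputs: tiny subsets of the real property sets, for illustration only.
toy_result = derive_default_ignorable(
    other_di={0x034F},                 # COMBINING GRAPHEME JOINER
    cf={0x00AD, 0x0600, 0xFFF9},       # a few Cf characters
    cc={0x0001, 0x0009},               # a few Cc characters
    cs=set(),
    nonchar={0xFFFF},
    var_sel={0xFE00},
    white_space={0x0009, 0x0020},
)
print(sorted(hex(cp) for cp in toy_result))
```

In this toy run U+0009 is dropped because it is White_Space, and U+0600 and U+FFF9 are dropped by the explicit range exclusions.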

Lowercase B I Characters with the Lowercase property. For more information, see Chapter 4 in [Unicode].

Generated from: Other_Lowercase + Ll

Grapheme_Base B I For programmatic determination of grapheme cluster boundaries. For more information, see UAX #29: Text Boundaries [Breaks].

Generated from: [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp - Grapheme_Extend

Grapheme_Extend B I For programmatic determination of grapheme cluster boundaries. For more information, see UAX #29: Text Boundaries [Breaks].

Generated from: Other_Grapheme_Extend + Me + Mn

Note: Depending on an application's interpretation of Co (private use) characters, they may be in Grapheme_Base, in Grapheme_Extend, or in neither.

ID_Start B I Used to determine programming identifiers, as described in UAX #31: Identifier and Pattern Syntax [Pattern].
ID_Continue B I
Math B I Characters with the Math property. For more information, see Chapter 4 in [Unicode].

Generated from: Sm + Other_Math

Uppercase B I Characters with the Uppercase property. For more information, see Chapter 4 in [Unicode].

Generated from: Lu + Other_Uppercase

XID_Start B I Used to determine programming identifiers, as described in UAX #31: Identifier and Pattern Syntax [Pattern].
XID_Continue B I
DerivedNormalizationProps.txt 
Full_Composition_Exclusion B N Characters that are excluded from composition: those explicitly in CompositionExclusions.txt, plus:
(3) Singleton Decompositions
(4) Non-Starter Decompositions
Expands_On_NFC
Expands_On_NFD
Expands_On_NFKC
Expands_On_NFKD
B N Characters that expand to more than one character in the specified normalization form.
FC_NFKC_Closure S N Characters that require extra mappings for closure under Case Folding plus Normalization Form KC. Characters marked with this property have a third field with the mapping in it. Generated with the following, where Fold is the default fold operation (not Turkic):
b = NFKC(Fold(a));
c = NFKC(Fold(b));
if (c != b) add mapping from a to c

Note: The value may be omitted in the data file if it is the same as the code point itself.
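
The generation rule for FC_NFKC_Closure can be tried out with Python's standard library. This is only a sketch: it assumes that str.casefold() is an acceptable stand-in for the default (non-Turkic) Fold operation, and it uses whatever Unicode version is bundled with the Python interpreter rather than Version 5.1.0. The function name is hypothetical.

```python
import unicodedata


def fc_nfkc_closure(ch):
    """Return the extra mapping for ch, or None if no mapping is needed."""
    def fold_nfkc(s):
        # str.casefold() stands in for Fold (assumption); it is not locale-sensitive.
        return unicodedata.normalize("NFKC", s.casefold())

    b = fold_nfkc(ch)
    c = fold_nfkc(b)
    return c if c != b else None


mapping = fc_nfkc_closure("\u037A")   # GREEK YPOGEGRAMMENI
print([hex(ord(x)) for x in mapping])
print(fc_nfkc_closure("a"))
```

For U+037A the first pass yields <U+0020, U+0345>, and folding that result again turns U+0345 into U+03B9, so a mapping is needed. For "a" the two passes agree and no mapping is recorded.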

NFD_Quick_Check
NFKD_Quick_Check
NFC_Quick_Check
NFKC_Quick_Check
E N For property values, see Decompositions and Normalization.
PropList.txt
ASCII_Hex_Digit B N ASCII characters commonly used for the representation of hexadecimal numbers.
Bidi_Control B N Those format control characters which have specific functions in the Bidirectional Algorithm.
Dash B I Those punctuation characters explicitly called out as dashes in the Unicode Standard, plus compatibility equivalents to those. Most of these have the Pd General Category, but some have the Sm General Category because of their use in mathematics.
Deprecated B N For a machine-readable list of deprecated characters. No characters will ever be removed from the standard, but the usage of deprecated characters is strongly discouraged.
Diacritic B I Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics.
Extender B I Characters whose principal function is to extend the value or shape of a preceding alphabetic character. Typical of these are length and iteration marks.
Grapheme_Link B N Used in determining default grapheme cluster boundaries. For more information, see UAX #29: Text Boundaries [Breaks].
Hex_Digit B I Characters commonly used for the representation of hexadecimal numbers, plus their compatibility equivalents.
Hyphen (Stabilized as of 3.2) B I Those dashes used to mark connections between pieces of words, plus the Katakana middle dot. The Katakana middle dot functions like a hyphen, but is shaped like a dot rather than a dash.
Ideographic B I Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs.
IDS_Binary_Operator B N Used in Ideographic Description Sequences.
IDS_Trinary_Operator B N Used in Ideographic Description Sequences.
Join_Control B N Those format control characters which have specific functions for control of cursive joining and ligation.
Logical_Order_Exception B N There are a small number of characters that do not use logical order. These characters require special handling in most processing.
Noncharacter_Code_Point B N Code points that are permanently reserved for internal use.
Other_Alphabetic B I Used in deriving the Alphabetic property.
Other_Default_Ignorable_Code_Point B N Used in deriving the Default_Ignorable_Code_Point property.
Other_Grapheme_Extend B N Used in deriving the Grapheme_Extend property.
Other_ID_Continue B N Used for backwards compatibility of ID_Continue
Other_ID_Start B N Used for backwards compatibility of ID_Start
Other_Lowercase B I Used in deriving the Lowercase property.
Other_Math B I Used in deriving the Math property.
Other_Uppercase B I Used in deriving the Uppercase property.
Pattern_Syntax B N Used for pattern syntax as described in UAX #31: Identifier and Pattern Syntax [Pattern].
Pattern_White_Space B N
Quotation_Mark B I Those punctuation characters that function as quotation marks.
Radical B N Used in Ideographic Description Sequences.
Soft_Dotted B N Characters with a "soft dot", like i or j. An accent placed on these characters causes the dot to disappear. An explicit dot above can be added where required, such as in Lithuanian.
STerm B I Sentence Terminal. Used in UAX #29: Text Boundaries [Breaks].
Terminal_Punctuation B I Those punctuation characters that generally mark the end of textual units.
Unified_Ideograph B N Used in Ideographic Description Sequences.
Variation_Selector B N Indicates all those characters that qualify as Variation Selectors. For details on the behavior of these characters, see StandardizedVariants.html and Section 16.4, Variation Selectors in [Unicode].
White_Space B N Those separator characters and control characters which should be treated by programming languages as "white space" for the purpose of parsing elements.

Note: ZERO WIDTH SPACE and ZERO WIDTH NO-BREAK SPACE are not included, since their functions are restricted to line-break control. Their names are unfortunately misleading in this respect.

Note: There are other senses of "whitespace" that encompass a different set of characters.

UnicodeData.txt 

Name* (<reserved>) M N (1) These names match exactly the names published in the code charts of the Unicode Standard. The Hangul Syllable names are omitted from this file; see Jamo.txt.
General_Category (Cn) E N (2) This is a useful breakdown into various character types which can be used as a default categorization in implementations. For the property values, see General Category Values.
Canonical_Combining_Class (0) N N (3) The classes used for the Canonical Ordering Algorithm in the Unicode Standard. For the property value names associated with different numeric values, see DerivedCombiningClass.txt and Canonical Combining Class Values.
Bidi_Class (L, AL, R) E N (4) These are the categories required by the Bidirectional Behavior Algorithm in the Unicode Standard. For the property values, see Bidi Class Values. For more information, see UAX #9: The Bidirectional Algorithm [BIDI].

The default property values depend on the code point, and are given in extracted/DerivedBidiClass.txt.

Decomposition_Type (None)
Decomposition_Mapping (=)
E
S
N (5) This field contains both values, with the type in angle brackets. The decomposition mappings match exactly the decomposition mappings published with the character names in the Unicode Standard. For more information, see Character Decomposition Mapping.

Note: The decomposition mapping may be omitted in the data file if it is the same as the code point itself.

Numeric_Type (None)
Numeric_Value (Not a Number)
E
N
N (6) If the character has the decimal digit property, as specified in Chapter 4 of the Unicode Standard, then the value of that digit is represented with an integer value in fields 6, 7, and 8.
E
N
N (7) If the character has the digit property, but is not a decimal digit, then the value of that digit is represented with an integer value in fields 7 and 8. This covers digits that need special handling, such as the compatibility superscript digits.
E
N
N (8) If the character has the numeric property, as specified in Chapter 4 of the Unicode Standard, the value of that character is represented with a positive or negative integer or rational number in this field. This includes fractions, such as "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.

Some characters have these properties based on values from the Unihan data file. See Numeric_Type, Han.

Bidi_Mirrored (N) B N (9) If the character has been identified as a "mirrored" character in bidirectional text, this field has the value "Y"; otherwise "N". The list of mirrored characters is also printed in Chapter 4 of the Unicode Standard. Do not confuse this with the Bidi_Mirroring_Glyph property.
Unicode_1_Name (<none>) M I (10) This is the old name as published in Unicode 1.0. This name is only provided when it is significantly different from the current name for the character. The value of field 10 for control characters does not always match the Unicode 1.0 names. Instead, field 10 contains ISO 6429 names for control functions, for printing in the code charts.
ISO_Comment (<none>) M I (11) This is the ISO 10646 comment field. It appears in parentheses in the 10646 names list, or contains an asterisk to mark an Annex P note.
Simple_Uppercase_Mapping (=) S N (12) Simple uppercase mapping (single character result). If a character is part of an alphabet with case distinctions, and has a simple upper case equivalent, then the upper case equivalent is in this field. See the explanation below on case distinctions. The simple mappings have a single character result, where the full mappings may have multi-character results. For more information, see Case Mappings.

Note: The simple uppercase may be omitted in the data file if the uppercase is the same as the code point itself.

Simple_Lowercase_Mapping (=) S N (13) Simple lowercase mapping (single character result). Similar to Uppercase mapping.

Note: The simple lowercase may be omitted in the data file if the lowercase is the same as the code point itself.

Simple_Titlecase_Mapping (=) S N (14) Similar to Uppercase mapping (single character result).

Note: The simple titlecase may be omitted in the data file if the titlecase is the same as the uppercase.
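
The defaulting rules for fields 12 through 14 can be sketched as follows. This is an illustration only: the function name is hypothetical, and the sample record is hand-written with fields 13 and 14 deliberately left empty to exercise the defaults.

```python
def simple_case_mappings(line):
    """Return (code point, upper, lower, title) for one UnicodeData.txt line."""
    fields = line.split(";")
    cp = int(fields[0], 16)
    upper = int(fields[12], 16) if fields[12] else cp      # default: the code point itself
    lower = int(fields[13], 16) if fields[13] else cp      # default: the code point itself
    title = int(fields[14], 16) if fields[14] else upper   # default: same as the uppercase
    return cp, upper, lower, title


# Hypothetical record: only the uppercase mapping (field 12) is filled in.
sample = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;"
print([hex(v) for v in simple_case_mappings(sample)])
```

Resolving the titlecase default after the uppercase one matters: an empty field 14 falls back to the uppercase value, not to the code point.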

Note:

Stabilized properties are no longer actively maintained, nor are they extended as new characters are added.

Auxiliary Property Files

A number of auxiliary properties are contained in files in the auxiliary subdirectory. They consist of the following:

GraphemeBreakProperty.txt   N/I  
Grapheme_Cluster_Break E I

See UAX #29: Text Boundaries [Breaks]

SentenceBreakProperty.txt      
Sentence_Break E I

See UAX #29: Text Boundaries [Breaks]

WordBreakProperty.txt      
Word_Break E I

See UAX #29: Text Boundaries [Breaks]


Derived Extracted Property Files

The following properties of the UCD have been separated out, reformatted, and listed in range format, one property per file, except as noted. These files are provided purely as a reformatting of existing data; any exceptions are noted in the table below. All files for derived extracted properties are contained in a subdirectory called extracted.

Files N/I Definition and Generation
DerivedBidiClass N From UnicodeData.txt, field 4
DerivedBinaryProperties N The Bidi_Mirrored property from UnicodeData.txt, field 9. See Bidi Note.
DerivedCombiningClass N From UnicodeData.txt, field 3
DerivedDecompositionType * From the <tag> in UnicodeData.txt, field 5. For characters with canonical decomposition mappings (no tag), the value "canonical" is used.

* The value "canonical" is normative; the others are informative.

DerivedEastAsianWidth I From EastAsianWidth.txt, field 1
DerivedGeneralCategory N From UnicodeData.txt, field 2
DerivedJoiningGroup N From ArabicShaping.txt, field 2
DerivedJoiningType N From ArabicShaping.txt, field 1
DerivedLineBreak N From LineBreak.txt, field 1. For more information, see UAX #14: Line Breaking Properties [Line].
DerivedNumericType N The property value is based on the contents of UnicodeData.txt, fields 6 through 8:
 
property value non-empty fields
decimal 6, 7, & 8
digit 7 & 8
numeric 8
DerivedNumericValues N The numeric value from UnicodeData.txt, field 8
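
The relationship between fields 6 through 8 and the Numeric_Type value can be sketched in Python as below. The function name is hypothetical and the sample records are written out by hand for illustration; the sketch ignores the additional values that come from the Unihan data.

```python
from fractions import Fraction


def numeric_type_and_value(line):
    """Derive (Numeric_Type, Numeric_Value) from one UnicodeData.txt line."""
    fields = line.split(";")
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if not numeric:
        return "None", None            # default: no numeric value
    if decimal:
        kind = "decimal"               # fields 6, 7, and 8 are non-empty
    elif digit:
        kind = "digit"                 # fields 7 and 8 are non-empty
    else:
        kind = "numeric"               # only field 8 is non-empty
    return kind, Fraction(numeric)     # handles integers and forms such as "1/5"


samples = [
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
    "00B9;SUPERSCRIPT ONE;No;0;EN;<super> 0031;;1;1;N;SUPERSCRIPT DIGIT ONE;;;;",
    "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
]
for record in samples:
    print(record.split(";")[0], numeric_type_and_value(record))
```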

Bidi Note: The Bidi_Mirrored property and the Bidi_Mirroring_Glyph property are different. The former is a normative property that indicates whether characters are mirrored in a right-to-left context in the Unicode Bidirectional Algorithm. The latter is an informative mapping of a subset of the Bidi_Mirrored characters to characters that normally have the corresponding mirrored glyph.

Other UCD Files

The following files in the Unicode Character Database are not used directly for Unicode properties. For more information about these files, see the referenced technical report(s), files, or section of the Unicode Standard.

".txt" File Description N/I Summary
Index Chapter 17 I Index to Unicode characters, as printed in the Unicode Standard.
NamesList Chapter 17 I This file duplicates some of the material in the UnicodeData file, and adds annotations used in the character charts.
NormalizationTest UAX #15 N Test file for conformance to Unicode Normalization Forms.

See UAX #15: Unicode Normalization Forms [Norm]

StandardizedVariants Chapter 16 N Lists all the standardized variant sequences that have been defined, plus a description of the desired appearance. StandardizedVariants.html contains this information, plus a sample glyph showing the desired features.
NamedSequences UAX #34 N Lists the names for all approved named sequences.
NamedSequencesProv UAX #34 P Lists the names for all provisional named sequences.

 

Properties

The following table lists the properties in the UCD. They are roughly organized into groups based on the usage of the property; this grouping is purely for convenience and has no other implications. The link on each property leads to its description in the file index. The contributory properties (those of the form Other_XXX) are sets of exceptions used to generate properties in DerivedCoreProperties.txt. They are incomplete by themselves and not intended for independent use; for example, an API returning property values would implement the corresponding derived core property instead.

General: Name, Name_Alias, Block, Age, General_Category, Script, White_Space, Alphabetic, Hangul_Syllable_Type, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Deprecated, Logical_Order_Exception, Variation_Selector
Case: Uppercase, Lowercase, Lowercase_Mapping, Titlecase_Mapping, Uppercase_Mapping, Case_Folding, Simple_Lowercase_Mapping, Simple_Titlecase_Mapping, Simple_Uppercase_Mapping, Simple_Case_Folding, Special_Case_Condition, Soft_Dotted
Identifiers: ID_Continue, ID_Start, XID_Continue, XID_Start, Pattern_Syntax, Pattern_White_Space
Decomposition and Normalization: Canonical_Combining_Class, Decomposition_Mapping, Composition_Exclusion, Full_Composition_Exclusion, Decomposition_Type, FC_NFKC_Closure, NFC_Quick_Check, NFKC_Quick_Check, NFD_Quick_Check, NFKD_Quick_Check, Expands_On_NFC, Expands_On_NFD, Expands_On_NFKC, Expands_On_NFKD
Shaping and Rendering: Join_Control, Joining_Group, Joining_Type, Line_Break, Grapheme_Cluster_Break, Sentence_Break, Word_Break, East_Asian_Width
Bidi: Bidi_Control, Bidi_Mirrored, Bidi_Class, Bidi_Mirroring_Glyph
Numeric: Numeric_Value, Numeric_Type, Hex_Digit, ASCII_Hex_Digit
CJK: Ideographic, Unified_Ideograph, Radical, IDS_Binary_Operator, IDS_Trinary_Operator, Unicode_Radical_Stroke
Misc: Math, Quotation_Mark, Dash, Hyphen, STerm, Terminal_Punctuation, Diacritic, Extender, Grapheme_Base, Grapheme_Extend, Grapheme_Link, Unicode_1_Name, ISO_Comment
Contributory Properties: Other_Alphabetic, Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend, Other_ID_Start, Other_ID_Continue, Other_Lowercase, Other_Math, Other_Uppercase, Jamo_Short_Name

 

Property and Property Value Matching

Properties and property values may have multiple aliases, such as abbreviated names and longer, more descriptive names. For example, one can write either Line_Break or LB for the Line Break property, and either OP or Open_Punctuation for one of its values. When matching property names and values, it is strongly recommended that all aliases in the UCD be recognized, and that loose matching should be applied to all property names and property values according to the following:

For a general discussion of Unicode character properties, see UTR #23: The Unicode Character Property Model [UTR23].

Numeric Properties

For all numeric properties, and properties such as Unicode_Radical_Stroke that are combinations of numeric values, use the following loose matching rule:

LM1. Apply numeric equivalences

Character Names

LM2. Ignore case, whitespace, underscore ('_'), and all medial hyphens except the hyphen in U+1180.

Others

For all property names, property value names, and for property values for Enumerated, Binary, or Catalog properties, use the following loose matching rule:

LM3. Ignore case, whitespace, underscore ('_'), and hyphens.

Otherwise loose matching should not be done for the property values of String properties, as case distinctions or other distinctions in those values may be significant.
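
Rules LM1 and LM3 can be sketched in Python as below. The function names are hypothetical, and LM2 is not covered here because its hyphen exception needs per-name handling. The sketch assumes numeric property values are written as integers, decimals, or fractions of the form "n/d".

```python
import re
from fractions import Fraction


def loose_key(name):
    """LM3: ignore case, whitespace, underscores, and hyphens."""
    return re.sub(r"[\s_\-]", "", name).lower()


def numeric_equal(a, b):
    """LM1: compare numeric property values by value, not by spelling."""
    return Fraction(a) == Fraction(b)


print(loose_key("Line_Break") == loose_key("line break") == loose_key("LINE-BREAK"))
print(loose_key("Open_Punctuation"))
print(numeric_equal("1.0", "1"), numeric_equal("2/4", "1/2"))
```

Comparing normalized keys rather than raw strings lets a lookup table of aliases from PropertyAliases.txt and PropertyValueAliases.txt be indexed once and queried with any accepted spelling.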

Property Invariants

Values in the UCD are subject to correction as errors are found; however, some characteristics of the properties and files are considered invariants. Applications may wish to take these invariants into account when choosing how to implement character properties. All formally guaranteed invariants of property values are described in Unicode Policies. The following lists some additional invariants regarding file organization and more detail on a few of the invariants in the Unicode Policies.

UnicodeData Fields

Combining Classes

Decimal Digits

Property Values

The following gives a summary of property values for certain properties. Other property values are documented in other locations; for example, the line breaking property values are documented in UAX #14: Line Breaking Properties [Line].

General Category Values

The General_Category property of a code point provides the most basic classification of that code point. It is usually determined based on the primary characteristic of the assigned character for that code point: for example, whether it is a letter, a mark, a number, punctuation, or a symbol, and if so, of what type. Many characters have multiple uses, and not all such cases can be captured entirely by the General_Category value. For more information, see Chapter 4 in [Unicode].

The values in the General_Category field in UnicodeData.txt are abbreviations for the longer descriptions enumerated in the table below.

Abbr. Description

Lu Letter, Uppercase
Ll Letter, Lowercase
Lt Letter, Titlecase
Lm Letter, Modifier
Lo Letter, Other
Mn Mark, Nonspacing
Mc Mark, Spacing Combining
Me Mark, Enclosing
Nd Number, Decimal Digit
Nl Number, Letter
No Number, Other
Pc Punctuation, Connector
Pd Punctuation, Dash
Ps Punctuation, Open
Pe Punctuation, Close
Pi Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
Pf Punctuation, Final quote (may behave like Ps or Pe depending on usage)
Po Punctuation, Other
Sm Symbol, Math
Sc Symbol, Currency
Sk Symbol, Modifier
So Symbol, Other
Zs Separator, Space
Zl Separator, Line
Zp Separator, Paragraph
Cc Other, Control
Cf Other, Format
Cs Other, Surrogate
Co Other, Private Use
Cn Other, Not Assigned (no characters in the file have this property)

Note: The term "L&" is used to stand for Uppercase, Lowercase or Titlecase letters (Lu, Ll, or Lt) in comments. The LC value in PropertyValueAliases.txt also stands for Uppercase, Lowercase or Titlecase letters.
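
As an illustration of the abbreviations and of the "L&" grouping, the following sketch uses Python's standard unicodedata module, which reports General_Category values for the Unicode version bundled with the interpreter (not necessarily 5.1.0). The helper name and the constant are hypothetical.

```python
import unicodedata

# The group written "L&" in comments, or LC in PropertyValueAliases.txt.
CASED_LETTER = {"Lu", "Ll", "Lt"}


def is_cased_letter(ch):
    """True if ch is an uppercase, lowercase, or titlecase letter."""
    return unicodedata.category(ch) in CASED_LETTER


for ch in ("A", "a", "\u01C5", "1", " "):
    print(hex(ord(ch)), unicodedata.category(ch), is_cased_letter(ch))
```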

Note: The Unicode Standard does not assign information to control characters (except for certain cases). Implementations will generally also assign categories to certain control characters, notably CR and LF, according to platform conventions. See Section 5.8 "Newline Guidelines" in [Unicode] for more information.

Bidi Class Values

Please refer to UAX #9: The Bidirectional Algorithm [BIDI] for an explanation of the algorithm for Bidirectional Behavior and an explanation of the significance of these categories.

Type Description

L Left-to-Right
LRE Left-to-Right Embedding
LRO Left-to-Right Override
R Right-to-Left
AL Right-to-Left Arabic
RLE Right-to-Left Embedding
RLO Right-to-Left Override
PDF Pop Directional Format
EN European Number
ES European Number Separator
ET European Number Terminator
AN Arabic Number
CS Common Number Separator
NSM Non-Spacing Mark
BN Boundary Neutral
B Paragraph Separator
S Segment Separator
WS Whitespace
ON Other Neutrals

 

Character Decomposition Mapping

The tags supplied with certain decomposition mappings generally indicate formatting information. Where no such tag is given, the mapping is canonical. Conversely, the presence of a formatting tag also indicates that the mapping is a compatibility mapping and not a canonical mapping. In the absence of other formatting information in a compatibility mapping, the tag is used to distinguish it from canonical mappings.

In some instances a canonical mapping or a compatibility mapping may consist of a single character. For a canonical mapping, this indicates that the character is a canonical equivalent of another single character. For a compatibility mapping, this indicates that the character is a compatibility equivalent of another single character. The compatibility formatting tags used are:

Tag Description

<font>   A font variant (e.g. a blackletter form).
<noBreak>   A no-break version of a space or hyphen.
<initial>   An initial presentation form (Arabic).
<medial>   A medial presentation form (Arabic).
<final>   A final presentation form (Arabic).
<isolated>   An isolated presentation form (Arabic).
<circle>   An encircled form.
<super>   A superscript form.
<sub>   A subscript form.
<vertical>   A vertical layout presentation form.
<wide>   A wide (or zenkaku) compatibility character.
<narrow>   A narrow (or hankaku) compatibility character.
<small>   A small variant form (CNS compatibility).
<square>   A CJK squared font variant.
<fraction>   A vulgar fraction form.
<compat>   Otherwise unspecified compatibility character.

Reminder: There is a difference between decomposition and decomposition mapping. The decomposition mappings are defined in UnicodeData.txt, while the decomposition (also termed "full decomposition") is defined in Chapter 3 to use those mappings recursively.
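
Separating the tag from the mapping in field 5 of UnicodeData.txt can be sketched as follows. The function name and the returned labels are hypothetical; "canonical" follows the convention of DerivedDecompositionType.txt for untagged mappings, and an empty field is reported as "none" with an empty list, leaving the caller to substitute the code point itself.

```python
def parse_decomposition(field):
    """Split field 5 into (decomposition type, list of code points)."""
    if not field:
        return "none", []                  # no mapping listed for this character
    parts = field.split()
    if parts[0].startswith("<"):
        kind = parts[0].strip("<>")        # compatibility formatting tag
        parts = parts[1:]
    else:
        kind = "canonical"                 # no tag means a canonical mapping
    return kind, [int(p, 16) for p in parts]


print(parse_decomposition("<fraction> 0031 2044 0035"))
print(parse_decomposition("0041 0300"))
print(parse_decomposition(""))
```

This returns a single, non-recursive mapping only; applying it repeatedly is what produces the full decomposition.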

The normalization of Hangul conjoining jamos and of Hangul syllables depends on algorithmic mapping, as specified in Section 3.12, Conjoining Jamo Behavior in [Unicode]. That algorithm specifies the full decomposition of all precomposed Hangul syllables, but effectively it is equivalent to the recursive application of pairwise decomposition mappings, as for all other Unicode characters. Formally, the Decomposition_Mapping (dm) property value for a Hangul syllable is the pairwise decomposition and not the full decomposition.

Each character with the Hangul_Syllable_Type value LVT will have a decomposition mapping consisting of a character with an LV value and a character with a T value. Thus for U+CE31 the decomposition mapping is <U+CE20, U+11B8>, and not <U+110E, U+1173, U+11B8>.
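The pairwise mapping can be computed arithmetically. The following Python sketch is one possible formulation, using the constants from the Hangul algorithm in Section 3.12; the function names are hypothetical and chosen for illustration only.

```python
# Constants from the Hangul syllable composition/decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588
S_COUNT = L_COUNT * N_COUNT   # 11172

def hangul_decomposition_mapping(cp):
    """Pairwise mapping: an LVT syllable maps to (LV, T), an LV syllable to (L, V)."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    t_index = s_index % T_COUNT
    if t_index:
        # LVT: strip the trailing consonant to get the LV syllable.
        return (cp - t_index, T_BASE + t_index)
    # LV: split into leading consonant and vowel.
    return (L_BASE + s_index // N_COUNT,
            V_BASE + (s_index % N_COUNT) // T_COUNT)

def hangul_full_decomposition(cp):
    """Full decomposition, obtained by applying the pairwise mapping recursively."""
    first, second = hangul_decomposition_mapping(cp)
    if S_BASE <= first < S_BASE + S_COUNT:
        return hangul_decomposition_mapping(first) + (second,)
    return (first, second)
```

With these definitions, `hangul_decomposition_mapping(0xCE31)` gives the pair <U+CE20, U+11B8>, and `hangul_full_decomposition(0xCE31)` gives <U+110E, U+1173, U+11B8>, matching the example above.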

Canonical Combining Class Values

Value: Description

0: Spacing, split, enclosing, reordrant, and Tibetan subjoined
1: Overlays and interior
7: Nuktas
8: Hiragana/Katakana voicing marks
9: Viramas
10: Start of fixed position classes
199: End of fixed position classes
200: Below left attached
202: Below attached
204: Below right attached
208: Left attached (reordrant around single base character)
210: Right attached
212: Above left attached
214: Above attached
216: Above right attached
218: Below left
220: Below
222: Below right
224: Left (reordrant around single base character)
226: Right
228: Above left
230: Above
232: Above right
233: Double below
234: Double above
240: Below (iota subscript)

Note: some of the combining classes in this list do not currently have members but are specified here for completeness.

Decompositions and Normalization

Decomposition is specified in Chapter 3. UAX #15: Unicode Normalization Forms [Norm] specifies the interaction between decomposition and normalization. That report specifies how the decompositions defined in UnicodeData.txt are used to derive normalized forms of Unicode text.

Note that as of the 2.1.9 update of the Unicode Character Database, the decompositions in the UnicodeData.txt file can be used to recursively derive the full decomposition in canonical order, without the need to separately apply canonical reordering. However, canonical reordering of combining character sequences must still be applied in decomposition when normalizing source text which contains any combining marks.
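Canonical reordering itself can be sketched as a stable sort of each run of characters with a non-zero Canonical_Combining_Class. The following Python example is an illustration only: it relies on `unicodedata.combining` from the Python standard library (which reflects the Unicode version bundled with the interpreter), and the function name `canonical_reorder` is hypothetical.

```python
import unicodedata

def canonical_reorder(text):
    """Stable-sort each run of non-zero combining class characters by class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch):
            # Part of a run of combining marks; defer until the run ends.
            run.append(ch)
        else:
            # A character with combining class 0 ends the current run.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)
```

Because Python's `sorted` is stable, marks with the same combining class keep their relative order, as canonical ordering requires. For example, in the sequence <U+0061, U+0301, U+0323> the dot below (class 220) is moved ahead of the acute accent (class 230).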

The QuickCheck property values are as follows:

Property          Value   Description
NF*_QC            No      Characters that cannot ever occur in the respective normalization form. See Decompositions and Normalization.
NFC_QC, NFKC_QC   Maybe   Characters that may occur in the respective normalization form, depending on the context. See Decompositions and Normalization.
NF*_QC            Yes     All other characters. This is the default value, and is not listed for individual characters or ranges in the file.


For more information, see Section 14 in UAX #15: Unicode Normalization Forms [Norm].
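Because Yes is the default and is not listed, a reader of the data file has to supply it for unlisted code points. The Python sketch below shows one way this might be done. It assumes semicolon-separated lines of the form `range ; property ; value` with `#` comments and the abbreviated values N and M; the sample lines are illustrative only and should be checked against the actual data file, and the function names are hypothetical.

```python
SAMPLE = """\
# Illustrative lines only; consult the actual data file for real values.
0340..0341    ; NFC_QC; N
0300..0304    ; NFC_QC; M
"""

def load_quick_check(lines, prop):
    """Collect the explicitly listed values of one QuickCheck property."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 3 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            table[cp] = fields[2]
    return table

def quick_check_value(table, cp):
    """Unlisted code points take the default value Yes (abbreviated Y)."""
    return table.get(cp, "Y")
```

A caller would load the table once with `load_quick_check(SAMPLE.splitlines(), "NFC_QC")` and then query individual code points with `quick_check_value`.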

Case Mappings

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII. For more information, see Chapter 3 in Unicode 5.0.

For compatibility with existing parsers, UnicodeData.txt only contains case mappings for characters where they are one-to-one mappings; it also omits information about context-sensitive case mappings. Information about these special cases can be found in a separate data file, SpecialCasing.txt.
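The split between the two files can be illustrated with a short Python sketch. The sample lines below are written in the style of UnicodeData.txt and SpecialCasing.txt and should be verified against the actual files; the field positions (uppercase mapping in field 12 of UnicodeData.txt; code, lower, title, upper, and optional condition list in SpecialCasing.txt) and the function names are assumptions made for this example.

```python
UNICODE_DATA_SAMPLE = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;",
]
SPECIAL_CASING_SAMPLE = [
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S",
]

def simple_uppercase(line):
    """Return (code point, simple uppercase mapping) from a UnicodeData-style line.

    An empty field means the character maps to itself.
    """
    fields = line.split(";")
    cp = int(fields[0], 16)
    upper = fields[12]
    return cp, (int(upper, 16) if upper else cp)

def special_uppercase(line):
    """Return (code point, full uppercase mapping, condition) from a SpecialCasing-style line."""
    data = line.split("#", 1)[0]               # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    cp = int(fields[0], 16)
    upper = [int(x, 16) for x in fields[3].split()]
    condition = fields[4] if len(fields) > 4 else ""
    return cp, upper, condition
```

In this sample, U+00DF has no one-to-one uppercase mapping in the UnicodeData-style line, so the simple mapping is the character itself, while the SpecialCasing-style line supplies the one-to-many mapping <U+0053, U+0053>.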

Unihan Tags

A large number of properties specific to Han ideographs are contained in the Unihan Database, where they are called Unihan tags. The Unihan.txt file is described in Unihan.html.

Validating Property Values

Binary properties are expressed in the Unicode files with the values:

Value   Abbr   Alias   Abbr
Yes     Y      True    T
No      N      False   F
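A parser might accept all four spellings of each value. The following Python sketch is one possible approach; the function name `parse_binary_value` is hypothetical, and the sketch assumes exact-case matching after trimming whitespace, which may be stricter than some implementations need.

```python
# The four accepted spellings of each binary value, from the table above.
_TRUE_VALUES = {"Yes", "Y", "True", "T"}
_FALSE_VALUES = {"No", "N", "False", "F"}

def parse_binary_value(text):
    """Map a binary property value string to a Python bool."""
    value = text.strip()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    raise ValueError(f"not a binary property value: {text!r}")
```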

The property values for strings and catalog values as expressed in the UCD files can be validated by using the following regular expressions. These expressions use Perl syntax, but may be translated for use with other regular expression engines. The last column lists the default values for these properties.

Regular Expressions for Property Values
Abbr Name Regex for Allowable Values Defaults for Unlisted Values
age Age /([0-9]+\.[0-9]|unassigned)/ unassigned
nv Numeric_Value /-?[0-9]+\.[0-9]+/ Field 2 NaN
/-?[0-9]+(\/[0-9]+)?/ Field 3
blk Block /[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*/ No_Block
sc Script Unknown (Zzzz)
dm Decomposition_Mapping /[\x{0}-\x{10FFFF}]+/ The code point itself, but # can be used to represent that in certain circumstances.
FC_NFKC FC_NFKC_Closure
cf Case_Folding /[\x{0}-\x{10FFFF}]+/
lc Lowercase_Mapping
tc Titlecase_Mapping
uc Uppercase_Mapping
sfc Simple_Case_Folding /[\x{0}-\x{10FFFF}]/
slc Simple_Lowercase_Mapping
stc Simple_Titlecase_Mapping
suc Simple_Uppercase_Mapping
bmg Bidi_Mirroring_Glyph /[\x{0}-\x{10FFFF}]?/ ""
isc ISO_Comment /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\)?/
na1 Unicode_1_Name /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*(\ \((CR|FF|LF|NEL)\))?)?/

Null or the empty string is the default for these property values; however, in files the following can be used:
<reserved>, <control>, <private-use>, <surrogate>, <noncharacter>

The code point can also appear, in a form like <private-use-E000>. In some circumstances, such as a compact XML format, # can be used to stand for the code point to allow for name sharing.

na Name /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\)?/
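As an example of translating these expressions for another engine, the following Python sketch ports three of the simpler patterns (Age, the Field 2 form of Numeric_Value, and Block) to the standard `re` module. The `PATTERNS` table and the `is_valid` helper are hypothetical names, and the sketch assumes that a value must match the whole pattern, which is why `fullmatch` is used.

```python
import re

# Perl patterns from the table, rewritten for Python's re module.
# The Perl escape "\ " for a literal space is written as a plain space.
PATTERNS = {
    "age": re.compile(r"([0-9]+\.[0-9]|unassigned)"),
    "nv": re.compile(r"-?[0-9]+\.[0-9]+"),          # Field 2 form only
    "blk": re.compile(r"[a-zA-Z0-9]+([_ ][a-zA-Z0-9]+)*"),
}

def is_valid(prop, value):
    """Return True if the entire value matches the pattern for the property."""
    return PATTERNS[prop].fullmatch(value) is not None
```

For instance, `is_valid("age", "5.1")` and `is_valid("blk", "Basic_Latin")` succeed, while `is_valid("age", "5")` fails because the pattern requires a minor version digit.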

 

References

[BIDI] UAX #9: The Bidirectional Algorithm
Latest Version:
http://www.unicode.org/reports/tr9/
5.1.0 version:
http://www.unicode.org/reports/tr9/tr9-18.html
[Breaks] UAX #29: Text Boundaries
Latest Version:
http://www.unicode.org/reports/tr29/
5.1.0 version:
http://www.unicode.org/reports/tr29/tr29-13.html
[FAQ] Unicode Frequently Asked Questions
http://www.unicode.org/faq/
For answers to common questions on technical issues.
[Glossary] Unicode Glossary
http://www.unicode.org/glossary/
For explanations of terminology used in this and other documents.
[Line] UAX #14: Line Breaking Properties
Latest Version:
http://www.unicode.org/reports/tr14/
5.1.0 version:
http://www.unicode.org/reports/tr14/tr14-22.html
[Norm] UAX #15: Unicode Normalization Forms
Latest Version:
http://www.unicode.org/reports/tr15/
5.1.0 version:
http://www.unicode.org/reports/tr15/tr15-29.html
[Pattern] UAX #31: Identifier and Pattern Syntax
Latest Version:
http://www.unicode.org/reports/tr31/
5.1.0 version:
http://www.unicode.org/reports/tr31/tr31-9.html
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[Scripts] UAX #24: Script Names
Latest Version:
http://www.unicode.org/reports/tr24/
5.1.0 version:
http://www.unicode.org/reports/tr24/tr24-11.html
[U5.0] The Unicode Standard Version 5.0
http://www.unicode.org/versions/Unicode5.0.0/
[U5.1.0] The Unicode Standard Version 5.1.0
http://www.unicode.org/versions/Unicode5.1.0/
[UTR23] The Unicode Character Property Model
http://www.unicode.org/reports/tr23/
[Versions] Versions of the Unicode Standard
http://www.unicode.org/versions/
For details on the precise contents of each version of the Unicode Standard, and how to cite them.
[Width] UAX #11: East Asian Width
Latest Version:
http://www.unicode.org/reports/tr11/
5.1.0 version:
http://www.unicode.org/reports/tr11/tr11-16.html


Modification History

This section provides a summary of the changes between update versions of the Unicode Standard. The modifications prior to Unicode 4.0 only listed changes in UnicodeData.txt. From 4.0 onward, the consolidated modifications include the changes in other files.

Unicode 5.1.0

This document:

Common file changes:

TBD

Changes in specific files:

TBD

Unicode 5.0.0

This document:

Common file changes:

In many data files an explicit default property assignment range was added (in a machine-readable comment line), to assist implementations in assigning values for code points not otherwise listed in the data file.

Changes in specific files:

In some of the following entries, references are made to a Public Review Issue (PRI). See http://www.unicode.org/review/resolved-pri.html for more information about those cases.

Appropriate data files were updated to include the 1369 new characters added in Unicode 5.0.

Two new data files, NameAliases.txt and NamedSequencesProv.txt, were added to the UCD.

Unicode 4.1.0

This document:

Common file changes:

All remaining files not corrected for Unicode 4.0.1 have had their headers updated to explicitly point to Terms of Use. The headers have also been synchronized somewhat to share a more common format for file version, date, and pointers to documentation. The major exception is UnicodeData.txt, which for legacy reasons, has no header.

Changes in specific files:

In some of the following, reference is made to a Public Review Issue (PRI). See http://www.unicode.org/review/resolved-pri.html for more information about those cases.

Appropriate data files were updated to include the 1273 new characters added in Unicode 4.1.

The description of the Unihan properties was separated out from UCD.html, and extensively revised, and now appears in Unihan.html.

An auxiliary directory has been added. In 4.1.0 it contains properties associated with UAX #29: Text Boundaries [Breaks].

Unicode 4.0.1

This document:

Common file changes:

Some property values have different casing (upper vs. lower) for consistency between the data files and the PropertyValueAlias file. There are some additional changes in comments:

Changes in specific files:

In some of the following, reference is made to a Public Review Issue (PRI). See http://www.unicode.org/review/resolved-pri.html for more information about those cases.

Unicode 4.0

Unicode 3.2

Modifications made for Version 3.2.0 of UnicodeData.txt include:

Unicode 3.1.1

Modifications made for Version 3.1.1 of UnicodeData.txt include:

Unicode 3.1

Modifications made for Version 3.1.0 of UnicodeData.txt include:

Unicode 3.0.1

Modifications made for Version 3.0.1 of UnicodeData.txt include:

Unicode 3.0.0

Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and a number of property changes. These are summarized in Appendix D of The Unicode Standard, Version 3.0.

Unicode 2.1.9

Modifications made for Version 2.1.9 of UnicodeData.txt include:

Unicode 2.1.8

Modifications made for Version 2.1.8 of UnicodeData.txt include:

Version 2.1.7

This version was for internal change tracking only, and never publicly released.

Version 2.1.6

This version was for internal change tracking only, and never publicly released.

Unicode 2.1.5

Modifications made for Version 2.1.5 of UnicodeData.txt include:

Version 2.1.4

This version was for internal change tracking only, and never publicly released.

Version 2.1.3

This version was for internal change tracking only, and never publicly released.

Unicode 2.1.2

Modifications made in updating UnicodeData.txt to Version 2.1.2 for the Unicode Standard, Version 2.1 (from Version 2.0) include:

Version 2.1.1

This version was for internal change tracking only, and never publicly released.

Unicode 2.0.0

The modifications made in updating UnicodeData.txt for the Unicode Standard, Version 2.0 include:

UCD Terms of Use

For terms of use, see http://www.unicode.org/terms_of_use.html.

