DRAFT Unicode Technical Report #15

Unicode Normalization Forms

Revision 10
Authors Mark Davis (mark@unicode.org)
Date 1998-12-16
This Version http://www.unicode.org/unicode/reports/tr15/tr15-10.html
Previous Version http://www.unicode.org/unicode/reports/tr15/tr15-9.html
Latest Version http://www.unicode.org/unicode/reports/tr15

Summary

This document describes specifications for four normalized forms of Unicode text. The design is in public review phase. We welcome review feedback.

Status of this document

This draft is published for review purposes. Previous versions of this draft have been considered by the Unicode Technical Committee, but no final decision has been reached. At its next meeting, the Unicode Technical Committee may approve, reject, or further amend this document.

In particular, the precise version of the character database referenced in the text may change. The current deadline for setting the version is March, 1999. It is expected that the version will change from 2.1.8 to 3.0.

The content of this technical report must be understood in the context of the latest version of the Unicode Standard, Version 2.1. This version is defined by the Unicode Standard 2.0 book, plus the online update to Version 2.1 on http://www.unicode.org/unicode/reports/tr8.html. Corrections and extensions to the Unicode Standard will also be found in the following areas:

This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to unicore@unicode.org.

Introduction

The Unicode Standard Version 2.1 describes several forms of normalization in Section 5.9. Two of these forms are precisely specified in Section 3.9. In particular, the standard defines a canonical decomposition format, which can be used as a normalization for interchanging text. This format allows for binary comparison while maintaining canonical equivalence with the original unnormalized text.

The standard also defines a compatibility decomposition format, which allows for binary comparison while maintaining compatibility equivalence with the original unnormalized text. The latter can also be useful in many circumstances, since it levels the differences between compatibility characters which are inappropriate in those circumstances. For example, the half-width and full-width katakana characters will have the same compatibility decomposition and are thus compatibility equivalents; however, they are not canonical equivalents.

Both of these formats are normalizations to decomposed characters. While Section 3.9 also discusses a normalization to composite characters (also known as decomposable or precomposed characters), it does not precisely specify the format. Because of the nature of the precomposed forms in the Unicode Standard, there is more than one possible specification for a normalized form with composite characters. This document provides a unique specification for those forms, and a label for each normalized form.

This document is divided into the following sections:

Labels

The four normalization forms are labeled as follows.

Title                   Description                                                     Specification
Normalization Form D    Canonical Decomposition                                         Sections 3.6 and 3.9 of The Unicode Standard
Normalization Form DC   Compatibility Decomposition                                     Sections 3.6 and 3.9 of The Unicode Standard
Normalization Form C    Canonical Decomposition, followed by Canonical Composition      see below
Normalization Form CC   Compatibility Decomposition, followed by Canonical Composition  see below

As with decomposition, there are two forms of normalization to composite characters, Form C and Form CC. The difference between these depends on whether the resulting text is to be a canonical equivalent to the original unnormalized text, or is to be a compatibility equivalent to the original unnormalized text. Both types of normalization can be useful in different circumstances.

Normalization Form C is basically the form of text which uses canonical composite characters where possible, and maintains the distinction between characters that are compatibility equivalents. Implementations of Unicode which restrict themselves to a repertoire containing no combining marks and which declare themselves to be implementations at Level 1 as defined in ISO/IEC 10646-1 are already using Normalization Form C.

Normalization Form CC levels the differences between compatibility characters which are inappropriately distinguished in many circumstances. For example, the half-width and full-width katakana characters will normalize to the same strings, as will Roman numerals and their letter equivalents. More complete examples are provided below. However, information is lost when text is transformed into Normalization Form CC, so it is not recommended for all circumstances.

To summarize the treatment of compatibility characters that were in the source text:

Notation

We will use the following notation for brevity.

* Unicode names are shortened, such as the following:
      E-grave   LATIN CAPITAL LETTER E WITH GRAVE
      ka        KATAKANA LETTER KA
      hw_ka     HALFWIDTH KATAKANA LETTER KA
      ten       COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
      hw_ten    HALFWIDTH KATAKANA VOICED SOUND MARK
* A sequence of characters may be represented by using plus signs between the character names, or by using string notation.
* "...\uXXXX..." represents the Unicode character U+XXXX embedded within a string.
* A single character which is equivalent to the sequence of characters B + C may be written as B-C.
* The normalization forms for a string X can be abbreviated as ND(X), NDC(X), NC(X) and NCC(X), respectively.
* Conjoining jamo of various types (initial, medial, final) are represented by the subscripts i, m, and f, shown inline here as ki, am, and kf.

Definitions

In the Unicode Character Database, two characters may have the same canonical decomposition. Here is an example of this:

Source                                                 Decomposition
212B ('Å' ANGSTROM SIGN)                           =>  0041 ('A' LATIN CAPITAL LETTER A) + 030A ('°' COMBINING RING ABOVE)
00C5 ('Å' LATIN CAPITAL LETTER A WITH RING ABOVE)  =>  0041 ('A' LATIN CAPITAL LETTER A) + 030A ('°' COMBINING RING ABOVE)

However, in such cases, the Unicode Character Database will first decompose one of the characters to the other, and then decompose from there. That is, one of the characters (in this case ANGSTROM SIGN) will have a single-character decomposition, which indicates that the character was only encoded for compatibility with existing standards. This single-character decomposition is used to resolve ambiguity when composing.
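As an illustration only, the two mappings can be inspected with Python's standard unicodedata module. This is a sketch under an assumption: the module reflects whatever Unicode version the Python build ships with, not the 2.1.8 data file referenced in this draft. The helper name raw_mapping is hypothetical.

```python
import unicodedata

def raw_mapping(code_point):
    """Return the non-recursive decomposition field for one code point."""
    return unicodedata.decomposition(chr(code_point))

# ANGSTROM SIGN maps to a single character; A WITH RING ABOVE maps to a pair.
for cp in (0x212B, 0x00C5):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))} => {raw_mapping(cp)}")
```

The single-character mapping of 212B is what marks it as the non-preferred form when composing.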

  1. A primary composite is a character that has a canonical decomposition in the Unicode Character Database, where that decomposition starts with a character of canonical combining class zero, and is not a single character (before recursive decomposition).

    Note to reviewers: before this document is made final, it is likely that additional characters need to be excluded from being primary composites, such as FB31 HEBREW LETTER BET WITH DAGESH. This would be to match common practice for scripts that use fully decomposed forms. If this step is taken, then an additional data table will be added to list the excluded characters.
     
  2. Given a sequence of characters S = <C0, C1,...,Cn,Cn+1>, the character Cn+1 can be primary canonically combined with C0 if there is a primary composite X such that the sequence <X, C1,...,Cn> is canonically equivalent to S.

    In such a case, X is said to be the primary canonical composition of C0 and Cn+1.

Note that because of the definition of canonical equivalence in the Unicode Standard, the definition of primary canonical composition has the following implications:

Both of the above definitions are according to the rules for equivalence and decomposition found in Chapter 3 of The Unicode Standard, Version 2.0, and the decomposition mappings in the Unicode Character Database.
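The first definition can be read as a simple predicate over the character database. The following is a minimal sketch, assuming Python's unicodedata module as a stand-in for the Unicode Character Database; the function name is hypothetical, and the sketch applies the definition literally, with no list of excluded characters.

```python
import unicodedata

def is_primary_composite(ch):
    """Apply definition 1 literally to a single character."""
    mapping = unicodedata.decomposition(ch)
    # No mapping at all, or a tagged (compatibility) mapping such as "<compat> ...".
    if not mapping or mapping.startswith("<"):
        return False
    parts = mapping.split()
    # A single-character decomposition is never a primary composite.
    if len(parts) < 2:
        return False
    # The decomposition must start with a character of combining class zero.
    first = chr(int(parts[0], 16))
    return unicodedata.combining(first) == 0
```

Under this literal reading, 00C5 is a primary composite and 212B is not. FB31 also qualifies, which is the kind of case the note to reviewers proposes to exclude.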

Note:

Decomposition must be done in accordance with these rules. In particular, the decompositions found in the Unicode Character Database must be applied recursively, and then put into canonical order.
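The two steps named here, recursive application of the mappings followed by canonical ordering, can be sketched as follows. This is an illustration rather than a reference implementation: it assumes Python's unicodedata module for the mappings and combining classes, the function name is hypothetical, and Hangul syllables (which have no mapping in the data file) are passed through unchanged.

```python
import unicodedata

def canonical_decompose(text):
    """Recursively apply canonical mappings, then canonically order the marks."""
    chars = []

    def expand(ch):
        mapping = unicodedata.decomposition(ch)
        # Tagged mappings ("<compat> ...") are compatibility mappings; skip them.
        if mapping and not mapping.startswith("<"):
            for code in mapping.split():
                expand(chr(int(code, 16)))
        else:
            chars.append(ch)

    for ch in text:
        expand(ch)

    # Canonical ordering: stable-sort each run of characters whose
    # combining class is non-zero, leaving class-zero characters in place.
    start = 0
    while start < len(chars):
        end = start
        while end < len(chars) and unicodedata.combining(chars[end]) != 0:
            end += 1
        if end > start:
            chars[start:end] = sorted(chars[start:end], key=unicodedata.combining)
            start = end
        else:
            start += 1
    return "".join(chars)
```

For example, D + dot_above + dot_below is reordered to D + dot_below + dot_above, because dot_below has the lower combining class.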

Hangul syllable decomposition is considered a canonical decomposition. See Technical Report #8: The Unicode Standard Version 2.1 (http://www.unicode.org/unicode/reports/tr8.html).
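Hangul syllable decomposition is arithmetic rather than table-driven. The sketch below uses the constants of the Unicode Standard's Hangul syllable arithmetic; the function and variable names are illustrative assumptions.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # syllables per initial consonant (588)
S_COUNT = L_COUNT * N_COUNT      # total precomposed syllables (11172)

def decompose_hangul(syllable):
    """Split one precomposed syllable into initial, medial and optional final jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable          # not a precomposed Hangul syllable
    initial = chr(L_BASE + s_index // N_COUNT)
    medial = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT  # zero means "no final consonant"
    final = chr(T_BASE + t_index) if t_index else ""
    return initial + medial + final
```

A syllable with no final consonant yields two jamo; all others yield three.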

Design Goals

Goal 1: Uniqueness

The first major design goal for the normalization forms is uniqueness: two equivalent strings will have precisely the same normalized form. This is a required goal.

  1. If a string X is a canonical equivalent of a string Y, then all of the following are true:
  2. If a string X is a compatibility equivalent of a string Y, then both of the following are true:
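The uniqueness goal can be spot-checked on small examples. The sketch below assumes that the forms Python's unicodedata module calls NFD, NFKD, NFC and NFKC behave as Forms D, DC, C and CC are described in this document; that correspondence, and the helper name same_in, are assumptions made for illustration.

```python
import unicodedata

# Assumed mapping from this document's labels to the names Python uses.
FORMS = {"D": "NFD", "DC": "NFKD", "C": "NFC", "CC": "NFKC"}

def same_in(label, x, y):
    """True if x and y have the same normalized form under the given label."""
    form = FORMS[label]
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

# Canonical equivalents: angstrom_sign and A + ring agree in all four forms.
print([same_in(label, "\u212B", "A\u030A") for label in FORMS])

# Compatibility equivalents: hw_ka and ka agree only in Forms DC and CC.
print([same_in(label, "\uFF76", "\u30AB") for label in FORMS])
```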

Goal 2: Stability

The second major design goal for the normalization forms is stability. This goal is highly desired, but not required.

  1. If X contains a character with a compatibility decomposition, then ND(X) and NC(X) still contain that character.
     
  2. If the only decomposable characters in X are primary and there are no combining characters, then NC(X) = X.

Note:

There are four exceptions to Goal 2.2 in the Unicode Standard Version 2.1, according to the definition below. Four new characters are being proposed to remedy this situation by the time the database version is fixed, in Unicode 3.0. These are:

0226 LATIN CAPITAL LETTER A WITH DOT ABOVE
0227 LATIN SMALL LETTER A WITH DOT ABOVE
0228 LATIN CAPITAL LETTER E WITH CEDILLA
0229 LATIN SMALL LETTER E WITH CEDILLA

Goal 3: Efficiency

The third major design goal for the normalization forms is efficiency. This goal is highly desired, but not required.

  1. It is possible to implement the Normalization Forms in an efficient manner. In particular, it should be possible to produce Normalization Form C quickly from strings that are either in Normalization Form D or already in Normalization Form C.

Versioning

Because additional composite characters may be added to future versions of the Unicode Standard, composition is less stable than decomposition. Therefore, it is necessary to specify a fixed version for the composition process, so that implementations can get the same result for normalization even if they upgrade to a new version of Unicode.

Note: Decomposition is only unstable if an existing character decomposition mapping changes. The Unicode Technical Committee has the policy of carefully reviewing proposed corrections in character decompositions, and only making changes where the benefits very clearly outweigh the drawbacks.

The fixed version of the composition process is defined by reference to a particular version of the Unicode Character Database, called the composition version. At this point, that version is specified to be the 2.1.8 version, the content of the file UnicodeData-2.1.8.txt (abbreviated as UCD2.1.8); however, the final version is expected to be Unicode 3.0. For more information, see:

To see what difference the composition version makes, suppose that Unicode version 4.0 adds the composite Q-caron. For an implementation that uses Unicode version 4.0, strings in Normalization Forms C or CC will continue to contain the sequence Q + caron, and not the new character Q-caron, since a canonical composition for Q-caron was not defined in the composition version.
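The effect of a frozen composition version can be sketched with two small tables. Everything here is hypothetical: Q-caron does not exist, so a private-use code point stands in for it, each table holds a single real entry (C + caron) for contrast, and the composer handles only adjacent pairs rather than the full process specified below.

```python
CARON = "\u030C"
Q_CARON = "\uE000"   # hypothetical stand-in: a private-use code point

# Frozen snapshot of composition pairs (the "composition version").
composition_version = {("C", CARON): "\u010C"}

# A later database that also knows the hypothetical Q-caron.
newer_version = dict(composition_version)
newer_version[("Q", CARON)] = Q_CARON

def compose_adjacent(text, table):
    """Combine each character with its predecessor when the table allows it."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in table:
            out[-1] = table[(out[-1], ch)]
        else:
            out.append(ch)
    return "".join(out)
```

Composing with composition_version leaves Q + caron as a sequence even on a system whose character data matches newer_version, which is the behavior this section requires.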

Conformance

A process that produces Unicode text that purports to be in a Normalization Form shall do so in accordance with the specifications in this document.

A process that tests Unicode text to determine whether it is in a Normalization Form shall do so in accordance with the specifications in this document.

Note: The specifications for Normalization Forms are written in terms of a process for producing a decomposition or composition from an arbitrary Unicode string. This is a logical description--particular implementations can have more efficient mechanisms as long as they produce the same result. Similarly, testing for a particular Normalization Form does not require applying the process of normalization, so long as the result of the test is equivalent to applying normalization and then testing bit-for-bit identity.
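The logical definition of the test, and one shortcut that gives the same answer, can be sketched as follows. The sketch assumes Python's unicodedata module and its form names ("NFC", "NFD", "NFKC", "NFKD") as stand-ins for the forms in this document; is_in_form is a hypothetical name.

```python
import unicodedata

def is_in_form(form, text):
    """Test text for a normalization form without always normalizing it."""
    # Shortcut: ASCII characters have no decompositions and no combining
    # marks, so pure-ASCII text is already in every normalization form.
    if text.isascii():
        return True
    # Logical definition: normalize, then compare for identity.
    return unicodedata.normalize(form, text) == text
```

The shortcut is permitted because its result is always the same as normalizing and comparing.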

Specification

Typical strings of composite accented Unicode characters are already in Normalization Form C. However, there are circumstances with possible ambiguities, which require the precise specification in this document.

Basically, the process of forming a composition in Normalization Form C or CC involves:

This is specified more precisely below. The specification is the logical description of the process--implementations are free to use more efficient algorithms as long as the result is the same.

Note: Normalization Form CC does not attempt to map characters to compatibility composites. For example, a compatibility composition of "office" does not produce "o\uFB03ce", even though "\uFB03" is a character that is the compatibility equivalent of the sequence of three characters 'ffi'.

The Normalization Form C for a string S is defined by the following process.

  1. Generate the canonical decomposition for the source string S according to the decomposition mappings in the latest supported version of the Unicode Character Database.
  2. Iterate through that decomposition character by character. Test each character C for primary canonical combination according to the decomposition mappings in the composition version of the Unicode Character Database.

The result of this process is a new string S' which is in Normalization Form C.
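The two steps can be sketched in Python as shown below. This is an illustration under several stated assumptions, not the normative process: step 1 is delegated to unicodedata.normalize("NFD", ...); the same character data serves as both the latest version and the composition version; the blocking test is the commonly used one (a mark cannot combine past an intervening mark of equal or higher combining class); Hangul is composed arithmetically; and no table of excluded characters is applied. All function names are hypothetical.

```python
import unicodedata

def build_primary_pairs():
    """Map (first, second) to each composite whose canonical mapping is a pair
    starting with a combining-class-zero character."""
    pairs = {}
    for cp in range(0x110000):
        mapping = unicodedata.decomposition(chr(cp))
        if not mapping or mapping.startswith("<"):
            continue
        parts = [chr(int(code, 16)) for code in mapping.split()]
        if len(parts) == 2 and unicodedata.combining(parts[0]) == 0:
            pairs[(parts[0], parts[1])] = chr(cp)
    return pairs

PAIRS = build_primary_pairs()

def hangul_pair(first, second):
    """Arithmetic composition of initial+medial or syllable+final jamo."""
    l_index, v_index = ord(first) - 0x1100, ord(second) - 0x1161
    if 0 <= l_index < 19 and 0 <= v_index < 21:
        return chr(0xAC00 + (l_index * 21 + v_index) * 28)
    s_index, t_index = ord(first) - 0xAC00, ord(second) - 0x11A7
    if 0 <= s_index < 11172 and s_index % 28 == 0 and 0 < t_index < 28:
        return chr(ord(first) + t_index)
    return None

def form_c(text):
    result = []
    starter = None                      # index in result of the last starter
    for ch in unicodedata.normalize("NFD", text):   # step 1
        cc = unicodedata.combining(ch)
        if starter is not None:         # step 2: try to combine with the starter
            adjacent = starter == len(result) - 1
            blocked = not adjacent and unicodedata.combining(result[-1]) >= cc
            first = result[starter]
            composite = PAIRS.get((first, ch)) or hangul_pair(first, ch)
            if composite and not blocked:
                result[starter] = composite
                continue
        if cc == 0:
            starter = len(result)
        result.append(ch)
    return "".join(result)
```

On the strings used in the examples of this document, the sketch agrees with Python's built-in NFC. It is not expected to agree for characters that later versions of the standard exclude from composition.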

The Normalization Form CC for a string S is defined by the following process.

  1. Generate the compatibility decomposition for the source string S according to the decomposition mappings in the latest supported version of the Unicode Character Database.
  2. Iterate through that decomposition character by character. Test each character C for primary canonical combination according to the decomposition mappings in the composition version of the Unicode Character Database.

The result of this process is a new string S' which is in Normalization Form CC.
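Form CC differs from Form C only in its first step. A minimal sketch, assuming that Python's "NFKD" and "NFC" correspond to compatibility decomposition and canonical composition as described here (form_cc is a hypothetical name):

```python
import unicodedata

def form_cc(text):
    """Compatibility decomposition followed by canonical composition."""
    decomposed = unicodedata.normalize("NFKD", text)   # step 1
    return unicodedata.normalize("NFC", decomposed)    # step 2

print(form_cc("\u00C4\uFB03n"))    # the ffi ligature becomes three letters
print(form_cc("Henry \u2163"))     # the Roman numeral becomes letters
print(form_cc("office"))           # letters are never composed into a ligature
```

Because composition is canonical only, "office" stays as six letters, consistent with the note above about compatibility composites.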

Examples

Normalization Form C Examples:

    Original                            Decomposed                          Composed
a   D-dot_above                         D + dot_above                       D-dot_above
b   D + dot_above                       D + dot_above                       D-dot_above
c   D-dot_below + dot_above             D + dot_below + dot_above           D-dot_below + dot_above
d   D-dot_above + dot_below             D + dot_below + dot_above           D-dot_below + dot_above
e   D + dot_above + dot_below           D + dot_below + dot_above           D-dot_below + dot_above
f   D + dot_above + ogonek + dot_below  D + ogonek + dot_below + dot_above  D-dot_below + ogonek + dot_above
g   E-macron-grave                      E + macron + grave                  E-macron-grave
h   E-macron + grave                    E + macron + grave                  E-macron-grave
i   E-grave + macron                    E + grave + macron                  E-grave + macron
j   angstrom_sign                       A + ring                            A-ring
k   A-ring                              A + ring                            A-ring
l   "Äffin"                             "A\u0308ffin"                       "Äffin"
m   "Ä\uFB03n"                          "A\u0308\uFB03n"                    "Ä\uFB03n"
n   "Henry IV"                          "Henry IV"                          "Henry IV"
o   "Henry \u2163"                      "Henry \u2163"                      "Henry \u2163"
p   ga                                  ka + ten                            ga
q   ka + ten                            ka + ten                            ga
r   hw_ka + hw_ten                      hw_ka + hw_ten                      hw_ka + hw_ten
s   ka + hw_ten                         ka + hw_ten                         ka + hw_ten
t   hw_ka + ten                         hw_ka + ten                         hw_ka + ten
u   kaks                                ki + am + ksf                       kaks

Notes

a, b: Both decomposed and precomposed canonical sequences produce the same result.

c-f: By the time we have gotten to dot_above, it cannot be combined with the base character. There may be intervening combining marks (see f), so long as the result of the combination is canonically equivalent.

g-i: Multiple combining characters are combined with successive base characters. Characters will not be combined (see i) if they would not be canonical equivalents because of their ordering.

j, k: Since Å (A-ring) is the preferred composite, it is the form produced for both characters.

l, m: The ffi_ligature (U+FB03) is not decomposed, since it has a compatibility mapping, not a canonical mapping. (See Normalization Form CC Examples.)

n, o: Similarly, the ROMAN NUMERAL FOUR (U+2163) is not decomposed.

p-t: Different compatibility equivalents of a single Japanese character will not result in the same string in Normalization Form C.

u: Hangul syllables are maintained.

Normalization Form CC Examples

Cases (a-k) above are the same in both Normalization Form C and CC, and are not repeated here.

    Original                            Decomposed                          Composed
l'  "Äffin"                             "A\u0308ffin"                       "Äffin"
m'  "Ä\uFB03n"                          "A\u0308ffin"                       "Äffin"
n'  "Henry IV"                          "Henry IV"                          "Henry IV"
o'  "Henry \u2163"                      "Henry IV"                          "Henry IV"
p'  ga                                  ka + ten                            ga
q'  ka + ten                            ka + ten                            ga
r'  hw_ka + hw_ten                      ka + ten                            ga
s'  ka + hw_ten                         ka + ten                            ga
t'  hw_ka + ten                         ka + ten                            ga
u'  kaks                                ki + am + ksf                       kaks

Notes

l', m': The ffi_ligature (U+FB03) is decomposed in Normalization Form CC (where it is not in Normalization Form C).

n', o': Similarly, the resulting strings here are identical in Normalization Form CC.

p'-t': Different compatibility equivalents of a single Japanese character will result in the same string in Normalization Form CC.

u': Hangul syllables are maintained (Unicode Version 2.1.8 and later!)

Copyright

Copyright © 1998-1998 Unicode, Inc. All Rights Reserved.

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.


Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/reports/