DRAFT Unicode Technical Report #15

Unicode Normalization Forms

Revision 9
Authors Mark Davis (mark@unicode.org)
Date 1998-11-23
This Version http://www.unicode.org/unicode/reports/tr15/tr15-9.html
Previous Version http://www.unicode.org/unicode/reports/tr15-8
Latest Version http://www.unicode.org/unicode/reports/tr15

Summary

This document describes specifications for four normalized forms of Unicode text. The design is in public review phase. We welcome review feedback.

Status of this document

This draft is published for review purposes. Previous versions of this draft have been considered by the Unicode Technical Committee, but no final decision has been reached. At its next meeting, the Unicode Technical Committee may approve, reject, or further amend this document.

In particular, the precise version of the character database referenced in the text may change. The current deadline for setting the version is March, 1999.

The content of this technical report must be understood in the context of the latest version of the Unicode Standard, Version 2.1. This version is defined by the Unicode Standard 2.0 book, plus the online update to Version 2.1 on http://www.unicode.org/unicode/reports/tr8.html. Corrections and extensions to the Unicode Standard will also be found in the following areas:

This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to unicore@unicode.org.


Introduction

The Unicode Standard Version 2.1 describes several forms of normalization in Section 5.9. Two of these forms are precisely specified in Section 3.9. In particular, the standard defines a canonical decomposition format, which can be used as a normalization for interchanging text. This format allows for binary comparison while maintaining canonical equivalence with the original unnormalized text.

The standard also defines a compatibility decomposition format, which allows for binary comparison while maintaining compatibility equivalence with the original unnormalized text. The latter can also be useful in many circumstances, since it levels the differences between compatibility characters which are inappropriate in those circumstances. For example, the half-width and full-width katakana characters will have the same compatibility decomposition and are thus compatibility equivalents; however, they are not canonical equivalents.
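As a rough illustration of the difference between the two decomposition formats, the following Python sketch uses the standard unicodedata module. It assumes that the module's "NFD" and "NFKD" forms can stand in for the canonical and compatibility decompositions described here. The module works from the character database bundled with the interpreter, which is not necessarily the version this report refers to, and the helper names are ours.

```python
import unicodedata

KA = "\u30AB"      # KATAKANA LETTER KA (the full-width form)
HW_KA = "\uFF76"   # HALFWIDTH KATAKANA LETTER KA


def canonical_decomposition(text: str) -> str:
    # Assumption: Python's "NFD" stands in for the canonical decomposition format.
    return unicodedata.normalize("NFD", text)


def compatibility_decomposition(text: str) -> str:
    # Assumption: Python's "NFKD" stands in for the compatibility decomposition format.
    return unicodedata.normalize("NFKD", text)


# The two characters are compatibility equivalents but not canonical equivalents.
canonically_equal = canonical_decomposition(KA) == canonical_decomposition(HW_KA)
compatibility_equal = compatibility_decomposition(KA) == compatibility_decomposition(HW_KA)
```

Under these assumptions, binary comparison of the compatibility decompositions treats the two characters as the same, while binary comparison of the canonical decompositions keeps them distinct.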

Both of these formats are normalizations to decomposed characters. While Section 3.9 also discusses a normalization to composite characters, it does not precisely specify the format. Because of the nature of the precomposed forms in the Unicode Standard, there is more than one possible specification for a normalized form with composite characters. This document provides a unique specification for those forms, and a label for each normalized form.


Labels

The four normalization forms are labeled as follows.

Title                   Description                                                      Specification
Normalization Form D    Canonical Decomposition                                          Sections 3.6 and 3.9 of The Unicode Standard
Normalization Form DC   Compatibility Decomposition                                      Sections 3.6 and 3.9 of The Unicode Standard
Normalization Form C    Canonical Decomposition, followed by Canonical Composition       see below
Normalization Form CC   Compatibility Decomposition, followed by Canonical Composition   see below

As with decomposition, there are two forms of normalization to composite characters, Form C and Form CC. The difference between these depends on whether the resulting text is to be a canonical equivalent to the original unnormalized text, or is to be a compatibility equivalent to the original unnormalized text. Both types of normalization can be useful in different circumstances.

Normalization Form C is essentially the form of text that keeps canonical composite characters and preserves the distinction between characters that are compatibility equivalents. Implementations of Unicode which restrict themselves to a repertoire containing no combining marks, and which declare themselves to be implementations at Level 1 as defined in ISO/IEC 10646-1, are for the most part already in Normalization Form C.

Normalization Form CC levels the differences between compatibility characters which are inappropriately distinguished in many circumstances. For example, the half-width and full-width katakana characters will have the same compatibility composition, as will Roman Numerals and their letter equivalents. More complete examples are provided below.

To summarize the treatment of compatibility characters that were in the source text:


Notation

We will use the following notation for brevity.

Unicode names are shortened, such as the following:
  E-grave   LATIN CAPITAL LETTER E WITH GRAVE
  ka        KATAKANA LETTER KA
  hw_ka     HALFWIDTH KATAKANA LETTER KA
  ten       COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
  hw_ten    HALFWIDTH KATAKANA VOICED SOUND MARK
A sequence of characters may be represented by using plus signs between the character names, or by using string notation.
"...\uXXXX..." represents the Unicode character U+XXXX embedded within a string.
A single character which is equivalent to the sequence of characters B + C may be written as B-C.
The normalization forms for a string X can be abbreviated as ND(X), NDC(X), NC(X) and NCC(X), respectively.
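Conjoining jamo of various types (initial, medial, final) are represented by subscripts, such as ki, am, and kf. In this plain-text rendering the subscript is written inline as a trailing i, m, or f: ki is an initial k, am is a medial a, and kf is a final k.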
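For readers who want to experiment with the notation, the shortened names can be expanded into actual strings. The sketch below is only an illustration: the table SHORT_NAMES and the function expand are hypothetical helpers of ours, and the lookup relies on Python's unicodedata.lookup and the character names in its bundled database.

```python
import unicodedata

# Shortened names from the notation list, mapped to full Unicode character names.
SHORT_NAMES = {
    "E-grave": "LATIN CAPITAL LETTER E WITH GRAVE",
    "ka": "KATAKANA LETTER KA",
    "hw_ka": "HALFWIDTH KATAKANA LETTER KA",
    "ten": "COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK",
    "hw_ten": "HALFWIDTH KATAKANA VOICED SOUND MARK",
}


def expand(notation: str) -> str:
    """Turn a sequence such as 'ka + ten' into the corresponding string."""
    names = [part.strip() for part in notation.split("+")]
    return "".join(unicodedata.lookup(SHORT_NAMES[name]) for name in names)
```

For example, expand("hw_ka + hw_ten") yields the two-character string "\uFF76\uFF9E".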


Definitions

  1. A preferred composite is a composite character whose decomposition in the Unicode Character Database is not a single character (before recursive decomposition).
  2. A canonical composite is a composite character that has a canonical decomposition.
     
  3. A compatibility composite is a composite character that has a compatibility decomposition.
     
  4. A primary composite is a preferred canonical composite whose canonical decomposition starts with a character of canonical combining class zero.
     
  5. Given a sequence of characters S = <C0, C1,...,Cn,Cn+1>, the character Cn+1 can be primary canonically combined with C0 if there is a primary composite X such that the sequence <X, C1,...,Cn> is canonically equivalent to S.

In such a case, X is said to be the primary canonical composition of C0 and Cn+1.

All of the above definitions are according to the rules for equivalence and decomposition found in Chapter 3 of The Unicode Standard, version 2.0, and the decomposition mappings in the Unicode Character Database.

Decomposition must be done in accordance with these rules. In particular, the decompositions found in the Unicode Character Database must be applied recursively, and then put into canonical order.

Hangul syllable decomposition is considered a canonical decomposition. See Technical Report #8: The Unicode Standard Version 2.1 (http://www.unicode.org/unicode/reports/tr8.html).
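Definitions 1 through 4 can be checked mechanically against a character database. The following Python sketch is one possible reading of them, not a normative test. It assumes the one-step mappings returned by unicodedata.decomposition (a tag such as "<compat>" marks a compatibility mapping) and the bundled database rather than the version fixed by this report. Hangul syllables are not covered, since their decomposition is not listed character by character. All function names are ours.

```python
import unicodedata


def _mapping(ch):
    """Return (tag, [characters]) for the one-step mapping of ch, or None."""
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None
    parts = raw.split()
    tag = parts.pop(0) if parts[0].startswith("<") else None
    return tag, [chr(int(p, 16)) for p in parts]


def is_canonical_composite(ch):
    m = _mapping(ch)
    return m is not None and m[0] is None


def is_compatibility_composite(ch):
    m = _mapping(ch)
    return m is not None and m[0] is not None


def is_preferred_composite(ch):
    # The mapping, before recursive decomposition, is not a single character.
    m = _mapping(ch)
    return m is not None and len(m[1]) > 1


def is_primary_composite(ch):
    # Preferred, canonical, and starting with a combining class zero character.
    if not (is_preferred_composite(ch) and is_canonical_composite(ch)):
        return False
    first = _mapping(ch)[1][0]
    return unicodedata.combining(first) == 0
```

With a current database, A-ring (U+00C5) is a primary composite, while angstrom_sign (U+212B) is a canonical composite that is not preferred, because its mapping is the single character A-ring.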


Design Goals

The first major design goal for the normalization forms is uniqueness: two equivalent strings will have the same normalization form. This is a required goal.

Goal 1: Uniqueness

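  1. If a string X is a canonical equivalent of a string Y, then all of the following are true: ND(X) = ND(Y), NDC(X) = NDC(Y), NC(X) = NC(Y), and NCC(X) = NCC(Y).
  2. If a string X is a compatibility equivalent of a string Y, then both of the following are true: NDC(X) = NDC(Y) and NCC(X) = NCC(Y).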
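The uniqueness goal can be spot-checked on small inputs. The sketch below is illustrative only. It assumes that Python's "NFD", "NFKD", "NFC", and "NFKC" forms correspond to Forms D, DC, C, and CC, and it uses the database bundled with the interpreter. The FORMS table and helper names are ours.

```python
import unicodedata

# Assumed correspondence between this report's labels and Python's form names.
FORMS = {"D": "NFD", "DC": "NFKD", "C": "NFC", "CC": "NFKC"}


def normalize(label: str, text: str) -> str:
    return unicodedata.normalize(FORMS[label], text)


def same_in(labels, x: str, y: str) -> bool:
    """True if x and y normalize to identical strings in every listed form."""
    return all(normalize(label, x) == normalize(label, y) for label in labels)


# angstrom_sign and A + ring are canonical equivalents.
canonical_pair = ("\u212B", "A\u030A")
# hw_ka and ka are compatibility equivalents only.
compatibility_pair = ("\uFF76", "\u30AB")
```

A check of this kind covers only the chosen samples, so it can refute the goal for a given implementation but cannot establish it.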

The second major design goal for the normalization forms is stability. This goal is highly desired, but not required.

Goal 2: Stability

  1. If X contains a compatibility composite, then ND(X) and NC(X) contain that composite.
     
  2. If the only composite characters in X are primary and there are no combining characters, then NC(X) = X.
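Both stability conditions can be observed on sample strings. The following sketch assumes Python's "NFD" and "NFC" as stand-ins for Forms D and C, with the interpreter's bundled database. The sample strings and helper names are our own choices.

```python
import unicodedata


def to_d(text: str) -> str:
    return unicodedata.normalize("NFD", text)


def to_c(text: str) -> str:
    return unicodedata.normalize("NFC", text)


FFI = "\uFB03"                        # ffi_ligature, a compatibility composite
ligature_text = "A\u0308" + FFI + "n"

# A-diaeresis, E-acute, n-tilde: only primary composites, no combining characters.
precomposed_text = "\u00C4\u00C9\u00F1"

keeps_ligature = FFI in to_d(ligature_text) and FFI in to_c(ligature_text)
unchanged = to_c(precomposed_text) == precomposed_text
```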

There are four exceptions to Goal 2.1 in the Unicode Standard Version 2.1, according to the definition below. Four new characters are being proposed to remedy this situation by the time the database version is fixed. These are:

0226 LATIN CAPITAL LETTER A WITH DOT ABOVE
0227 LATIN SMALL LETTER A WITH DOT ABOVE
0228 LATIN CAPITAL LETTER E WITH CEDILLA
0229 LATIN SMALL LETTER E WITH CEDILLA

The third major design goal for the normalization forms is efficiency. This goal is highly desired, but not required.

Goal 3: Efficiency

  1. It is possible to implement the Normalization Forms in an efficient manner. In particular, it should be possible to produce Normalization Form C quickly from strings that are in either Normalization Form D or C.
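One way an implementation might approach this goal is to avoid work for text that is already normalized. The sketch below is an assumption-laden illustration, not a required technique. It relies on unicodedata.is_normalized (available in Python 3.8 and later) and on "NFC" standing in for Form C. The function name to_form_c is ours.

```python
import unicodedata


def to_form_c(text: str) -> str:
    """Return text in Form C, skipping the full process when possible."""
    # ASCII characters have no decompositions and are never combining marks,
    # so pure ASCII text is already normalized.
    if text.isascii():
        return text
    # Quick test: text already in Form C is returned unchanged.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Slow path: run the full normalization.
    return unicodedata.normalize("NFC", text)
```

The two early returns hand back the input object itself, so callers processing mostly normalized text pay only for the check.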


Versioning

Because additional composite characters may be added to future versions of the Unicode standard, composition is less stable than decomposition. Therefore, it is necessary to specify a fixed version for the composition process, so that implementations can get the same result for normalization even if they upgrade to a new version of Unicode.

Decomposition is only unstable if an existing character decomposition mapping changes. The Unicode Technical Committee has the policy of carefully reviewing proposed corrections in character decompositions, and only making changes where the benefits very clearly outweigh the drawbacks.

The fixed version of the composition process is defined by reference to a particular version of the Unicode Character Database. At this point, that version is specified to be the 2.1.5 version, the content of the file UnicodeData-2.1.5.txt (abbreviated as UCD2.1.5). For more information, see:

For example, suppose that Unicode version 3.0 adds the composite H-caron. For an implementation that uses Unicode version 3.0, strings in Normalization Forms C or CC will continue to contain the sequence H + caron, and not the new character H-caron, since a canonical composition for H-caron is not defined in UCD2.1.5.
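The effect of pinning the composition data can be seen in a toy model. In the sketch below, both tables are tiny illustrative stand-ins of ours rather than real database extracts: PINNED plays the role of the fixed version and LATER plays the role of a hypothetical newer database that knows H-caron. The function handles only a base character followed directly by one mark, which is far simpler than the full process.

```python
from typing import Dict, Tuple

PairTable = Dict[Tuple[str, str], str]

# Illustrative subset of a composition table pinned to one database version.
PINNED: PairTable = {("E", "\u0300"): "\u00C8"}  # E + grave -> E-grave

# Hypothetical later table that additionally contains H-caron (U+021E).
LATER: PairTable = {**PINNED, ("H", "\u030C"): "\u021E"}


def compose_pairs(text: str, table: PairTable) -> str:
    """Combine each character with the one right after it when the table allows."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in table:
            out[-1] = table[(out[-1], ch)]
        else:
            out.append(ch)
    return "".join(out)
```

An implementation that always composes with PINNED produces H + caron regardless of which database it uses for everything else, so its output does not change when it upgrades.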


Process

Typical strings of composite accented Unicode characters are already in Normalization Form C. However, there are circumstances with possible ambiguities, which require the precise specification in this document.

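Logically, the process of forming a composition in Normalization Form C or CC involves:

  1. decomposing the text according to the canonical mappings (for Form C) or the compatibility mappings (for Form CC), and
  2. composing the resulting sequences according to the canonical mappings.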

This is the logical description of the process--implementations are free to use more efficient algorithms as long as the result is the same.

Normalization Form CC does not attempt to map characters to compatibility composites. For example, a compatibility composition of "office" does not produce "o\uFB03ce", even though "\uFB03" is a character that is the compatibility equivalent of the sequence of three characters 'ffi'.
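The "office" example can be tried directly. This sketch assumes Python's "NFKC" form as a stand-in for Normalization Form CC. The helper name to_cc is ours.

```python
import unicodedata


def to_cc(text: str) -> str:
    # Assumption: Python's "NFKC" stands in for Normalization Form CC.
    return unicodedata.normalize("NFKC", text)


plain = "office"
with_ligature = "o\uFB03ce"   # the same word spelled with the ffi_ligature

# The ligature is replaced by its three letters, never the other way around.
both_plain = to_cc(plain) == plain and to_cc(with_ligature) == plain
```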


Conformance

A process that produces Unicode text that purports to be in a Normalization Form shall do so in accordance with the specifications in this document.

A process that tests Unicode text to determine whether it is in a Normalization Form shall do so in accordance with the specifications in this document.

The specifications for Normalization Forms are written in terms of a process for producing a decomposition or composition from an arbitrary Unicode string. This is a logical description--particular implementations can have more efficient mechanisms as long as they produce the same result. Similarly, testing for a particular Normalization Form does not require applying the process of normalization, so long as the result of the test is equivalent to applying normalization and then testing bit-for-bit identity.
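The reference meaning of such a test can be written down in a few lines: normalize, then compare. The sketch below assumes Python's form names as stand-ins for the four labels and is meant as a baseline against which a faster test could be compared. The FORMS table and the function is_in_form are ours.

```python
import unicodedata

# Assumed correspondence between this report's labels and Python's form names.
FORMS = {"D": "NFD", "DC": "NFKD", "C": "NFC", "CC": "NFKC"}


def is_in_form(label: str, text: str) -> bool:
    """Reference test: apply normalization, then compare for identity."""
    return unicodedata.normalize(FORMS[label], text) == text
```

Any optimized test is acceptable under the conformance clause as long as it returns the same answer as this baseline for every input.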


Specification

The Normalization Form C for a string S is defined by the following process.

  1. Generate the canonical decomposition for the source string S according to the decomposition mappings in the latest supported version of the Unicode Character Database.
  2. Iterate through that decomposition character by character. Test each character C for primary canonical combination according to the decomposition mappings in version 2.1.5 of the Unicode Character Database.

The result of this process is a new string S' which is in Normalization Form C.
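The two steps can be sketched in Python as follows. This is our own illustrative reading, with several assumptions. Step 1 uses Python's "NFD". The composition table is built from the database bundled with the interpreter rather than version 2.1.5. Hangul syllables are not recomposed. The blocking test (a mark combines with the last class-zero character only if every intervening mark has a lower combining class) is our interpretation of Definition 5. The names build_primary_pairs, compose, and form_c are ours.

```python
import sys
import unicodedata


def build_primary_pairs():
    """Map (first, second) to the primary composite that decomposes to them."""
    pairs = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        if not mapping or mapping.startswith("<"):
            continue  # no mapping, or a compatibility mapping
        parts = [chr(int(p, 16)) for p in mapping.split()]
        if len(parts) != 2:
            continue  # single-character mapping: not a preferred composite
        if unicodedata.combining(parts[0]) != 0:
            continue  # must start with a combining class zero character
        pairs[(parts[0], parts[1])] = ch
    return pairs


PRIMARY_PAIRS = build_primary_pairs()


def compose(decomposed: str) -> str:
    """Step 2: combine characters with the most recent class-zero character."""
    out = []
    starter = None  # index in out of the most recent class-zero character
    last_cc = 0     # combining class of the character most recently appended
    for ch in decomposed:
        cc = unicodedata.combining(ch)
        if starter is not None:
            composite = PRIMARY_PAIRS.get((out[starter], ch))
            adjacent = starter == len(out) - 1
            # Not blocked if adjacent, or if all intervening marks rank lower.
            if composite is not None and (adjacent or last_cc < cc):
                out[starter] = composite
                continue
        if cc == 0:
            starter = len(out)
        out.append(ch)
        last_cc = cc
    return "".join(out)


def form_c(text: str) -> str:
    # Step 1 (assumption: "NFD" is the canonical decomposition), then step 2.
    return compose(unicodedata.normalize("NFD", text))
```

Because the table is derived from a current database without any pinned version, this sketch can differ from both this report and Python's own "NFC" for some characters. It is intended only to make the shape of the process concrete.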


The Normalization Form CC for a string S is defined by the following process.

  1. Generate the compatibility decomposition for the source string S according to the decomposition mappings in the latest supported version of the Unicode Character Database.
  2. Iterate through that decomposition character by character. Test each character C for primary canonical combination according to the decomposition mappings in version 2.1.5 of the Unicode Character Database.

The result of this process is a new string S' which is in Normalization Form CC.
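Form CC differs from Form C only in its first step. The short sketch below makes that structure explicit by chaining two library calls. It assumes Python's "NFKD" for the compatibility decomposition and uses "NFC" for the canonical composition step, which is safe here because a compatibility decomposition is already fully decomposed and canonically ordered. The function name form_cc is ours, and the bundled database is used rather than a pinned version.

```python
import unicodedata


def form_cc(text: str) -> str:
    # Step 1: compatibility decomposition.
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: canonical composition of the decomposed text.
    return unicodedata.normalize("NFC", decomposed)
```

For the inputs we tried, the result matches the library's single-call "NFKC" form.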


Normalization Form C Examples

    Original                            Decomposed                          Composed
a   D-dot_above                         D + dot_above                       D-dot_above
b   D + dot_above                       D + dot_above                       D-dot_above
c   D-dot_below + dot_above             D + dot_below + dot_above           D-dot_below + dot_above
d   D-dot_above + dot_below             D + dot_below + dot_above           D-dot_below + dot_above
e   D + dot_above + dot_below           D + dot_below + dot_above           D-dot_below + dot_above
f   D + dot_above + ogonek + dot_below  D + ogonek + dot_below + dot_above  D-dot_below + ogonek + dot_above
g   E-macron-grave                      E + macron + grave                  E-macron-grave
h   E-macron + grave                    E + macron + grave                  E-macron-grave
i   E-grave + macron                    E + grave + macron                  E-grave + macron
j   angstrom_sign                       A + ring                            A-ring
k   A-ring                              A + ring                            A-ring
l   "Äffin"                             "A\u0308ffin"                       "Äffin"
m   "Ä\uFB03n"                          "A\u0308\uFB03n"                    "Ä\uFB03n"
n   "Henry IV"                          "Henry IV"                          "Henry IV"
o   "Henry \u2163"                      "Henry \u2163"                      "Henry \u2163"
p   ga                                  ka + ten                            ga
q   ka + ten                            ka + ten                            ga
r   hw_ka + hw_ten                      hw_ka + hw_ten                      hw_ka + hw_ten
s   ka + hw_ten                         ka + hw_ten                         ka + hw_ten
t   hw_ka + ten                         hw_ka + ten                         hw_ka + ten
u   kakk                                ki + am + kkf                       kakk

Notes

a-b   Both decomposed and precomposed canonical sequences produce the same result.
c-f   By the time we have gotten to dot_above, it cannot be combined with the base character. There may be intervening combining marks (see f), so long as the result of the combination is canonically equivalent.
g-i   Multiple combining characters are combined with successive base characters. Characters will not be combined if the result would not be canonically equivalent because of their ordering (see i).
j-k   Since Å (A-ring) is the preferred composite, it is the form produced for both characters.
l-m   The ffi_ligature (U+FB03) is not decomposed, since it has a compatibility mapping, not a canonical mapping. (See Normalization Form CC Examples.)
n-o   Similarly, the ROMAN NUMERAL FOUR (U+2163) is not decomposed.
p-t   Different compatibility equivalents of a single Japanese character will not result in the same string in Normalization Form C.
u     Hangul syllables are maintained.
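A selection of the rows above can be replayed against an existing implementation. The sketch below transcribes some rows into code points and compares them with Python's "NFD" and "NFC", which we assume correspond to the Decomposed and Composed columns. Python follows the database bundled with the interpreter, so agreement on these rows is an observation about that implementation and not a conformance statement. The ROWS table and the function mismatches are ours.

```python
import unicodedata

# (row, original, decomposed, composed), transcribed from the table above.
ROWS = [
    ("a", "\u1E0A", "D\u0307", "\u1E0A"),
    ("d", "\u1E0A\u0323", "D\u0323\u0307", "\u1E0C\u0307"),
    ("f", "D\u0307\u0328\u0323", "D\u0328\u0323\u0307", "\u1E0C\u0328\u0307"),
    ("h", "\u0112\u0300", "E\u0304\u0300", "\u1E14"),
    ("i", "\u00C8\u0304", "E\u0300\u0304", "\u00C8\u0304"),
    ("j", "\u212B", "A\u030A", "\u00C5"),
    ("m", "\u00C4\uFB03n", "A\u0308\uFB03n", "\u00C4\uFB03n"),
    ("o", "Henry \u2163", "Henry \u2163", "Henry \u2163"),
    ("p", "\u30AC", "\u30AB\u3099", "\u30AC"),
    ("r", "\uFF76\uFF9E", "\uFF76\uFF9E", "\uFF76\uFF9E"),
    ("u", "\uAC02", "\u1100\u1161\u11A9", "\uAC02"),
]


def mismatches(rows):
    """Return the labels of rows where the library disagrees with the table."""
    bad = []
    for label, original, decomposed, composed in rows:
        if unicodedata.normalize("NFD", original) != decomposed:
            bad.append(label)
        elif unicodedata.normalize("NFC", original) != composed:
            bad.append(label)
    return bad
```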


Normalization Form CC Examples

Cases (a-k) above are the same in both Normalization Form C and CC, and are not repeated here.

    Original                            Decomposed                          Composed
l'  "Äffin"                             "A\u0308ffin"                       "Äffin"
m'  "Ä\uFB03n"                          "A\u0308ffin"                       "Äffin"
n'  "Henry IV"                          "Henry IV"                          "Henry IV"
o'  "Henry \u2163"                      "Henry IV"                          "Henry IV"
p'  ga                                  ka + ten                            ga
q'  ka + ten                            ka + ten                            ga
r'  hw_ka + hw_ten                      ka + ten                            ga
s'  ka + hw_ten                         ka + ten                            ga
t'  hw_ka + ten                         ka + ten                            ga
u'  kakk                                ki + am + kf + kf                   kak + kf

Notes

l'-m'   The ffi_ligature (U+FB03) is decomposed in Normalization Form CC (where it is not in Normalization Form C).
n'-o'   Similarly, the resulting strings here are identical in Normalization Form CC.
p'-t'   Different compatibility equivalents of a single Japanese character will result in the same string in Normalization Form CC.
u'      Hangul syllables are not maintained.
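The contrast between the two composed forms for rows p' through t' can be made concrete by counting distinct results. The sketch below assumes Python's "NFC" and "NFKC" as stand-ins for Forms C and CC and uses the interpreter's bundled database. Row u' is not covered. The GA_VARIANTS table and the function distinct_results are ours.

```python
import unicodedata

# The five spellings from rows p' to t', as code points.
GA_VARIANTS = {
    "p'": "\u30AC",        # ga
    "q'": "\u30AB\u3099",  # ka + ten
    "r'": "\uFF76\uFF9E",  # hw_ka + hw_ten
    "s'": "\u30AB\uFF9E",  # ka + hw_ten
    "t'": "\uFF76\u3099",  # hw_ka + ten
}


def distinct_results(form: str, strings) -> set:
    """Return the set of distinct normalized strings for the given form."""
    return {unicodedata.normalize(form, s) for s in strings}


in_cc = distinct_results("NFKC", GA_VARIANTS.values())
in_c = distinct_results("NFC", GA_VARIANTS.values())
```

Under these assumptions all five spellings collapse to the single character ga in the compatibility form, while the canonical form keeps four distinct strings.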


Copyright

Copyright © 1998-1998 Unicode, Inc. All Rights Reserved.

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.


Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/reports/