|Authors||Mark Davis (firstname.lastname@example.org)|
This document describes specifications for four normalized forms of Unicode text. The design is in public review phase. We welcome review feedback.
This draft is published for review purposes. Previous versions of this draft have been considered by the Unicode Technical Committee, but no final decision has been reached. At its next meeting, the Unicode Technical Committee may approve, reject, or further amend this document.
In particular, the precise version of the character database referenced in the text may change. The current deadline for setting the version is March, 1999. It is expected that the version will change from 2.1.8 to 3.0.
The content of this technical report must be understood in the context of the latest version of the Unicode Standard, Version 2.1. This version is defined by the Unicode Standard 2.0 book, plus the online update to Version 2.1 on http://www.unicode.org/unicode/reports/tr8.html. Corrections and extensions to the Unicode Standard will also be found in the following areas:
This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to email@example.com.
The Unicode Standard Version 2.1 describes several forms of normalization in Section 5.9. Two of these forms are precisely specified in Section 3.9. In particular, the standard defines a canonical decomposition format, which can be used as a normalization for interchanging text. This format allows for binary comparison while maintaining canonical equivalence with the original unnormalized text.
The standard also defines a compatibility decomposition format, which allows for binary comparison while maintaining compatibility equivalence with the original unnormalized text. The latter can also be useful in many circumstances, since it levels the differences between compatibility characters which are inappropriate in those circumstances. For example, the half-width and full-width katakana characters will have the same compatibility decomposition and are thus compatibility equivalents; however, they are not canonical equivalents.
Both of these formats are normalizations to decomposed characters. While Section 3.9 also discusses a normalization to composite characters (also known as decomposable or precomposed characters), it does not precisely specify the format. Because of the nature of the precomposed forms in the Unicode Standard, there is more than one possible specification for a normalized form with composite characters. This document provides a unique specification for those forms, and a label for each normalized form.
This document is divided into the following sections:
The four normalization forms are labeled as follows.
|Normalization Form D||Canonical Decomposition||Sections 3.6 and 3.9 of The Unicode Standard|
|Normalization Form DC||Compatibility Decomposition||Sections 3.6 and 3.9 of The Unicode Standard|
|Normalization Form C||Canonical Decomposition, followed by Canonical Composition|
|Normalization Form CC||Compatibility Decomposition, followed by Canonical Composition|
As with decomposition, there are two forms of normalization to composite characters, Form C and Form CC. The difference between these depends on whether the resulting text is to be a canonical equivalent to the original unnormalized text, or is to be a compatibility equivalent to the original unnormalized text. Both types of normalization can be useful in different circumstances.
Normalization Form C is basically the form of text which uses canonical composite characters where possible, and maintains the distinction between characters that are compatibility equivalents. Implementations of Unicode which restrict themselves to a repertoire containing no combining marks and which declare themselves to be implementations at Level 1 as defined in ISO/IEC 10646-1 are already using Normalization Form C.
Normalization Form CC levels the differences between compatibility characters which are inappropriately distinguished in many circumstances. For example, the half-width and full-width katakana characters will normalize to the same strings, as will Roman Numerals and their letter equivalents. More complete examples are provided below. However, there is loss of information when text is transformed into Normalization Form CC, so it is not recommended for all circumstances.
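The contrast between Forms C and CC can be sketched with Python's standard unicodedata module. This is an illustration under stated assumptions, not part of the specification: it assumes the library's "NFC" and "NFKC" forms play the roles of Normalization Forms C and CC as labeled here, and it uses whatever character database ships with the interpreter rather than the composition version discussed in this document. The helper names form_c and form_cc are hypothetical.

```python
import unicodedata

def form_c(text):
    # Assumption: the library's "NFC" stands in for Normalization Form C.
    return unicodedata.normalize("NFC", text)

def form_cc(text):
    # Assumption: the library's "NFKC" stands in for Normalization Form CC.
    return unicodedata.normalize("NFKC", text)

# HALFWIDTH KATAKANA LETTER KA + HALFWIDTH KATAKANA VOICED SOUND MARK
halfwidth_ga = "\uff76\uff9e"
# KATAKANA LETTER GA (the ordinary full-width composite)
fullwidth_ga = "\u30ac"
# "Henry " followed by U+2163 (the Roman numeral four compatibility character)
roman_four = "Henry \u2163"

# Form C keeps the compatibility distinction; Form CC levels it.
same_in_c = form_c(halfwidth_ga) == form_c(fullwidth_ga)
same_in_cc = form_cc(halfwidth_ga) == form_cc(fullwidth_ga)
```

The Roman numeral string behaves the same way: form_c leaves it unchanged, while form_cc replaces U+2163 with the letters "IV", which is the loss of information noted above.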
To summarize the treatment of compatibility characters that were in the source text:
We will use the following notation for brevity.
|Unicode names are shortened, such as the following: |
|A sequence of characters may be represented by using plus signs between the character names, or by using string notation.|
|"...\uXXXX..." represents the Unicode character U+XXXX embedded within a string.|
|A single character which is equivalent to the sequence of characters B + C may be written as B-C.|
|The normalization forms for a string X can be abbreviated as ND(X), NDC(X), NC(X) and NCC(X), respectively.|
|Conjoining jamo of various types (initial, medial, final) are represented by subscripts, such as ki, am, and kf.|
In the Unicode Character Database, two characters may have the same canonical decomposition. Here is an example of this:
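|212B ('Å' ANGSTROM SIGN)||=>||0041 ('A' LATIN CAPITAL LETTER A) + 030A ('°' COMBINING RING ABOVE)|
|00C5 ('Å' LATIN CAPITAL LETTER A WITH RING ABOVE)||=>||0041 ('A' LATIN CAPITAL LETTER A) + 030A ('°' COMBINING RING ABOVE)|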
However, in such cases, the Unicode Character Database will first decompose one of the characters to the other, and then decompose from there. That is, one of the characters (in this case ANGSTROM SIGN) will have a single-character decomposition, which indicates that the character was only encoded for compatibility with existing standards. This single-character decomposition is used to resolve ambiguity when composing.
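The single-character decomposition can be inspected directly. The following sketch uses Python's unicodedata module, whose decomposition() function returns the raw mapping field from the character database as a string of hexadecimal code points. Treating the library's "NFC" and "NFD" as Forms C and D is an assumption of the example.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"
A_WITH_RING = "\u00c5"

# Raw decomposition mappings as stored in the character database.
angstrom_mapping = unicodedata.decomposition(ANGSTROM_SIGN)  # one code point
a_ring_mapping = unicodedata.decomposition(A_WITH_RING)      # base + mark

# Both characters reach the same fully decomposed sequence...
decomposed_angstrom = unicodedata.normalize("NFD", ANGSTROM_SIGN)
decomposed_a_ring = unicodedata.normalize("NFD", A_WITH_RING)

# ...and composing either one yields A WITH RING ABOVE, not ANGSTROM SIGN.
composed_angstrom = unicodedata.normalize("NFC", ANGSTROM_SIGN)
```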
Note that because of the definition of canonical equivalence in the Unicode Standard, the definition of primary canonical composition has the following implications:
- C0 must have a combining class of zero,
- none of C1, ..., Cn can have a combining class of zero, and
- Cn+1 may or may not have a combining class of zero.
Both of the above definitions are according to the rules for equivalence and decomposition found in Chapter 3 of The Unicode Standard, Version 2.0, and the decomposition mappings in the Unicode Character Database.
Decomposition must be done in accordance with these rules. In particular, the decompositions found in the Unicode Character Database must be applied recursively, and then put into canonical order.
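A minimal sketch of these two steps, recursive application of the mappings followed by canonical ordering, might look as follows in Python. The function name canonical_decompose is hypothetical, and the sketch relies on the unicodedata module's decomposition() and combining() functions for the database fields. It handles only mappings listed in the database, so Hangul syllables (which are decomposed arithmetically) pass through unchanged.

```python
import unicodedata

def canonical_decompose(text):
    """Apply canonical mappings recursively, then put marks in canonical order."""
    expanded = []

    def expand(ch):
        mapping = unicodedata.decomposition(ch)
        # An empty mapping means no decomposition; a mapping that starts with
        # "<tag>" is a compatibility mapping, which is not applied here.
        if not mapping or mapping.startswith("<"):
            expanded.append(ch)
            return
        for code in mapping.split():
            # Recurse, since a mapped character may itself have a mapping.
            expand(chr(int(code, 16)))

    for ch in text:
        expand(ch)

    # Canonical ordering: within each run of characters whose combining class
    # is non-zero, sort by combining class. sorted() is stable, so characters
    # with equal classes keep their relative order.
    result, run = [], []
    for ch in expanded:
        if unicodedata.combining(ch) == 0:
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)
```

For instance, the precomposed E-macron-grave (U+1E14) maps first to E-macron + grave, and E-macron maps in turn to E + macron, which is why the recursion is needed.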
Hangul syllable decomposition is considered a canonical decomposition. See Technical Report #8: The Unicode Standard Version 2.1 (http://www.unicode.org/unicode/reports/tr8.html).
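Hangul syllables are not listed individually in the decomposition mappings; they are decomposed arithmetically. The sketch below assumes the standard constants for the Hangul syllable block and the conjoining jamo ranges (syllables starting at U+AC00, with 19 initial, 21 medial, and 28 final positions, where the first final position means "no final"). The function name decompose_hangul is hypothetical.

```python
# Assumed constants for the Hangul syllable block and conjoining jamo.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # syllables per initial consonant
S_COUNT = L_COUNT * N_COUNT   # total precomposed syllables

def decompose_hangul(ch):
    """Return the conjoining jamo sequence for one Hangul syllable."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch  # not a precomposed Hangul syllable; leave unchanged
    initial = L_BASE + s_index // N_COUNT
    medial = V_BASE + (s_index % N_COUNT) // T_COUNT
    final = T_BASE + s_index % T_COUNT
    jamo = chr(initial) + chr(medial)
    if final != T_BASE:
        # T_BASE itself represents "no final consonant".
        jamo += chr(final)
    return jamo
```

Under these assumptions, the syllable U+AC03 decomposes to an initial kiyeok, a medial a, and a final kiyeok-sios, matching the "kaks" example used later in this document.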
The first major design goal for the normalization forms is uniqueness: two equivalent strings will have precisely the same normalized form. This is a required goal.
The second major design goal for the normalization forms is stability. This goal is highly desired, but not required.
There are four exceptions to Goal 2.2 in the Unicode Standard Version 2.1, according to the definition below. Four new characters are being proposed to remedy this situation by the time the database version is fixed, in Unicode 3.0. These are:
The third major design goal for the normalization forms is efficiency. This goal is highly desired, but not required.
Because additional composite characters may be added to future versions of the Unicode standard, composition is less stable than decomposition. Therefore, it is necessary to specify a fixed version for the composition process, so that implementations can get the same result for normalization even if they upgrade to a new version of Unicode.
|Decomposition is only unstable if an existing character decomposition mapping changes. The Unicode Technical Committee has the policy of carefully reviewing proposed corrections in character decompositions, and only making changes where the benefits very clearly outweigh the drawbacks.|
The fixed version of the composition process is defined by reference to a particular version of the Unicode Character Database, called the composition version. At this point, that version is specified to be the 2.1.8 version, the content of the file UnicodeData-2.1.8.txt (abbreviated as UCD2.1.8); however, the final version is expected to be Unicode 3.0. For more information, see:
To see what difference the composition version makes, suppose that Unicode version 4.0 adds the composite Q-caron. For an implementation that uses Unicode version 4.0, strings in Normalization Forms C or CC will continue to contain the sequence Q + caron, and not the new character Q-caron, since a canonical composition for Q-caron was not defined in the composition version.
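The effect of a fixed composition version can be shown with a toy model. Everything in the sketch below is hypothetical: the pair tables are tiny hand-written excerpts, the function compose_pairs is a simplified greedy pairing that ignores combining classes and is not the composition process specified in this document, and Q-caron is represented by a private-use code point because no such character exists.

```python
CARON = "\u030c"
C_CARON = "\u010c"   # LATIN CAPITAL LETTER C WITH CARON (a real composite)
Q_CARON = "\ue000"   # hypothetical stand-in: a private-use code point

# Pairs that compose in the fixed composition version (toy excerpt).
COMPOSITION_VERSION_PAIRS = {("C", CARON): C_CARON}

# Pairs a later database version might define (hypothetical).
LATER_VERSION_PAIRS = dict(COMPOSITION_VERSION_PAIRS)
LATER_VERSION_PAIRS[("Q", CARON)] = Q_CARON

def compose_pairs(text, pairs):
    """Toy composition: merge a character with the one before it when the
    pair appears in the given table. Combining classes are ignored."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in pairs:
            out[-1] = pairs[(out[-1], ch)]
        else:
            out.append(ch)
    return "".join(out)

# An implementation that keeps using the composition version's table leaves
# Q + caron alone, even after upgrading its character database.
stable_result = compose_pairs("Q" + CARON, COMPOSITION_VERSION_PAIRS)
unstable_result = compose_pairs("Q" + CARON, LATER_VERSION_PAIRS)
```

The design point is that the table consulted during composition is pinned, so two implementations on different database versions still agree on the normalized string.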
A process that produces Unicode text that purports to be in a Normalization Form shall do so in accordance with the specifications in this document.
A process that tests Unicode text to determine whether it is in a Normalization Form shall do so in accordance with the specifications in this document.
|The specifications for Normalization Forms are written in terms of a process for producing a decomposition or composition from an arbitrary Unicode string. This is a logical description--particular implementations can have more efficient mechanisms as long as they produce the same result. Similarly, testing for a particular Normalization Form does not require applying the process of normalization, so long as the result of the test is equivalent to applying normalization and then testing bit-for-bit identity.|
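The logical definition of the test, normalize and then compare for identity, can be written down directly. The sketch below is a reference formulation only and assumes that the form names accepted by Python's unicodedata.normalize ("NFD", "NFKD", "NFC", "NFKC") stand in for Forms D, DC, C, and CC. The function name is_in_form is hypothetical.

```python
import unicodedata

def is_in_form(text, form):
    """Reference test: text is in the given form exactly when normalizing
    it to that form returns an identical string."""
    return unicodedata.normalize(form, text) == text
```

A faster implementation may avoid building the normalized copy, as long as it returns the same answer as this reference test for every input. Recent Python versions (3.8 and later) also provide unicodedata.is_normalized for this purpose.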
Typical strings of composite accented Unicode characters are already in Normalization Form C. However, there are circumstances with possible ambiguities, which require the precise specification in this document.
Basically, the process of forming a composition in Normalization Form C or CC involves:
This is specified more precisely below. The specification is the logical description of the process--implementations are free to use more efficient algorithms as long as the result is the same.
|Normalization Form CC does not attempt to map characters to compatibility composites. For example, a compatibility composition of "office" does not produce "o\uFB03ce", even though "\uFB03" is a character that is the compatibility equivalent of the sequence of three characters 'ffi'.|
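The "office" case can be checked with a short sketch. It assumes that "NFKC" and "NFC" in Python's unicodedata module correspond to Forms CC and C; the variable names are illustrative only.

```python
import unicodedata

with_ligature = "o\ufb03ce"   # contains U+FB03 LATIN SMALL LIGATURE FFI
plain = "office"

# Form CC decomposes the ligature and never recreates it.
cc_of_ligature = unicodedata.normalize("NFKC", with_ligature)
cc_of_plain = unicodedata.normalize("NFKC", plain)

# Form C leaves the ligature alone, since its mapping is a compatibility
# mapping rather than a canonical one.
c_of_ligature = unicodedata.normalize("NFC", with_ligature)
```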
The result of this process is a new string S' which is in Normalization Form C.
The result of this process is a new string S' which is in Normalization Form CC.
|a||D-dot_above||D + dot_above||D-dot_above||Both decomposed and precomposed canonical sequences produce the same result.|
|b||D + dot_above||D + dot_above||D-dot_above|
|c||D-dot_below + dot_above||D + dot_below + dot_above||D-dot_below + dot_above||By the time we have gotten to dot_above, it cannot be combined with the base character. There may be intervening combining marks (see f), so long as the result of the combination is canonically equivalent.|
|d||D-dot_above + dot_below||D + dot_below + dot_above||D-dot_below + dot_above|
|e||D + dot_above + dot_below||D + dot_below + dot_above||D-dot_below + dot_above|
|f||D + dot_above + ogonek + dot_below||D + ogonek + dot_below + dot_above||D-dot_below + ogonek + dot_above|
|g||E-macron-grave||E + macron + grave||E-macron-grave||Multiple combining characters are combined with successive base characters. Characters will not be combined (i) if they would not be canonical equivalents because of their ordering.|
|h||E-macron + grave||E + macron + grave||E-macron-grave|
|i||E-grave + macron||E + grave + macron||E-grave + macron|
|j||angstrom_sign||A + ring||A-ring||Since Å (A-ring) is the preferred composite, it is the form produced for both characters.|
|k||A-ring||A + ring||A-ring|
|l||"Äffin"||"A\u0308ffin"||"Äffin"||The ffi_ligature (U+FB03) is not decomposed, since it has a compatibility mapping, not a canonical mapping. (See Normalization Form CC Examples.)|
|n||"Henry IV"||"Henry IV"||"Henry IV"||Similarly, the ROMAN NUMERAL FOUR (U+2163) is not decomposed.|
|o||"Henry \u2163"||"Henry \u2163"||"Henry \u2163"|
|p||ga||ka + ten||ga||Different compatibility equivalents of a single Japanese character will not result in the same string in Normalization Form C.|
|q||ka + ten||ka + ten||ga|
|r||hw_ka + hw_ten||hw_ka + hw_ten||hw_ka + hw_ten|
|s||ka + hw_ten||ka + hw_ten||ka + hw_ten|
|t||hw_ka + ten||hw_ka + ten||hw_ka + ten|
|u||kaks||ki + am + ksf||kaks||Hangul syllables are maintained.|
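Rows c through f of the table can be reproduced with a current library, as a rough cross-check of the ordering and composition behavior described in the comments. The sketch assumes that "NFD" and "NFC" in Python's unicodedata module correspond to Forms D and C, and that the interpreter's character database gives the same results for these particular characters as the composition version named in this document.

```python
import unicodedata

D_DOT_ABOVE, D_DOT_BELOW = "\u1e0a", "\u1e0c"
DOT_ABOVE, DOT_BELOW, OGONEK = "\u0307", "\u0323", "\u0328"

# Original strings for rows c, d, e and f of the table.
rows = {
    "c": D_DOT_BELOW + DOT_ABOVE,
    "d": D_DOT_ABOVE + DOT_BELOW,
    "e": "D" + DOT_ABOVE + DOT_BELOW,
    "f": "D" + DOT_ABOVE + OGONEK + DOT_BELOW,
}

# Second and third columns of the table, computed rather than written by hand.
decomposed = {row: unicodedata.normalize("NFD", s) for row, s in rows.items()}
composed = {row: unicodedata.normalize("NFC", s) for row, s in rows.items()}
```

In row f the ogonek sits between the base character and dot_below after canonical ordering, yet D still combines with dot_below, which is the "intervening combining marks" case described for these rows.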
Cases (a-k) above are the same in both Normalization Form C and CC, and are not repeated here.
|l'||"Äffin"||"A\u0308ffin"||"Äffin"||The ffi_ligature (U+FB03) is decomposed in Normalization Form CC (where it is not in Normalization Form C).|
|n'||"Henry IV"||"Henry IV"||"Henry IV"||Similarly, the resulting strings here are identical in Normalization Form CC.|
|o'||"Henry \u2163"||"Henry IV"||"Henry IV"|
|p'||ga||ka + ten||ga||Different compatibility equivalents of a single Japanese character will result in the same string in Normalization Form CC.|
|q'||ka + ten||ka + ten||ga|
|r'||hw_ka + hw_ten||ka + ten||ga|
|s'||ka + hw_ten||ka + ten||ga|
|t'||hw_ka + ten||ka + ten||ga|
|u'||kaks||ki + am + ksf||kaks||Hangul syllables are maintained (Unicode Version 2.1.8 and later!)|
Copyright © 1998-1998 Unicode, Inc. All Rights Reserved.
The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
Unicode Home Page: http://www.unicode.org
Unicode Technical Reports: http://www.unicode.org/unicode/reports/