DRAFT Unicode Technical Report #15

Unicode Normalization Forms

Revision 14

Authors Mark Davis (mark@unicode.org)

Date 1999-03-22

This Version http://www.unicode.org/unicode/reports/tr15/tr15-14.html

Previous Version http://www.unicode.org/unicode/reports/tr15/tr15-13.html

Latest Version http://www.unicode.org/unicode/reports/tr15

Unicode Technical Reports http://www.unicode.org/unicode/reports/

Summary

This document describes specifications for four normalized forms of Unicode text. The design is in public review phase. We welcome review feedback.

Status of this document

This draft is published for review purposes. Previous versions of this draft have been considered by the Unicode Technical Committee, but no final decision has been reached. At its next meeting, the Unicode Technical Committee may approve, reject, or further amend this document.

In particular, the precise version of the character database referenced in the text may change. The current deadline for setting the version is April, 1999. It is expected that the version will change from 2.1.8 to 3.0.

The content of technical reports must be understood in the context of the latest version of the Unicode Standard. See http://www.unicode.org/unicode/standard/versions/ for more information.

This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to unicore@unicode.org.

Introduction

This document is divided into the following sections:

Introduction
Notation
Definitions
Versioning
Conformance
Specification
Composition Exclusion Table
Annex: Examples
Annex: Design Goals
Annex: Implementation Notes
Annex: Decomposition
Annex: Trailing Characters
Annex: Intellectual Property

The Unicode Standard Version 2.1 describes several forms of normalization in Section 5.9. Two of these forms are precisely specified in Section 3.9. In particular, the standard defines a canonical decomposition format, which can be used as a normalization for interchanging text. This format allows for binary comparison while maintaining canonical equivalence with the original unnormalized text.

The standard also defines a compatibility decomposition format, which allows for binary comparison while maintaining compatibility equivalence with the original unnormalized text. The latter can also be useful in many circumstances, since it levels the differences between compatibility characters which are inappropriate in those circumstances. For example, the half-width and full-width katakana characters will have the same compatibility decomposition and are thus compatibility equivalents; however, they are not canonical equivalents.

Both of these formats are normalizations to decomposed characters. While Section 3.9 also discusses a normalization to composite characters (also known as decomposable or precomposed characters), it does not precisely specify the format. Because of the nature of the precomposed forms in the Unicode Standard, there is more than one possible specification for a normalized form with composite characters. This document provides a unique specification for those forms, and a label for each normalized form.

The four normalization forms are labeled as follows.

Title	Description	Specification
Normalization Form D	Canonical Decomposition	Sections 3.6, 3.9, and 3.10 of The Unicode Standard, also summarized under Decomposition
Normalization Form C	Canonical Decomposition, followed by Canonical Composition	see Specification
Normalization Form KD	Compatibility Decomposition	Sections 3.6, 3.9, and 3.10 of The Unicode Standard, also summarized under Decomposition
Normalization Form KC	Compatibility Decomposition, followed by Canonical Composition	see Specification

As with decomposition, there are two forms of normalization to composite characters, Form C and Form KC. The difference between these depends on whether the resulting text is to be a canonical equivalent to the original unnormalized text, or is to be a compatibility equivalent to the original unnormalized text. (In KC and KD, a K is used to stand for compatibility to avoid confusion with the C standing for canonical.) Both types of normalization can be useful in different circumstances.

Normalization Form C is basically the form of text which uses canonical composite characters where possible, and maintains the distinction between characters that are compatibility equivalents. Typical strings of composite accented Unicode characters are already in Normalization Form C. Implementations of Unicode which restrict themselves to a repertoire containing no combining marks (such as those that declare themselves to be implementations at Level 1 as defined in ISO/IEC 10646-1) are already typically using Normalization Form C. (Implementations of later versions of 10646 need to be aware of the versioning issues--see Versioning.) This is also the form of normalization currently chosen for use in W3C specifications; see the W3C Character Model document (http://www.w3.org/TR/WD-charmod) and the W3C Character Requirements document (http://www.w3.org/TR/WD-charreq).

Normalization Form KC additionally levels the differences between compatibility characters which are inappropriately distinguished in many circumstances. For example, the half-width and full-width katakana characters will normalize to the same strings, as will Roman Numerals and their letter equivalents. More complete examples are provided below. However, there is loss of information when text is transformed into Normalization Form KC, so it is not recommended for all circumstances.

To summarize the treatment of compatibility characters that were in the source text:

Both forms D and C maintain compatibility characters.
Neither form KD nor form KC maintains compatibility characters.
None of the forms generate compatibility characters that were not in the source text.

Normalization Form KC does not attempt to map characters to compatibility composites. For example, a compatibility composition of "office" does not produce "o\uFB03ce", even though "\uFB03" is a character that is the compatibility equivalent of the sequence of three characters 'ffi'.
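The "office" example can be checked with a few lines of Java. This is only a sketch: it relies on the Java platform class java.text.Normalizer, which is an assumption of the example and not part of this report, and which follows a later Unicode version than the one discussed here. The class name LigatureDemo and its helper methods are hypothetical.

```java
import java.text.Normalizer;

class LigatureDemo {
    // Form KC of s, as computed by the platform library (an assumption of this sketch).
    static String kc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Form C of s, computed the same way.
    static String c(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Form KC never generates the ligature from plain letters.
        System.out.println(kc("office").equals("office"));
        // Form KC replaces an existing ligature by its three letters.
        System.out.println(kc("o\uFB03ce").equals("office"));
        // Form C leaves an existing ligature alone.
        System.out.println(c("o\uFB03ce").equals("o\uFB03ce"));
    }
}
```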

Neither of the composition normalization forms C and KC is closed under string concatenation. For example, the strings "a" and "^" (combining circumflex) are both in form C, but the concatenation of the two ("a" + "^" => "a^") is not: the normalized form is the precomposed character "â". There is no way to produce a composition normalized form that is closed under simple string concatenation without disturbing other string operations. If desired, however, a specialized function could be constructed that produced a normalized concatenation.

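The decomposition normalization forms D and KD are closed under substringing. They are closed under string concatenation only when no canonical reordering is needed across the boundary. For example, the strings "a" + dot_above and dot_below are each in form D, but their concatenation is not, since dot_below (combining class 220) must be ordered before dot_above (combining class 230).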
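The circumflex example, and the "specialized function" for normalized concatenation mentioned above, might look roughly like this in Java. The sketch assumes the platform class java.text.Normalizer (not part of this report). The names ConcatDemo, isC and concatC are hypothetical, and concatC simply renormalizes the whole result rather than only the text around the boundary.

```java
import java.text.Normalizer;

class ConcatDemo {
    // True if s is already in Form C according to the platform library.
    static boolean isC(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    // Hypothetical normalized concatenation: concatenate, then normalize the result.
    static String concatC(String x, String y) {
        return Normalizer.normalize(x + y, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String a = "a";
        String circumflex = "\u0302"; // combining circumflex
        System.out.println(isC(a) && isC(circumflex)); // each part is in Form C
        System.out.println(isC(a + circumflex));       // the plain concatenation is not
        System.out.println(concatC(a, circumflex).equals("\u00E2"));
    }
}
```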

Notation

We will use the following notation for brevity:

Unicode names are shortened, such as the following:

E-grave	=	LATIN CAPITAL LETTER E WITH GRAVE
ka	=	KATAKANA LETTER KA
hw_ka	=	HALFWIDTH KATAKANA LETTER KA
ten	=	COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
hw_ten	=	HALFWIDTH KATAKANA VOICED SOUND MARK

The combining class of a character X may be written as CC(X)
A sequence of characters may be represented by using plus signs between the character names, or by using string notation.
"...\uXXXX..." represents the Unicode character U+XXXX embedded within a string.
A single character which is equivalent to the sequence of characters B + C may be written as B-C.
The normalization forms for a string X can be abbreviated as D(X), KD(X), C(X) and KC(X), respectively.
Conjoining jamo of various types (initial, medial, final) are represented by subscripts, such as k_i, a_m, and k_f.

Versioning

Because additional composite characters may be added to future versions of the Unicode standard, composition is less stable than decomposition. Therefore, it is necessary to specify a fixed version for the composition process, so that implementations can get the same result for normalization even if they upgrade to a new version of Unicode.

Decomposition is only unstable if an existing character's decomposition mapping changes. The Unicode Technical Committee has the policy of carefully reviewing proposed corrections in character decompositions, and only making changes where the benefits very clearly outweigh the drawbacks.

The fixed version of the composition process is defined by reference to a particular version of the Unicode Character Database, called the composition version. At this point, that version is specified to be the 2.1.8 version, the content of the file UnicodeData-2.1.8.txt (abbreviated as UCD2.1.8); however, the final version is expected to be Unicode 3.0. For more information, see:

To see what difference the composition version makes, suppose that Unicode version 4.0 adds the composite Q-caron. For an implementation that uses Unicode version 4.0, strings in Normalization Forms C or KC will continue to contain the sequence Q + caron, and not the new character Q-caron, since a canonical composition for Q-caron was not defined in the composition version.

Definitions

All of the following definitions depend on the rules for equivalence and decomposition found in Chapter 3 of The Unicode Standard, Version 2.0, and the decomposition mappings in the Unicode Character Database.

Decomposition must be done in accordance with these rules. In particular, the decomposition mappings found in the Unicode Character Database must be applied recursively, and then the string put into canonical order.

Hangul syllable decomposition is considered a canonical decomposition. See Technical Report #8: The Unicode Standard Version 2.1 (http://www.unicode.org/unicode/reports/tr8.html).

D1. In any character sequence, we say that a character X blocks a character Y if:

X occurs before Y in the sequence, and either
- X has the same canonical class as Y, or
- X is a class-zero character (has canonical class equal to zero), or
- Y is a class-zero character

When X blocks Y, changing the order of X and Y would result in a character sequence that is not canonically equivalent to the original. See Section 3.9 Canonical Ordering Behavior in the Unicode Standard.

If a combining character sequence is in canonical order, then testing whether a character is blocked only requires looking at the immediately preceding character.
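As a minimal sketch of D1, the following Java works on canonical combining classes only. The class and method names are hypothetical. The caller is assumed to supply the classes of the marks that stand between the last class-zero character and the character being tested. The class values in main are those of horn (216), dot_below (220) and dot_above (230).

```java
class Blocking {
    // D1 for a character X that occurs before a character Y,
    // given their canonical combining classes.
    static boolean blocks(int ccX, int ccY) {
        return ccX == ccY || ccX == 0 || ccY == 0;
    }

    // General test: does any of the preceding marks block Y?
    static boolean isBlocked(int[] marksBefore, int ccY) {
        for (int ccX : marksBefore) {
            if (blocks(ccX, ccY)) return true;
        }
        return false;
    }

    // Shortcut that is valid only when the sequence is in canonical order:
    // look at the immediately preceding mark alone.
    static boolean isBlockedInCanonicalOrder(int[] marksBefore, int ccY) {
        return marksBefore.length > 0
            && blocks(marksBefore[marksBefore.length - 1], ccY);
    }

    public static void main(String[] args) {
        int[] hornThenDotBelow = {216, 220};
        System.out.println(isBlocked(hornThenDotBelow, 230)); // dot_above after horn, dot_below
        System.out.println(isBlocked(hornThenDotBelow, 220)); // a second dot_below
    }
}
```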

D2. A primary composite is a character that has a canonical decomposition mapping in the Unicode Character Database but is not in the Composition Exclusion Table.

D3. In a sequence of canonically decomposed characters S = <C₀, C₁,...,C_n,C_n+1>, the character C_n+1 can be primary canonically combined with C₀ if

there is a primary composite X such that the sequence <X, C₁,...,C_n> is canonically equivalent to S
none of C₁,...,C_n can be primary canonically combined with C₀

In such a case, X is said to be the primary canonical composition of C₀ and C_n+1.

Note that because of D2, the contents of Composition Exclusion Table, and the definition of canonical equivalence in the Unicode Standard, D3 has the following implications:

C₀ must have a combining class of zero,
none of C₁,...,C_n can have a combining class of zero
C_n+1 may or may not have a combining class of zero
and most importantly, none of C₁,...,C_n block C_n+1

While the Normalization Forms are specified for Unicode text, they can also be extended to non-Unicode (legacy) character encodings. This is based on mapping the legacy character set strings to and from Unicode.

D4. An invertible transcoding T for a legacy character set L is a mapping from strings encoded in L to strings in Unicode that has an associated mapping T^-1 such that for any string S in L, T^-1(T(S)) = S.

Typically there is a single accepted invertible transcoding for a given legacy character set. In a few cases there may be multiple invertible transcodings: for example, JIS may have two different mappings used in different circumstances: one to preserve the '/' semantics of 2F₁₆, and one to preserve the '¥' semantics.

The character indexes in the legacy character set string may be very different than character indexes in the Unicode equivalent. For example, if a legacy string uses visual encoding for Hebrew, then its first character might be the last character in the Unicode string.

D5. Given a string S encoded in L and an invertible transcoding T for L, the Normalization Form X of S under T is defined to be the result of mapping to Unicode, normalizing to Unicode Normalization Form X, and mapping back to the legacy character encoding, that is, T^-1(X(T(S))). Where there is a single accepted invertible transcoding for that character set, we can simply speak of the Normalization Form X of S.

Legacy character sets fall into three categories based on their normalization behavior:

Unnormalizable. Some strings in the character set cannot be normalized into Form X.
For example, ISO 5426 is unnormalizable in Form C under common transcoders, since it contains combining marks but not composites.
Prenormalized. Any string in the character set is already in Normalization Form X.
For example, ISO 8859-1 is prenormalized in Form C.
Normalizable. Although the set is not prenormalized, any string in the set can be normalized to Form X.
ISO 2022 (with a mixture of ISO 5426 and ISO 8859-1) is an example of this.
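Definition D5 and the "prenormalized" case can be sketched in Java as follows. The sketch assumes that the platform charset decoder and encoder serve as T and T^-1, and that java.text.Normalizer serves as X; neither is defined by this report, and the class name LegacyNormalization is hypothetical. Note that String.getBytes substitutes a replacement byte for characters the charset cannot represent, so the result is only meaningful when the normalized string is representable in the legacy set.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

class LegacyNormalization {
    // Computes T^-1(X(T(S))) for a legacy string S given as bytes.
    static byte[] normalize(byte[] s, Charset legacy, Normalizer.Form form) {
        String unicode = new String(s, legacy);                   // T(S)
        String normalized = Normalizer.normalize(unicode, form);  // X(T(S))
        return normalized.getBytes(legacy);                       // T^-1(X(T(S)))
    }

    public static void main(String[] args) {
        // A-ring, n, g encoded in ISO 8859-1
        byte[] s = {(byte) 0xC5, (byte) 'n', (byte) 'g'};
        byte[] c = normalize(s, StandardCharsets.ISO_8859_1, Normalizer.Form.NFC);
        // For a prenormalized character set the bytes come back unchanged.
        System.out.println(Arrays.equals(s, c));
    }
}
```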

Conformance

A process that produces Unicode text that purports to be in a Normalization Form shall do so in accordance with the specifications in this document.

A process that tests Unicode text to determine whether it is in a Normalization Form shall do so in accordance with the specifications in this document.

The specifications for Normalization Forms are written in terms of a process for producing a decomposition or composition from an arbitrary Unicode string. This is a logical description--particular implementations can have more efficient mechanisms as long as they produce the same result. Similarly, testing for a particular Normalization Form does not require applying the process of normalization, so long as the result of the test is equivalent to applying normalization and then testing bit-for-bit identity.
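The equivalence required of a test can be stated in code. In this sketch, isFormCByDefinition follows the logical description (normalize, then compare), while isFormC delegates to a library test that is free to use a cheaper mechanism. Both rely on java.text.Normalizer, which is an assumption of the example; the class and method names are hypothetical.

```java
import java.text.Normalizer;

class FormCTest {
    // Logical description: apply normalization, then test for identity.
    static boolean isFormCByDefinition(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).equals(s);
    }

    // A test that need not build the normalized string,
    // but must always give the same answer as the method above.
    static boolean isFormC(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String[] samples = {"abc", "A\u030A", "\u00C5", "\u212B"};
        for (String s : samples) {
            System.out.println(isFormC(s) == isFormCByDefinition(s));
        }
    }
}
```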

Specification

The process of forming a composition in Normalization Form C or KC involves:

decomposing the string according to the canonical or compatibility mappings of the Unicode Character Database that corresponds to the latest version of Unicode supported by the implementation, then
composing primary composites according to the canonical mappings of the composition version of the Unicode Character Database.

This is specified more precisely below. Examples are provided below.

Normalization Form C

The Normalization Form C for a string S is obtained by applying the following process, or any other process that leads to the same result:

Generate the canonical decomposition for the source string S according to the decomposition mappings in the latest supported version of the Unicode Character Database.
Iterate through that decomposition character by character. Test each character C for primary canonical combination according to the decomposition mappings in the composition version of the Unicode Character Database:
- If C can be primary canonically combined with the last class-zero character B,
  then replace B by the composite B-C, and remove C.

The result of this process is a new string S' which is in Normalization Form C.
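The composition step can be sketched in Java as below. This is an illustration under stated assumptions, not a conformant implementation: the tables CLASS and PRIMARY are tiny hypothetical stand-ins for the Unicode Character Database (four marks and two composites), the input is assumed to be already canonically decomposed and in canonical order, and all class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

class ComposeSketch {
    // Toy stand-ins for the Unicode Character Database (hypothetical subset).
    static final Map<Character, Integer> CLASS = new HashMap<>();
    static final Map<String, Character> PRIMARY = new HashMap<>();
    static {
        CLASS.put('\u031B', 216); // horn
        CLASS.put('\u0323', 220); // dot_below
        CLASS.put('\u0304', 230); // macron
        CLASS.put('\u0307', 230); // dot_above
        PRIMARY.put("D\u0307", '\u1E0A'); // D + dot_above gives D-dot_above
        PRIMARY.put("D\u0323", '\u1E0C'); // D + dot_below gives D-dot_below
    }

    // Combining class of c; characters missing from the toy table count as class zero.
    static int cc(char c) {
        Integer v = CLASS.get(c);
        return v == null ? 0 : v;
    }

    // Input must already be canonically decomposed and in canonical order.
    static String compose(String decomposed) {
        StringBuilder out = new StringBuilder();
        int base = -1;     // position in out of the last class-zero character B
        int lastClass = 0; // class of the last character appended to out
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            int cls = cc(c);
            if (base >= 0) {
                boolean adjacent = out.length() == base + 1;
                // D1, checking only the preceding character (valid in canonical order)
                boolean blocked = !adjacent && (cls == 0 || lastClass == cls);
                Character composite = PRIMARY.get("" + out.charAt(base) + c);
                if (!blocked && composite != null) {
                    out.setCharAt(base, composite); // replace B by the composite B-C
                    continue;                        // and remove C
                }
            }
            if (cls == 0) base = out.length();
            lastClass = cls;
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // D + horn + dot_below + dot_above
        System.out.println(compose("D\u031B\u0323\u0307").equals("\u1E0C\u031B\u0307"));
    }
}
```

With the toy tables, D + horn + dot_below + dot_above becomes D-dot_below + horn + dot_above, while D + macron + dot_above is left unchanged because the macron blocks dot_above.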

Normalization Form KC

The Normalization Form KC for a string S is obtained by applying the following process, or any other process that leads to the same result:

Generate the compatibility decomposition for the source string S according to the decomposition mappings in the latest supported version of the Unicode Character Database.
Iterate through that decomposition character by character. Test each character C for primary canonical combination according to the decomposition mappings in the composition version of the Unicode Character Database:
- If C can be primary canonically combined with the last class-zero character B,
  then replace B by the composite B-C, and remove C.

The result of this process is a new string S' which is in Normalization Form KC.

Composition Exclusion Table

In the Unicode Character Database, two characters may have the same canonical decomposition. Here is an example of this:

Source	Decomposition
`212B ('Å' ANGSTROM SIGN)`	=>	`0041 ('A' LATIN CAPITAL LETTER A)` + `030A ('°' COMBINING RING ABOVE)`
`00C5 ('Å' LATIN CAPITAL LETTER A WITH RING ABOVE)`	=>	`0041 ('A' LATIN CAPITAL LETTER A)` + `030A ('°' COMBINING RING ABOVE)`

However, in such cases, the Unicode Character Database will first decompose one of the characters to the other, and then decompose from there. That is, one of the characters (in this case ANGSTROM SIGN) will have a singleton decomposition. These singleton decompositions are some of the decompositions excluded from primary composition.
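The effect of excluding the singleton can be observed with a short Java check. It assumes java.text.Normalizer, which is not part of this report and reflects a later Unicode version in which ANGSTROM SIGN is still excluded from composition. The class name SingletonDemo and its helpers are hypothetical.

```java
import java.text.Normalizer;

class SingletonDemo {
    // Form D of s, as computed by the platform library.
    static String d(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Form C of s, as computed by the platform library.
    static String c(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String angstrom = "\u212B"; // ANGSTROM SIGN
        String aRing = "\u00C5";    // LATIN CAPITAL LETTER A WITH RING ABOVE
        // Both characters have the same full decomposition.
        System.out.println(d(angstrom).equals(d(aRing)));
        // Composition never produces the excluded singleton.
        System.out.println(c(angstrom).equals(aRing));
    }
}
```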

The characters having excluded decompositions are included in Unicode essentially for compatibility with certain pre-existing standards. They fall into four classes:

Singletons: precomposed characters whose decompositions are single characters (as described above). These can be computed from information in the Unicode Character Database.
Non-zeros: precomposed characters whose decompositions start with a character that is not of combining class zero. These can be computed from information in the Unicode Character Database.
Script-specifics: precomposed characters that are generally not the preferred form for given scripts.
Post Composition Version: precomposed characters that are added to Unicode after the composition version is fixed. This set is currently empty, but will be updated with each subsequent version of Unicode.

[Note: once this document is final, a machine readable form of the following table will be made available on the Unicode ftp site.]

Composition Exclusion Table

Singletons

0340 COMBINING GRAVE TONE MARK
0341 COMBINING ACUTE TONE MARK
0343 COMBINING GREEK KORONIS
0374 GREEK NUMERAL SIGN
037E GREEK QUESTION MARK
0387 GREEK ANO TELEIA
1F71 GREEK SMALL LETTER ALPHA WITH OXIA
1F73 GREEK SMALL LETTER EPSILON WITH OXIA
1F75 GREEK SMALL LETTER ETA WITH OXIA
1F77 GREEK SMALL LETTER IOTA WITH OXIA
1F79 GREEK SMALL LETTER OMICRON WITH OXIA
1F7B GREEK SMALL LETTER UPSILON WITH OXIA
1F7D GREEK SMALL LETTER OMEGA WITH OXIA
1FBB GREEK CAPITAL LETTER ALPHA WITH OXIA
1FBE GREEK PROSGEGRAMMENI
1FC9 GREEK CAPITAL LETTER EPSILON WITH OXIA
1FCB GREEK CAPITAL LETTER ETA WITH OXIA
1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
1FDB GREEK CAPITAL LETTER IOTA WITH OXIA
1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
1FEB GREEK CAPITAL LETTER UPSILON WITH OXIA
1FEE GREEK DIALYTIKA AND OXIA
1FEF GREEK VARIA
1FF9 GREEK CAPITAL LETTER OMICRON WITH OXIA
1FFB GREEK CAPITAL LETTER OMEGA WITH OXIA
1FFD GREEK OXIA
2000 EN QUAD
2001 EM QUAD
2126 OHM SIGN
212A KELVIN SIGN
212B ANGSTROM SIGN
2329 LEFT-POINTING ANGLE BRACKET
232A RIGHT-POINTING ANGLE BRACKET
F900 CJK COMPATIBILITY IDEOGRAPH-F900 ..FA2D CJK COMPATIBILITY IDEOGRAPH-FA2D

Non-zeros

0E33 THAI CHARACTER SARA AM
0EB3 LAO VOWEL SIGN AM

Script-specifics

0958 DEVANAGARI LETTER QA ..095F DEVANAGARI LETTER YYA
FB1F HEBREW LIGATURE YIDDISH YOD YOD PATAH
FB2A HEBREW LETTER SHIN WITH SHIN DOT ..FB36 HEBREW LETTER ZAYIN WITH DAGESH
FB38 HEBREW LETTER TET WITH DAGESH ..FB3C HEBREW LETTER LAMED WITH DAGESH
FB3E HEBREW LETTER MEM WITH DAGESH
FB40 HEBREW LETTER NUN WITH DAGESH
FB41 HEBREW LETTER SAMEKH WITH DAGESH
FB43 HEBREW LETTER FINAL PE WITH DAGESH
FB44 HEBREW LETTER PE WITH DAGESH
FB46 HEBREW LETTER TSADI WITH DAGESH ..FB4E HEBREW LETTER PE WITH RAFE

Post Composition Version
This set is currently empty, but will be updated with each subsequent version of Unicode.

Annex: Examples

Normalization Form C Examples:

	Original	Decomposed	Composed	Notes
a	D-dot_above	D + dot_above	D-dot_above	Both decomposed and precomposed canonical sequences produce the same result.
b	D + dot_above	D + dot_above	D-dot_above
c	D-dot_below + dot_above	D + dot_below + dot_above	D-dot_below + dot_above	By the time we have gotten to dot_above, it cannot be combined with the base character. There may be intervening combining marks (see f), so long as the result of the combination is canonically equivalent.
d	D-dot_above + dot_below	D + dot_below + dot_above	D-dot_below + dot_above
e	D + dot_above + dot_below	D + dot_below + dot_above	D-dot_below + dot_above
f	D + dot_above + horn + dot_below	D + horn + dot_below + dot_above	D-dot_below + horn + dot_above
g	E-macron-grave	E + macron + grave	E-macron-grave	Multiple combining characters are combined with successive base characters.
h	E-macron + grave	E + macron + grave	E-macron-grave
i	E-grave + macron	E + grave + macron	E-grave + macron	Characters will not be combined if they would not be canonical equivalents because of their ordering.
j	angstrom_sign	A + ring	A-ring	Since Å (A-ring) is the preferred composite, it is the form produced for both characters.
k	A-ring	A + ring	A-ring
l	"Äffin"	"A\u0308ffin"	"Äffin"	The ffi_ligature (U+FB03) is not decomposed, since it has a compatibility mapping, not a canonical mapping. (See Normalization Form KC Examples.)
m	"Ä\uFB03n"	"A\u0308\uFB03n"	"Ä\uFB03n"
n	"Henry IV"	"Henry IV"	"Henry IV"	Similarly, the ROMAN NUMERAL IV (U+2163) is not decomposed.
o	"Henry \u2163"	"Henry \u2163"	"Henry \u2163"
p	ga	ka + ten	ga	Different compatibility equivalents of a single Japanese character will not result in the same string in Normalization Form C.
q	ka + ten	ka + ten	ga
r	hw_ka + hw_ten	hw_ka + hw_ten	hw_ka + hw_ten
s	ka + hw_ten	ka + hw_ten	ka + hw_ten
t	hw_ka + ten	hw_ka + ten	hw_ka + ten
u	kaks	k_i + a_m + ks_f	kaks	Hangul syllables are maintained.

Normalization Form KC Examples

Cases (a-k) above are the same in both Normalization Form C and KC, and are not repeated here.

	Original	Decomposed	Composed	Notes
l'	"Äffin"	"A\u0308ffin"	"Äffin"	The ffi_ligature (U+FB03) is decomposed in Normalization Form KC (where it is not in Normalization Form C).
m'	"Ä\uFB03n"	"A\u0308ffin"	"Äffin"
n'	"Henry IV"	"Henry IV"	"Henry IV"	Similarly, the resulting strings here are identical in Normalization Form KC.
o'	"Henry \u2163"	"Henry IV"	"Henry IV"
p'	ga	ka + ten	ga	Different compatibility equivalents of a single Japanese character will result in the same string in Normalization Form KC.
q'	ka + ten	ka + ten	ga
r'	hw_ka + hw_ten	ka + ten	ga
s'	ka + hw_ten	ka + ten	ga
t'	hw_ka + ten	ka + ten	ga
u'	kaks	k_i + a_m + ks_f	kaks	Hangul syllables are maintained. In earlier versions of Unicode, jamo characters like ks_f had compatibility mappings to k_f + s_f. These mappings were removed in Unicode 2.1.9 to ensure that Hangul syllables are maintained.
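Rows o' through t' of the table can be reproduced with the Java platform normalizer. As elsewhere, java.text.Normalizer is an assumption of the sketch rather than something this report defines, and it implements a later Unicode version; the class name KCExamples is hypothetical. The code points used are U+30AB (ka), U+3099 (ten), U+FF76 (hw_ka), U+FF9E (hw_ten) and U+30AC (ga).

```java
import java.text.Normalizer;

class KCExamples {
    // Form C of s, as computed by the platform library.
    static String c(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Form KC of s, as computed by the platform library.
    static String kc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ga = "\u30AC";
        // All spellings of "ga" collapse to one string in Form KC.
        System.out.println(kc("\uFF76\uFF9E").equals(ga)); // hw_ka + hw_ten
        System.out.println(kc("\u30AB\uFF9E").equals(ga)); // ka + hw_ten
        System.out.println(kc("\uFF76\u3099").equals(ga)); // hw_ka + ten
        // Form C keeps the half-width characters distinct.
        System.out.println(c("\uFF76\uFF9E").equals("\uFF76\uFF9E"));
        // The Roman numeral is replaced by letters only in Form KC.
        System.out.println(kc("Henry \u2163").equals("Henry IV"));
    }
}
```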

Annex: Design Goals

The following were the design goals for the specification of the normalization forms, and are presented here for reference.

Goal 1: Uniqueness

The first major design goal for the normalization forms is uniqueness: two equivalent strings will have precisely the same normalized form. More explicitly,

If two strings x and y are canonical equivalents, then
- C(x) = C(y)
- D(x) = D(y)
If two strings are compatibility equivalents, then
- KC(x) = KC(y)
- KD(x) = KD(y)

This is an absolutely required goal.
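Goal 1 can be spot-checked for particular strings. The sketch below assumes java.text.Normalizer (not part of this report); the class name UniquenessCheck and the method sameInAllForms are hypothetical. It tests individual pairs only and proves nothing about the forms in general.

```java
import java.text.Normalizer;

class UniquenessCheck {
    // True if x and y have identical normalized forms for every form listed.
    static boolean sameInAllForms(String x, String y, Normalizer.Form... forms) {
        for (Normalizer.Form f : forms) {
            if (!Normalizer.normalize(x, f).equals(Normalizer.normalize(y, f))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Canonical equivalents: D-dot_above + dot_below, and D + dot_below + dot_above
        System.out.println(sameInAllForms("\u1E0A\u0323", "D\u0323\u0307",
                Normalizer.Form.NFC, Normalizer.Form.NFD));
        // Compatibility equivalents: the ffi ligature and the letters f, f, i
        System.out.println(sameInAllForms("\uFB03", "ffi",
                Normalizer.Form.NFKC, Normalizer.Form.NFKD));
    }
}
```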

Goal 2: Stability

The second major design goal for the normalization forms is stability of characters that are not involved in the composition or decomposition process.

If X contains a character with a compatibility decomposition, then D(X) and C(X) still contain that character.
If the only decomposable characters in X are primary (see D2) and there are no combining characters, then C(X) = X.

There are four exceptions to Goal 2.2 in the Unicode Standard Version 2.1, according to D3. Four new characters have been accepted to remedy this situation by the time the database version is fixed, in Unicode 3.0. These are:

0226 LATIN CAPITAL LETTER A WITH DOT ABOVE
0227 LATIN SMALL LETTER A WITH DOT ABOVE
0228 LATIN CAPITAL LETTER E WITH CEDILLA
0229 LATIN SMALL LETTER E WITH CEDILLA

Goal 3: Efficiency

The third major design goal for the normalization forms is that they allow for efficient implementations.

It is possible to implement efficient code for producing the Normalization Forms. In particular, it should be possible to produce Normalization Form C very quickly from strings that are already in Normalization Form C or are in Normalization Form D.

Annex: Implementation Notes

Efficiency

There are a number of optimizations that can be made in programs that produce Normalization Form C. Rather than first decomposing the text fully, a quick check can be made on each character. If it is already in the proper precomposed form, then no work has to be done. Only if the current character is combining or in the Composition Exclusion Table does a slower code path need to be invoked. (This code path will need to look at previous characters, back to the last character with canonical class zero. See Trailing Characters for more information.)
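A coarse version of this fast path might look like the following. The per-character test needsSlowPath is a deliberately conservative, hypothetical stand-in: it assumes that no character below U+0300 is a combining mark or has an excluded decomposition, and it sends everything else to the slow path, for which the sketch simply calls java.text.Normalizer (an assumption, not part of this report). A real implementation would consult per-character data instead.

```java
import java.text.Normalizer;

class QuickCheckC {
    // Conservative stand-in for the per-character check (an assumption of this sketch):
    // characters below U+0300 are taken to need no work.
    static boolean needsSlowPath(char c) {
        return c >= 0x0300;
    }

    static String toFormC(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (needsSlowPath(s.charAt(i))) {
                // slow path: full normalization
                return Normalizer.normalize(s, Normalizer.Form.NFC);
            }
        }
        return s; // fast path: the string is returned untouched
    }

    public static void main(String[] args) {
        String plain = "r\u00E9sum\u00E9";
        System.out.println(toFormC(plain) == plain);               // no copy was made
        System.out.println(toFormC("A\u030A").equals("\u00C5"));   // combining mark forces the slow path
    }
}
```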

The majority of the cycles spent in doing composition are spent looking up the appropriate data. The data lookup for Normalization Form C can be implemented very efficiently, since it only has to look up pairs of characters, not arbitrary strings. First a multi-stage table (as discussed on page 5-8 of The Unicode Standard, Version 2.0) is used to map a character c to a small integer i in a contiguous range from 0 to n. The code for doing this looks like:

i = data[index[c >> BLOCKSHIFT] + (c & BLOCKMASK)];

Then a pair of these small integers is simply mapped through a two-dimensional array to get a resulting value. This yields much better performance than a general-purpose string lookup in a hash table.
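To make the two lookups concrete, here is a minimal Java sketch of a two-stage table feeding a pair array. The class PairLookup, its block size of 128, and the way blocks are allocated are all assumptions made for illustration; the small integer 0 is reserved here to mean "not involved in composition".

```java
import java.util.Arrays;

class PairLookup {
    static final int BLOCKSHIFT = 7;
    static final int BLOCKSIZE = 1 << BLOCKSHIFT;
    static final int BLOCKMASK = BLOCKSIZE - 1;

    // Stage 1: one entry per block of 128 characters, holding an offset into data.
    // Offset 0 is a shared all-zero block, so unmapped blocks need no storage of their own.
    final int[] index = new int[0x10000 >> BLOCKSHIFT];
    // Stage 2: the small integers themselves.
    int[] data = new int[BLOCKSIZE];
    // Pair table indexed by two small integers; 0 means the pair does not compose.
    final char[][] composite;

    PairLookup(int n) {
        composite = new char[n + 1][n + 1];
    }

    // Assigns the small integer (1..n) for character c, allocating its block if needed.
    void assign(char c, int small) {
        int block = c >> BLOCKSHIFT;
        if (index[block] == 0) {
            index[block] = data.length;
            data = Arrays.copyOf(data, data.length + BLOCKSIZE);
        }
        data[index[block] + (c & BLOCKMASK)] = small;
    }

    // The two-stage lookup from the text.
    int small(char c) {
        return data[index[c >> BLOCKSHIFT] + (c & BLOCKMASK)];
    }

    // Returns the composite of the pair, or 0 if there is none.
    char compose(char first, char second) {
        return composite[small(first)][small(second)];
    }

    public static void main(String[] args) {
        PairLookup t = new PairLookup(2);
        t.assign('D', 1);
        t.assign('\u0307', 2);          // dot_above
        t.composite[1][2] = '\u1E0A';   // D-dot_above
        System.out.println(t.compose('D', '\u0307') == '\u1E0A');
    }
}
```

Because unmapped characters map to 0 and row 0 and column 0 of the pair table are never filled, a pair involving any unmapped character yields 0 without a separate range check.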

Hangul

Since the Hangul compositions and decompositions are algorithmic, memory storage can be significantly reduced if the corresponding operations are done in code rather than by simply storing the data in the general purpose tables. Here is sample code illustrating algorithmic Hangul canonical decomposition and composition done according to the specification in Section 3.10 Combining Jamo Behavior. Although coded in Java, the same structure can be used in other programming languages.

Common Constants

    static final int
        SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
        LCount = 19, VCount = 21, TCount = 28,
        NCount = VCount * TCount,   // 588
        SCount = LCount * NCount;   // 11172

Hangul Decomposition

    public static String decomposeHangul(char s) {
        int SIndex = s - SBase;
        if (SIndex < 0 || SIndex >= SCount) {
            return String.valueOf(s);
        }
        StringBuffer result = new StringBuffer();
        int L = LBase + SIndex / NCount;
        int V = VBase + (SIndex % NCount) / TCount;
        int T = TBase + SIndex % TCount;
        result.append((char)L);
        result.append((char)V);
        if (T != TBase) result.append((char)T);
        return result.toString();
    }

Hangul Composition

Notice an important feature of Hangul composition. Whenever the source string is not in Normalization Form D, you can't just detect character sequences of the form <L, V> and <L, V, T>. You also must catch the sequences of the form <LV, T>. To guarantee uniqueness, these sequences must also be composed. This is illustrated in Step 2 below.

    public static String composeHangul(String source) {
        int len = source.length();
        if (len == 0) return "";
        StringBuffer result = new StringBuffer();
        char last = source.charAt(0);            // copy first char
        result.append(last);
        for (int i = 1; i < len; ++i) {
            char ch = source.charAt(i);
            // 1. check to see if two current characters are L and V
            int LIndex = last - LBase;
            if (0 <= LIndex && LIndex < LCount) {
                int VIndex = ch - VBase;
                if (0 <= VIndex && VIndex < VCount) {
                    // make syllable of form LV
                    last = (char)(SBase + (LIndex * VCount + VIndex) * TCount);
                    result.setCharAt(result.length()-1, last); // reset last
                    continue; // discard ch
                }
            }
            // 2. check to see if two current characters are LV and T
            int SIndex = last - SBase;
            if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
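                int TIndex = ch - TBase;
                // TIndex 0 (TBase itself) is not a trailing consonant, and
                // TCount is one past the last valid index
                if (0 < TIndex && TIndex < TCount) {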
                    // make syllable of form LVT
                    last += TIndex;
                    result.setCharAt(result.length()-1, last); // reset last
                    continue; // discard ch
                }
            }
            // if neither case was true, just add the character
            last = ch;
            result.append(ch);
        }
        return result.toString();
    }
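
As a small worked check of the two composition steps, the sketch below composes single pairs rather than whole strings. It is only an illustration: the class and method names (HangulComposeDemo, composePair) are hypothetical, and the constants are the standard Hangul arithmetic values, restated so that the block compiles on its own.

```java
// Sketch only: pairwise Hangul composition, mirroring Steps 1 and 2 above.
final class HangulComposeDemo {
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    static final int LCount = 19, VCount = 21, TCount = 28;
    static final int SCount = LCount * VCount * TCount; // 11172

    /** Returns the composed syllable for the pair, or -1 if the pair does not compose. */
    static int composePair(char first, char second) {
        int LIndex = first - LBase;
        int VIndex = second - VBase;
        if (0 <= LIndex && LIndex < LCount && 0 <= VIndex && VIndex < VCount) {
            return SBase + (LIndex * VCount + VIndex) * TCount; // <L, V> gives LV
        }
        int SIndex = first - SBase;
        int TIndex = second - TBase;
        // TIndex 0 would be TBase itself, which is not a trailing consonant
        if (0 <= SIndex && SIndex < SCount && SIndex % TCount == 0
                && 0 < TIndex && TIndex < TCount) {
            return first + TIndex; // <LV, T> gives LVT
        }
        return -1;
    }

    public static void main(String[] args) {
        int lv = composePair('\u1100', '\u1161');
        System.out.println(Integer.toHexString(lv));                               // ac00
        System.out.println(Integer.toHexString(composePair((char) lv, '\u11A8'))); // ac01
        System.out.println(composePair('\uAC01', '\u11A8'));                       // -1
    }
}
```

The last call returns -1 because U+AC01 already carries a trailing consonant, so it is an LVT syllable and cannot absorb another T.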

Additional transformations can be performed on sequences of Hangul jamo for various purposes. For example, to regularize sequences of Hangul jamo into standard syllables, the choseong and jungseong fillers can be inserted, as described in Chapter 3. (In the text of the 2.0 standard, these standard syllables are called canonical syllables, but this has nothing to do with canonical composition or decomposition.) For keyboard input, additional compositions may be performed. For example, the trailing consonants k_f + s_f may be combined into ks_f. In addition, some Hangul input methods do not require a distinction on input between initial and final consonants, and change between them on the basis of context. For example, in the keyboard sequence m_i + e_m + n_i + s_i + a_m, the consonant n_i would be reinterpreted as n_f, since there is no possible syllable nsa. This results in the two syllables men and sa.

However, none of these additional transformations is considered part of the Unicode Normalization Forms.

Hangul Character Names

Hangul decomposition is also used to form the character names for the Hangul syllables. Here is sample code that illustrates this process:

    public static String getHangulName(char s) {
        int SIndex = s - SBase;
        if (0 > SIndex || SIndex >= SCount) {
            throw new IllegalArgumentException("Not a Hangul Syllable: " + s);
        }
        int LIndex = SIndex / NCount;
        int VIndex = (SIndex % NCount) / TCount;
        int TIndex = SIndex % TCount;
        return "HANGUL SYLLABLE " + JAMO_L_TABLE[LIndex]
          + JAMO_V_TABLE[VIndex] + JAMO_T_TABLE[TIndex];
    }

    static private String[] JAMO_L_TABLE = {
        "G", "GG", "N", "D", "DD", "R", "M", "B", "BB",
        "S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
    };
    
    static private String[] JAMO_V_TABLE = {
        "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O",
        "WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI",
        "YU", "EU", "YI", "I"
    };
    
    static private String[] JAMO_T_TABLE = {
        "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
        "LB", "LS", "LT", "LP", "LH", "M", "B", "BS",
        "S", "SS", "NG", "J", "C", "K", "T", "P", "H"
    };
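
To make the index arithmetic concrete, the following sketch computes the three table indexes for one syllable, U+D4DB. The class and method names are hypothetical, and the constants are restated so that the block stands alone. The resulting indexes 17, 16 and 15 select "P", "WI" and "LH" from the tables above, which gives the name HANGUL SYLLABLE PWILH.

```java
// Sketch only: the index arithmetic used by getHangulName, without the tables.
final class HangulNameIndexDemo {
    static final int SBase = 0xAC00;
    static final int VCount = 21, TCount = 28;
    static final int NCount = VCount * TCount; // 588

    /** Returns {LIndex, VIndex, TIndex}; no range check, so pass a Hangul syllable. */
    static int[] jamoIndexes(char s) {
        int SIndex = s - SBase;
        return new int[] { SIndex / NCount, (SIndex % NCount) / TCount, SIndex % TCount };
    }

    public static void main(String[] args) {
        int[] idx = jamoIndexes('\uD4DB');
        System.out.println(idx[0] + " " + idx[1] + " " + idx[2]); // 17 16 15
    }
}
```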

Normalization Code Sample

This section discusses three different possible approaches to composition. These alternatives are fine composition (i.e., Normalization Form C), coarse composition, and medium composition. Code samples of the first two forms are provided for comparison.

Fine Composition

The following code snippet shows a sample implementation of Normalization Form C. For comparison, this approach is contrasted with some alternative approaches below. For the purposes of discussion we can call this form fine composition. Although coded in Java, the same structure can be used in other programming languages. For a live demonstration of the code, see http://www.macchiato.com/mark/compose/.

    /**
     * Implements the specification described in UTR#15
     * To isolate the relevant features for this example,
     * source is presumed to already be in Normalization Form D.
     */
    static void fineCompose(String source, StringBuffer target) {
        StringBuffer buffer = new StringBuffer();
        for (int i = 0; i < source.length(); ++i) {
            char ch = source.charAt(i);
            int currentClass = charClass(ch);
            int len = buffer.length();
            // check if the new character combines with the first
            // buffer character
            if (len != 0) {
                char composite = pairwiseCombines(buffer.charAt(0), ch);
                if (composite != NOT_A_CHAR          // if combines & !blocked
                  && (len == 1 || charClass(buffer.charAt(len-1)) != currentClass)) {
                    buffer.setCharAt(0, composite);  // then replace first
                    continue;   // done with char for this iteration
                }
            }
            if (charClass(ch) == 0) {   // if zero-class,
                target.append(buffer);  // add buffer to target
                buffer.setLength(0);    // clear buffer
            }
            buffer.append(ch); // add character to buffer
        }
        // add last buffer
        target.append(buffer);
    }

    /**
     * Return the canonical combining class 
     * derived from the Unicode character database.
     */
    static int charClass(char ch) {...}

    /**
     * Return the precomposed character corresponding to the two
     * component characters. Returns NOT_A_CHAR if no such
     * precomposed character exists. Based on the Unicode Character
     * database, but doesn't include the primary excluded characters.
     */
    static char pairwiseCombines(char first, char second) {...}
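
The two helper functions are left unspecified above. As a rough illustration of how the loop behaves, the sketch below fills them in with a toy table covering only three combining marks and three precomposed characters. The table contents and the class name FineComposeDemo are assumptions made for this example; a real implementation would derive both functions from the Unicode Character Database.

```java
// Sketch only: the fineCompose control flow with toy lookup tables.
final class FineComposeDemo {
    static final char NOT_A_CHAR = '\uFFFF';

    // Toy combining classes for three marks; everything else is treated as class 0.
    static int charClass(char ch) {
        switch (ch) {
            case '\u0301': return 230; // COMBINING ACUTE ACCENT
            case '\u0305': return 230; // COMBINING OVERLINE
            case '\u0327': return 202; // COMBINING CEDILLA
            default:       return 0;
        }
    }

    // Toy pair table (hypothetical subset of the real data).
    static char pairwiseCombines(char first, char second) {
        if (first == 'a' && second == '\u0301') return '\u00E1';      // a-acute
        if (first == 'c' && second == '\u0327') return '\u00E7';      // c-cedilla
        if (first == '\u00E7' && second == '\u0301') return '\u1E09'; // c-cedilla-acute
        return NOT_A_CHAR;
    }

    // Same control flow as the sample above; input is presumed to be in Form D.
    static String fineCompose(String source) {
        StringBuilder target = new StringBuilder();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < source.length(); ++i) {
            char ch = source.charAt(i);
            int currentClass = charClass(ch);
            int len = buffer.length();
            if (len != 0) {
                char composite = pairwiseCombines(buffer.charAt(0), ch);
                if (composite != NOT_A_CHAR   // combines and is not blocked
                  && (len == 1 || charClass(buffer.charAt(len - 1)) != currentClass)) {
                    buffer.setCharAt(0, composite);
                    continue;
                }
            }
            if (currentClass == 0) {          // a zero-class character starts a new buffer
                target.append(buffer);
                buffer.setLength(0);
            }
            buffer.append(ch);
        }
        return target.append(buffer).toString();
    }

    public static void main(String[] args) {
        // a, cedilla, acute: the cedilla does not combine with a, but the acute still does
        System.out.println(fineCompose("a\u0327\u0301").equals("\u00E1\u0327")); // true
        // a, overline, acute: the overline has the same class as the acute and blocks it
        System.out.println(fineCompose("a\u0305\u0301").equals("a\u0305\u0301")); // true
    }
}
```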

Coarse Composition

An alternative style of composition was considered, which for the purposes of discussion we can call coarse composition. With this mechanism, a combining character sequence only composes if the entire sequence can be represented by a single precomposed character. This may appear to be a simpler option, but it has the disadvantage that an irrelevant combining mark can cause a precomposed character to break down. As the code samples show, there is actually not much difference in complexity in practice. For a live demonstration of the code, see http://www.macchiato.com/mark/compose/.

    /**
     * Coarse composition is presented here for comparison.
     * To isolate the relevant features for this example,
     * source is presumed to already be in Normalization Form D.
     */
    static void coarseCompose(String source, StringBuffer target) {
        StringBuffer buffer = new StringBuffer();
        for (int i = 0; i < source.length(); ++i) {
            char ch = source.charAt(i);
            boolean isBase = isBaseChar(ch);
            // if the previous chars are a possible sequence,
            // either add them to target, or add the equivalent composite
            if (isBase && buffer.length() != 0) {
                char composite = coarseCombines(buffer);
                if (composite == NOT_A_CHAR) {  // doesn't combine, so
                    target.append(buffer);      // add buffer to target
                } else {                        // does combine, so
                    target.append(composite);   // add composite to target
                }
                buffer.setLength(0);            // clear buffer
            }
            buffer.append(ch);                  // add character to buffer
        }
        // check last buffer
        if (buffer.length() != 0) {
            char composite = coarseCombines(buffer);
            if (composite == NOT_A_CHAR) {  // doesn't combine, so
                target.append(buffer);      // add buffer to target
            } else {                        // does combine, so
                target.append(composite);   // add composite to target
            }
        }
    }

    /**
     * Returns true if the character is a base character.
     */
    static boolean isBaseChar(char ch) {...}

    /**
     * Returns the single precomposed character that the whole buffer
     * corresponds to, or NOT_A_CHAR if there is none. Does not
     * include the primary excluded characters.
     */
    static char coarseCombines(StringBuffer buffer) {...}
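
The breakdown caused by an irrelevant mark can be seen with a toy table. The sketch below assumes a hypothetical coarseCombines that knows only three sequences and an isBaseChar that treats everything except two combining marks as a base; neither reflects real Unicode data beyond those few entries.

```java
// Sketch only: the coarseCompose control flow with toy lookup tables.
final class CoarseComposeDemo {
    static final char NOT_A_CHAR = '\uFFFF';

    // Toy test: only acute (U+0301) and cedilla (U+0327) count as non-base.
    static boolean isBaseChar(char ch) {
        return ch != '\u0301' && ch != '\u0327';
    }

    // Toy whole-sequence table (hypothetical subset of the real data).
    static char coarseCombines(CharSequence buffer) {
        String s = buffer.toString();
        if (s.equals("a\u0301")) return '\u00E1';       // a-acute
        if (s.equals("c\u0327")) return '\u00E7';       // c-cedilla
        if (s.equals("c\u0327\u0301")) return '\u1E09'; // c-cedilla-acute
        return NOT_A_CHAR;
    }

    // Appends either the buffer or its single-character equivalent, then clears it.
    static void flush(StringBuilder buffer, StringBuilder target) {
        char composite = coarseCombines(buffer);
        if (composite == NOT_A_CHAR) {
            target.append(buffer);
        } else {
            target.append(composite);
        }
        buffer.setLength(0);
    }

    static String coarseCompose(String source) {
        StringBuilder target = new StringBuilder();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < source.length(); ++i) {
            char ch = source.charAt(i);
            if (isBaseChar(ch) && buffer.length() != 0) {
                flush(buffer, target);
            }
            buffer.append(ch);
        }
        if (buffer.length() != 0) {
            flush(buffer, target);
        }
        return target.toString();
    }

    public static void main(String[] args) {
        // The whole sequence has a precomposed form, so it composes.
        System.out.println(coarseCompose("c\u0327\u0301").equals("\u1E09"));      // true
        // The cedilla has no precomposed form with a, so nothing composes at all.
        System.out.println(coarseCompose("a\u0327\u0301").equals("a\u0327\u0301")); // true
    }
}
```

With the same toy data, fine composition would turn the second input into a-acute followed by the cedilla.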

Medium Composition

A second alternative style of composition is similar to coarse composition, except that it will combine initial subsequences as long as there are no intervening combining marks. For the purposes of discussion we can call this medium composition. Although this produces better results than coarse composition, it does not do as well as fine composition.
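
No code sample is given for medium composition. The sketch below is one possible reading of the description, not a definitive implementation: a base character absorbs each following mark for as long as the pair composes, and stops at the first mark that does not. The class name, the isMark test and the toy pair table are all assumptions made for this illustration.

```java
// Sketch only: one interpretation of medium composition, with toy lookup tables.
final class MediumComposeDemo {
    static final char NOT_A_CHAR = '\uFFFF';

    // Toy test: only acute, overline and cedilla count as combining marks.
    static boolean isMark(char ch) {
        return ch == '\u0301' || ch == '\u0305' || ch == '\u0327';
    }

    // Toy pair table (hypothetical subset of the real data).
    static char pairwiseCombines(char first, char second) {
        if (first == 'a' && second == '\u0301') return '\u00E1';      // a-acute
        if (first == 'c' && second == '\u0327') return '\u00E7';      // c-cedilla
        if (first == '\u00E7' && second == '\u0301') return '\u1E09'; // c-cedilla-acute
        return NOT_A_CHAR;
    }

    static String mediumCompose(String source) {
        StringBuilder target = new StringBuilder();
        boolean open = false; // true while the last target char may still absorb marks
        for (int i = 0; i < source.length(); ++i) {
            char ch = source.charAt(i);
            if (open && isMark(ch)) {
                int last = target.length() - 1;
                char composite = pairwiseCombines(target.charAt(last), ch);
                if (composite != NOT_A_CHAR) {
                    target.setCharAt(last, composite);
                    continue;
                }
            }
            open = !isMark(ch); // a base reopens; an uncomposed mark closes the sequence
            target.append(ch);
        }
        return target.toString();
    }

    public static void main(String[] args) {
        // The initial subsequence c, cedilla, acute composes; the overline is left over.
        System.out.println(mediumCompose("c\u0327\u0301\u0305").equals("\u1E09\u0305")); // true
        // The cedilla intervenes, so the acute never reaches the a.
        System.out.println(mediumCompose("a\u0327\u0301").equals("a\u0327\u0301"));      // true
    }
}
```

Under this reading the first input composes where coarse composition would not, while the second input stays decomposed even though fine composition would produce a-acute followed by the cedilla.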

Annex: Decomposition

For those accessing this document without access to the Unicode Standard, the following summarizes the canonical decomposition process. For a complete discussion, see Sections 3.6, 3.9 and 3.10.

A sequence of two characters in a string is an aberrant pair if the combining class for the first character is greater than the combining class for the second and the combining class of the second is greater than zero. That is, if CC(first) > CC(second) > 0.

Examples:

Sequence	Combining classes	Status
<acute, cedilla>	230, 202	aberrant, since 230 > 202
<a, acute>	0, 230	not aberrant, since 0 <= 230
<diaeresis, acute>	230, 230	not aberrant, since 230 <= 230
<acute, a>	230, 0	not aberrant, since the second class is zero.

A string is put into canonical order by repeatedly replacing any aberrant pair by the pair in reversed order. When there are no remaining aberrant pairs, then the string is in canonical order. Note that the replacements can be done in any order.
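
The aberrant-pair test and the reordering loop can be sketched as follows. This is a minimal illustration, assuming a toy combiningClass function that knows only three marks; the class and method names are hypothetical. Because the replacements can be done in any order, a simple repeated pass over adjacent pairs is sufficient.

```java
// Sketch only: canonical ordering by repeatedly swapping aberrant pairs.
final class CanonicalOrderDemo {
    // Toy combining classes; everything not listed is treated as class 0.
    static int combiningClass(char ch) {
        switch (ch) {
            case '\u0301': return 230; // COMBINING ACUTE ACCENT
            case '\u0308': return 230; // COMBINING DIAERESIS
            case '\u0327': return 202; // COMBINING CEDILLA
            default:       return 0;
        }
    }

    /** True if CC(first) > CC(second) > 0. */
    static boolean isAberrant(char first, char second) {
        int c1 = combiningClass(first);
        int c2 = combiningClass(second);
        return c1 > c2 && c2 > 0;
    }

    static String canonicalOrder(String s) {
        char[] chars = s.toCharArray();
        boolean swapped;
        do {
            swapped = false;
            for (int i = 0; i + 1 < chars.length; ++i) {
                if (isAberrant(chars[i], chars[i + 1])) {
                    char tmp = chars[i];
                    chars[i] = chars[i + 1];
                    chars[i + 1] = tmp;
                    swapped = true;
                }
            }
        } while (swapped);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(isAberrant('\u0301', '\u0327')); // true
        System.out.println(isAberrant('\u0308', '\u0301')); // false
        System.out.println(canonicalOrder("c\u0301\u0327").equals("c\u0327\u0301")); // true
    }
}
```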

Canonical decomposition is the process of taking a string, recursively replacing composite characters using the Unicode canonical decomposition mappings (including the algorithmic Hangul canonical decomposition mappings), and putting the result in canonical order.

Example:

1. Take the string with the characters "ác´¸" (a-acute, c, acute, cedilla)

2. The data file contains the following relevant information:

    0061;LATIN SMALL LETTER A;...;0;...
    0063;LATIN SMALL LETTER C;...;0;...
    00E1;LATIN SMALL LETTER A WITH ACUTE;...;0;...;0061 0301;...
    0107;LATIN SMALL LETTER C WITH ACUTE;...;0;...;0063 0301;...
    0301;COMBINING ACUTE ACCENT;...;230;...
    0327;COMBINING CEDILLA;...;202;...

3. Applying the canonical decomposition mappings, we get "a´c´¸" (a, acute, c, acute, cedilla).
This is because 00E1 (a-acute) has a canonical decomposition mapping to 0061 0301 (a, acute).

4. Applying the canonical ordering, we get "a´c¸´" (a, acute, c, cedilla, acute)
This is because cedilla has a lower canonical ordering value (202) than acute (230) does. The positions of 'a' and 'c' are not affected, since they have zero canonical ordering values.
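
The result of this example can be cross-checked against an existing library. The sketch below uses java.text.Normalizer, which is available in Java 6 and later; that class implements the published normalization forms rather than this draft, so it is offered only as an independent check of the decomposition and ordering steps. The class name DecompositionExampleCheck is hypothetical.

```java
// Sketch only: cross-checking the worked example with java.text.Normalizer.
final class DecompositionExampleCheck {
    /** Canonical decomposition followed by canonical ordering (Form D). */
    static String decompose(String s) {
        return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String source = "\u00E1c\u0301\u0327";   // a-acute, c, acute, cedilla
        String expected = "a\u0301c\u0327\u0301"; // a, acute, c, cedilla, acute
        System.out.println(decompose(source).equals(expected)); // true
    }
}
```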

Compatibility decomposition is the process of taking a string, replacing composite characters using both the Unicode canonical decomposition mappings and the Unicode compatibility decomposition mappings, and reordering the result according to the Unicode canonical ordering values.

Annex: Trailing Characters

The Trailing Characters table lists the characters in Unicode 3.0 (draft, as of this writing) that may occur in a canonical decomposition of a character, but not as the first character of that decomposition. The inclusion of this table here is informative: the table can be generated from the Unicode Character Database.

If a string does not contain characters in the Trailing Characters table or in the Composition Exclusion Table, then none of its characters participate in compositions, so the only processing required for Normalization Form C is to make sure that the characters are in canonical order. The Other Class Zero Characters table contains all of the Unicode 3.0 characters that have a non-zero canonical class and are in neither the Trailing Characters table nor the Composition Exclusion Table. If a string contains no characters from any of these three tables, then it is already in Normalization Form C.
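
This observation suggests a cheap pre-check before running the full algorithm. The sketch below shows the shape of such a check under loudly stated assumptions: SAMPLE_RANGES holds only a handful of ranges copied from the tables in this annex, it contains nothing from the Composition Exclusion Table, and the class and method names are hypothetical. A real check would need all three tables in full.

```java
// Sketch only: a table-driven "no processing needed" pre-check.
final class QuickNfcCheckDemo {
    // Illustrative SUBSET of ranges, inclusive at both ends. NOT the complete tables.
    static final int[][] SAMPLE_RANGES = {
        {0x0300, 0x0304}, {0x0306, 0x030C}, // from Trailing Characters
        {0x1161, 0x1175}, {0x11A8, 0x11C2}, // from Trailing Characters (Hangul jamo)
        {0x0305, 0x0305}, {0x030D, 0x030E}, // from Other Class Zero Characters
    };

    static boolean inRanges(char ch, int[][] ranges) {
        for (int[] range : ranges) {
            if (range[0] <= ch && ch <= range[1]) {
                return true;
            }
        }
        return false;
    }

    /** True if no character of s falls in any of the given ranges. */
    static boolean needsNoProcessing(String s, int[][] ranges) {
        for (int i = 0; i < s.length(); ++i) {
            if (inRanges(s.charAt(i), ranges)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(needsNoProcessing("abc", SAMPLE_RANGES));          // true
        System.out.println(needsNoProcessing("a\u0301", SAMPLE_RANGES));      // false
        System.out.println(needsNoProcessing("\u1100\u1161", SAMPLE_RANGES)); // false
    }
}
```

A false result does not mean the string is unnormalized; it only means the full algorithm has to be run.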

Trailing Characters

0300 COMBINING GRAVE ACCENT ..0304 COMBINING MACRON
0306 COMBINING BREVE ..030C COMBINING CARON
030F COMBINING DOUBLE GRAVE ACCENT
0311 COMBINING INVERTED BREVE
0313 COMBINING COMMA ABOVE
0314 COMBINING REVERSED COMMA ABOVE
031B COMBINING HORN
0323 COMBINING DOT BELOW ..0328 COMBINING OGONEK
032D COMBINING CIRCUMFLEX ACCENT BELOW
032E COMBINING BREVE BELOW
0330 COMBINING TILDE BELOW
0331 COMBINING MACRON BELOW
0338 COMBINING LONG SOLIDUS OVERLAY
0342 COMBINING GREEK PERISPOMENI
0345 COMBINING GREEK YPOGEGRAMMENI
05B4 HEBREW POINT HIRIQ
05B7 HEBREW POINT PATAH ..05B9 HEBREW POINT HOLAM
05BC HEBREW POINT DAGESH OR MAPIQ
05BF HEBREW POINT RAFE
05C1 HEBREW POINT SHIN DOT
05C2 HEBREW POINT SIN DOT
093C DEVANAGARI SIGN NUKTA
09BC BENGALI SIGN NUKTA
09BE BENGALI VOWEL SIGN AA
09D7 BENGALI AU LENGTH MARK
0A3C GURMUKHI SIGN NUKTA
0B3C ORIYA SIGN NUKTA
0B3E ORIYA VOWEL SIGN AA
0B56 ORIYA AI LENGTH MARK
0B57 ORIYA AU LENGTH MARK
0BBE TAMIL VOWEL SIGN AA
0BD7 TAMIL AU LENGTH MARK
0C56 TELUGU AI LENGTH MARK
0CC2 KANNADA VOWEL SIGN UU
0CD5 KANNADA LENGTH MARK
0CD6 KANNADA AI LENGTH MARK
0D3E MALAYALAM VOWEL SIGN AA
0D57 MALAYALAM AU LENGTH MARK
0DCA SINHALA SIGN AL-LAKUNA
0DCF SINHALA VOWEL SIGN AELA-PILLA
0DDF SINHALA VOWEL SIGN GAYANUKITTA
0E32 THAI CHARACTER SARA AA
0EB2 LAO VOWEL SIGN AA
0F71 TIBETAN VOWEL SIGN AA
0F74 TIBETAN VOWEL SIGN U
0F80 TIBETAN VOWEL SIGN REVERSED I
0FB5 TIBETAN SUBJOINED LETTER SSA
0FB7 TIBETAN SUBJOINED LETTER HA
102E MYANMAR VOWEL SIGN II
1161 HANGUL JUNGSEONG A ..1175 HANGUL JUNGSEONG I
11A8 HANGUL JONGSEONG KIYEOK ..11C2 HANGUL JONGSEONG HIEUH
3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

Other Class Zero Characters

0305 COMBINING OVERLINE
030D COMBINING VERTICAL LINE ABOVE
030E COMBINING DOUBLE VERTICAL LINE ABOVE
0310 COMBINING CANDRABINDU
0312 COMBINING TURNED COMMA ABOVE
0315 COMBINING COMMA ABOVE RIGHT ..031A COMBINING LEFT ANGLE ABOVE
031C COMBINING LEFT HALF RING BELOW ..0322 COMBINING RETROFLEX HOOK BELOW
0329 COMBINING VERTICAL LINE BELOW ..032C COMBINING CARON BELOW
032F COMBINING INVERTED BREVE BELOW
0332 COMBINING LOW LINE ..0337 COMBINING SHORT SOLIDUS OVERLAY
0339 COMBINING RIGHT HALF RING BELOW ..033F COMBINING DOUBLE OVERLINE
0344 COMBINING GREEK DIALYTIKA TONOS
0346 COMBINING BRIDGE ABOVE ..034E COMBINING UPWARDS ARROW BELOW
0360 COMBINING DOUBLE TILDE ..0362 COMBINING DOUBLE RIGHTWARDS ARROW BELOW
0483 COMBINING CYRILLIC TITLO ..0486 COMBINING CYRILLIC PSILI PNEUMATA
0591 HEBREW ACCENT ETNAHTA ..05A1 HEBREW ACCENT PAZER
05A3 HEBREW ACCENT MUNAH ..05B3 HEBREW POINT HATAF QAMATS
05B5 HEBREW POINT TSERE
05B6 HEBREW POINT SEGOL
05BB HEBREW POINT QUBUTS
05BD HEBREW POINT METEG
05C4 HEBREW MARK UPPER DOT
064B ARABIC FATHATAN ..0655 ARABIC HAMZA BELOW
0670 ARABIC LETTER SUPERSCRIPT ALEF
06D6 ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA ..06DC ARABIC SMALL HIGH SEEN
06DF ARABIC SMALL HIGH ROUNDED ZERO ..06E4 ARABIC SMALL HIGH MADDA
06E7 ARABIC SMALL HIGH YEH
06E8 ARABIC SMALL HIGH NOON
06EA ARABIC EMPTY CENTRE LOW STOP ..06ED ARABIC SMALL LOW MEEM
0711 SYRIAC LETTER SUPERSCRIPT ALAPH
0730 SYRIAC PTHAHA ABOVE ..074A SYRIAC BARREKH
094D DEVANAGARI SIGN VIRAMA
0951 DEVANAGARI STRESS SIGN UDATTA ..0954 DEVANAGARI ACUTE ACCENT
09CD BENGALI SIGN VIRAMA
0A4D GURMUKHI SIGN VIRAMA
0ABC GUJARATI SIGN NUKTA
0ACD GUJARATI SIGN VIRAMA
0B4D ORIYA SIGN VIRAMA
0BCD TAMIL SIGN VIRAMA
0C46 TELUGU VOWEL SIGN E
0C4D TELUGU SIGN VIRAMA
0C55 TELUGU LENGTH MARK
0CCD KANNADA SIGN VIRAMA
0D4D MALAYALAM SIGN VIRAMA
0E38 THAI CHARACTER SARA U ..0E3A THAI CHARACTER PHINTHU
0E48 THAI CHARACTER MAI EK ..0E4B THAI CHARACTER MAI CHATTAWA
0E4D THAI CHARACTER NIKHAHIT
0EB8 LAO VOWEL SIGN U
0EB9 LAO VOWEL SIGN UU
0EC8 LAO TONE MAI EK ..0ECB LAO TONE MAI CATAWA
0ECD LAO NIGGAHITA
0F18 TIBETAN ASTROLOGICAL SIGN -KHYUD PA
0F19 TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS
0F35 TIBETAN MARK NGAS BZUNG NYI ZLA
0F37 TIBETAN MARK NGAS BZUNG SGOR RTAGS
0F39 TIBETAN MARK TSA -PHRU
0F72 TIBETAN VOWEL SIGN I
0F7A TIBETAN VOWEL SIGN E ..0F7D TIBETAN VOWEL SIGN OO
0F82 TIBETAN SIGN NYI ZLA NAA DA ..0F84 TIBETAN MARK HALANTA
0F86 TIBETAN SIGN LCI RTAGS
0F87 TIBETAN SIGN YANG RTAGS
1037 MYANMAR SIGN DOT BELOW
1039 MYANMAR SIGN VIRAMA
17B5 KHMER VOWEL INHERENT AA
17D2 KHMER SIGN COENG
18A9 MONGOLIAN LETTER AG DAGALGA
20D0 COMBINING LEFT HARPOON ABOVE ..20DC COMBINING FOUR DOTS ABOVE
20E1 COMBINING LEFT RIGHT ARROW ABOVE
302A IDEOGRAPHIC LEVEL TONE MARK ..302F HANGUL DOUBLE DOT TONE MARK
FB1E HEBREW POINT JUDEO-SPANISH VARIKA
FE20 COMBINING LIGATURE LEFT HALF ..FE23 COMBINING DOUBLE TILDE RIGHT HALF

Annex: Intellectual Property

Transcript of letter regarding disclosure of IBM Technology
(Hard copy is on file with the Chair of UTC and the Chair of NCITS/L2)
Transcribed on 1998-03-10

February 26, 1999

The Chair, Unicode Technical Committee

Subject: Disclosure of IBM Technology - Unicode Normalization Forms

The attached document entitled "Unicode Normalization Forms" does not require IBM technology, but may be implemented using IBM technology that has been filed for US Patent. However, IBM believes that the technology could be beneficial to the software community at large, especially with respect to usage on the Internet, allowing the community to derive the enormous benefits provided by Unicode.

This letter is to inform you that IBM is pleased to make the Unicode normalization technology that has been filed for patent freely available to anyone using them in implementing to the Unicode standard.

Sincerely,

W. J. Sullivan,
Acting Director of National Language Support
and Information Development

Copyright

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

	When X blocks Y, changing the order of X and Y would result in a character sequence that is not canonically equivalent to the original. See Section 3.9 Canonical Ordering Behavior in the Unicode Standard.
	If a combining character sequence is in canonical order, then testing whether a character is blocked only requires looking at the immediately preceding character.

	Typically there is a single accepted invertible transcoding for a given legacy character set. In a few cases there may be multiple invertible transcodings: for example, JIS may have two different mappings used in different circumstances: one to preserve the '/' semantics of 2F₁₆, and one to preserve the '¥' semantics.
	The character indexes in the legacy character set string may be very different from the character indexes in the Unicode equivalent. For example, if a legacy string uses visual encoding for Hebrew, then its first character might be the last character in the Unicode string.