PROPOSED DRAFT Unicode Technical Report #15

Unicode Composition

Revision 1.3
Authors Mark Davis (mark@unicode.org)
Date 1998-05-20
This Version http://www.unicode.org/unicode/reports/pdtr15.html
Latest Version http://www.unicode.org/unicode/reports/pdtr15.html

Summary

This document describes normalized composition formats for Unicode text. The design is in its initial phase. We welcome review feedback and corrigenda.

Status of this document

This document is a proposed draft, posted for review by the members of the Unicode Technical Committee (UTC). At its next meeting, the UTC may reject this document, review it for suitability to progress to draft status, and/or further amend it. Please mail any comments to the authors.


Introduction

The Unicode Standard, Version 2.0 defines a canonical decomposition format, which can be used as a normalization for interchanging text. This format allows for binary comparison while maintaining canonical equivalence (see Section 3.9, Canonical Ordering Behavior).

The standard also defines a compatibility decomposition format, which allows for binary comparison while maintaining compatibility equivalence. The latter can also be useful in many circumstances, since it levels the differences between compatibility characters which are inappropriate in those circumstances. For example, the half-width and full-width katakana characters will have the same compatibility decomposition and are thus compatibility equivalents; however, they are not canonical equivalents.
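As an illustration only, the difference between the two decomposition formats can be observed with Python's standard unicodedata module. This is a sketch under the assumption that the module's "NFD" and "NFKD" forms, which come from later versions of Unicode, correspond to the canonical and compatibility decompositions discussed here. The function names are hypothetical.

```python
import unicodedata

HALFWIDTH_KA = "\uFF76"  # HALFWIDTH KATAKANA LETTER KA
FULLWIDTH_KA = "\u30AB"  # KATAKANA LETTER KA

def canonical_decomposition(s):
    # Assumption: "NFD" stands in for the canonical decomposition format.
    return unicodedata.normalize("NFD", s)

def compatibility_decomposition(s):
    # Assumption: "NFKD" stands in for the compatibility decomposition format.
    return unicodedata.normalize("NFKD", s)

# Compatibility equivalents compare equal after compatibility decomposition.
same_compat = (compatibility_decomposition(HALFWIDTH_KA)
               == compatibility_decomposition(FULLWIDTH_KA))

# They are not canonical equivalents, so canonical decomposition keeps them apart.
same_canon = (canonical_decomposition(HALFWIDTH_KA)
              == canonical_decomposition(FULLWIDTH_KA))

print(same_compat, same_canon)
```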

In many environments, there is the additional requirement for a composition format, which can also be used as a normalization for interchanging text, but which generally maintains the precomposed characters. This format should also allow for binary comparison while maintaining Unicode equivalence. It will probably become required for much WWW content, and might also be adopted elsewhere, in particular by many IETF protocols. It is also useful in the case of programming language identifiers that can contain Unicode characters.

As with decomposition, there are two forms of composition, canonical composition and compatibility composition. The former maintains canonical equivalence, while the latter maintains compatibility equivalence.

Compatibility equivalence can be even more useful in normalization, since it levels the differences between compatibility characters which are inappropriately distinguished in many circumstances. For example, the half-width and full-width katakana characters will have the same compatibility composition, as will Roman numerals and their letter equivalents. More complete examples are provided below.

Versioning

Because additional precomposed characters may be added to future versions of the Unicode standard, composition is less stable than decomposition. Therefore, it is necessary to specify the version of the composition process, so that implementations can get the same result for normalization even if they upgrade to a new version of Unicode.

For example, suppose that Unicode version 3.0 adds the precomposed character H-caron. For an implementation that uses Unicode version 3.0 and composition version 2.1.3, canonically composed strings will continue to contain the sequence H + caron, and not the new character H-caron.
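The effect of pinning the composition version can be sketched with a toy lookup table. Everything in this sketch is hypothetical: the table contents, the function name, and the code point used as a placeholder for the supposed H-caron are assumptions made for illustration, not data taken from any version of the Unicode Character Database.

```python
# Hypothetical composition tables, keyed by composition version.
# Each table maps (base character, combining mark) to a precomposed character.
COMBINING_CARON = "\u030C"
C_CARON = "\u010C"   # LATIN CAPITAL LETTER C WITH CARON
H_CARON = "\u021E"   # placeholder code point for the supposed H-caron

COMPOSITION_TABLES = {
    "2.1.3": {("C", COMBINING_CARON): C_CARON},
    "3.0": {("C", COMBINING_CARON): C_CARON,
            ("H", COMBINING_CARON): H_CARON},
}

def compose_pair(base, mark, composition_version):
    """Compose one pair under the given composition version.

    Returns the precomposed character if that version's table has one,
    otherwise the unchanged two-character sequence.
    """
    table = COMPOSITION_TABLES[composition_version]
    return table.get((base, mark), base + mark)

# An implementation pinned to composition version 2.1.3 keeps H + caron
# as a sequence, even if its character database is newer.
pinned = compose_pair("H", COMBINING_CARON, "2.1.3")
newer = compose_pair("H", COMBINING_CARON, "3.0")
print(len(pinned), len(newer))
```

Two implementations that agree on the composition version therefore produce the same normalized string, whichever Unicode version each one otherwise supports.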

Process

Typical strings of precomposed accented Unicode characters are already in canonical composed format. However, there are circumstances with possible ambiguities, which require the precise specification given in this document.

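Logically, the process of forming a composition involves:

  1. Generating the canonical or compatibility decomposition of the string.
  2. Iterating through that decomposition and combining each combining character with the previous base character wherever the conditions given in the Specification section are met.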

This is the logical description of the process--implementations are free to use more efficient algorithms as long as the result is the same.

Note:

In the following examples, we will use the "...\uXXXX..." notation to represent the Unicode character U+XXXX embedded within a string.

There is an asymmetry between canonical composition and compatibility composition. A compatibility composition is the equivalent of taking a compatibility decomposition, then applying a canonical composition. It does not attempt to map characters to precomposed compatibility forms. For example, a compatibility composition of "office" does not produce "o\uFB03ce", even though "\uFB03" is a compatibility character whose compatibility decomposition is the three characters "ffi".
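The asymmetry can be sketched with Python's standard unicodedata module. This assumes that the module's "NFKD" and "NFC" forms, which come from later versions of Unicode, approximate the compatibility decomposition and canonical composition described in this document. The function name is hypothetical.

```python
import unicodedata

def compatibility_composition(s):
    # Step 1 (assumption): "NFKD" stands in for the compatibility decomposition.
    decomposed = unicodedata.normalize("NFKD", s)
    # Step 2 (assumption): "NFC" stands in for the canonical composition.
    return unicodedata.normalize("NFC", decomposed)

plain = "office"
with_ligature = "o\uFB03ce"  # contains U+FB03 LATIN SMALL LIGATURE FFI

# The ligature is never introduced into text that did not contain it.
keeps_plain = compatibility_composition(plain) == plain

# A ligature already present is decomposed and is not recomposed.
drops_ligature = compatibility_composition(with_ligature) == plain

print(keeps_plain, drops_ligature)
```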

Definitions

  1. A combining character C can be canonically combined with a base character B if there is a character X such that the sequence of characters [B, C] is canonically equivalent to X, according to the rules in The Unicode Standard, Chapter 3 and the decomposition mappings in UCD2.1.3.

    In such a case, X is said to be the canonical composition of B and C.
     
  2. A sequence of Unicode characters is canonically composed if it is bit-for-bit identical with its canonical composition, as described in the specification below.
     
  3. A sequence of Unicode characters is compatibility composed if it is bit-for-bit identical with its compatibility composition, as described in the specification below.

Notes:

  • As far as these algorithms are concerned, a base character is one with a combining class of zero (from Table 4-3).
  • A composition of characters B and C may be written as B-C.
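These definitions can be sketched as small predicates over Python's standard unicodedata module. Two assumptions apply. The module reports the data of whichever Unicode Character Database version ships with Python, not version 2.1.3. And for simplicity a combining character is taken here to be any character with a non-zero combining class. The function names are hypothetical.

```python
import unicodedata

def is_base(ch):
    # For these algorithms a base character has a combining class of zero.
    return unicodedata.combining(ch) == 0

def canonical_pair(x):
    """Return (B, C) if X has a two-character canonical mapping, else None."""
    mapping = unicodedata.decomposition(x)
    if not mapping or mapping.startswith("<"):
        return None  # no mapping at all, or a tagged compatibility mapping
    parts = [chr(int(code, 16)) for code in mapping.split()]
    return tuple(parts) if len(parts) == 2 else None

def is_canonical_composition_of(x, b, c):
    """True if X is the canonical composition of base B and combining character C."""
    return is_base(b) and not is_base(c) and canonical_pair(x) == (b, c)

D_DOT_ABOVE = "\u1E0A"         # LATIN CAPITAL LETTER D WITH DOT ABOVE
COMBINING_DOT_ABOVE = "\u0307"

print(is_canonical_composition_of(D_DOT_ABOVE, "D", COMBINING_DOT_ABOVE))
```

The ff ligature (U+FB00) fails this check for the pair [f, f], because its mapping is a compatibility mapping rather than a canonical one.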

Conformance

A process that produces Unicode text that purports to be canonically composed shall do so in accordance with the specifications in this document.

A process that produces Unicode text that purports to be compatibility composed shall do so in accordance with the specifications in this document.

This specification is written in terms of a process for producing a canonical composition from an arbitrary Unicode string. This is a logical description--particular implementations can have more efficient mechanisms as long as they produce the same result.

A process that tests Unicode text to determine whether it is a canonical composition shall do so in accordance with the specifications in this document.

A process that tests Unicode text to determine whether it is a compatibility composition shall do so in accordance with the specifications in this document.

As above, this specification provides a logical description of how to test a string to determine whether it is in a canonical composed format. Such a test can be implemented without applying this process at all, as long as the result is the same as if the process had been applied.
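The most direct form of such a test applies the process and compares the result with the input. The sketch below does this with Python's standard unicodedata module, assuming that its "NFC" and "NFKC" forms (from later versions of Unicode) approximate the canonical and compatibility compositions of this document. The function names are hypothetical, and a faster test would be equally conformant as long as it gives the same answers.

```python
import unicodedata

def is_canonically_composed(s):
    # Assumption: "NFC" approximates the canonical composition.
    return unicodedata.normalize("NFC", s) == s

def is_compatibility_composed(s):
    # Assumption: "NFKC" approximates the compatibility composition.
    return unicodedata.normalize("NFKC", s) == s

precomposed = "\u1E0A"   # D-dot_above
sequence = "D\u0307"     # D + dot_above
ligature = "\uFB00"      # ff ligature

print(is_canonically_composed(precomposed))
print(is_canonically_composed(sequence))
print(is_canonically_composed(ligature), is_compatibility_composed(ligature))
```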


Specification

A canonical composition for a string S is defined by the following process.

  1. Generate the canonical decomposition for the source string S, according to the rules in The Unicode Standard, Chapter 3 and the decomposition mappings in the latest supported version of the Unicode Character Database.
  2. Iterate through that decomposition character by character. At each combining character C, check the following conditions with regard to C and the previous base character B, according to the Unicode Character Database version 2.1.3.
    1. C can be canonically combined with B.
    2. There are no characters between C and B that have the same combining class as C.
    If both conditions hold, replace B with the canonical composition B-C and remove C.

The result of this process is a new string S' which is the canonical composition of S under A
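The canonical composition process can be sketched in Python as follows. This is an illustration under several stated assumptions, not a reference implementation. It assumes that when both conditions hold, B is replaced by the composition B-C and C is removed. It uses "NFD" from the standard unicodedata module as a stand-in for step 1. It builds its pair table from the database version bundled with Python rather than version 2.1.3, and only for code points below U+10000. It treats a combining character as one with a non-zero combining class, and it does not handle Hangul. The names build_pair_table, PAIRS and canonical_composition are hypothetical.

```python
import unicodedata

def build_pair_table():
    """Map (B, C) to X for each two-character canonical mapping below U+10000."""
    table = {}
    for code_point in range(0x10000):
        x = chr(code_point)
        mapping = unicodedata.decomposition(x)
        if not mapping or mapping.startswith("<"):
            continue  # no mapping, or a tagged compatibility mapping
        parts = mapping.split()
        if len(parts) == 2:
            b, c = (chr(int(p, 16)) for p in parts)
            table[(b, c)] = x
    return table

PAIRS = build_pair_table()

def canonical_composition(s):
    out = []        # characters of the result string S'
    base_at = None  # index in `out` of the previous base character B
    # Step 1 (assumption): "NFD" stands in for the canonical decomposition.
    for ch in unicodedata.normalize("NFD", s):
        cc = unicodedata.combining(ch)
        if cc == 0:
            base_at = len(out)  # ch becomes the new B
            out.append(ch)
            continue
        if base_at is not None:
            # Condition 1: C can be canonically combined with B.
            x = PAIRS.get((out[base_at], ch))
            # Condition 2: nothing between B and C shares C's combining class.
            same_class_between = any(unicodedata.combining(m) == cc
                                     for m in out[base_at + 1:])
            if x is not None and not same_class_between:
                out[base_at] = x  # assumed action: replace B by B-C, drop C
                continue
        out.append(ch)
    return "".join(out)

# D-dot_above + dot_below: the dot_below combines, the dot_above is left over.
print(canonical_composition("\u1E0A\u0323") == "\u1E0C\u0307")
```

Because marks that combine are removed from the output as the loop proceeds, the characters remaining between B and C are exactly the ones that failed to combine, which is what condition 2 inspects.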

A compatibility composition for a string S is defined by the following process.

  1. Generate the compatibility decomposition for the source string S, according to the rules in The Unicode Standard, Chapter 3 and the decomposition mappings in the latest supported version of the Unicode Character Database.
  2. Iterate through that decomposition character by character. At each combining character C, check the following conditions with regard to C and the previous base character B, according to the Unicode Character Database version 2.1.3.
    1. C can be canonically combined with B.
    2. There are no characters between C and B that have the same combining class as C.
    If both conditions hold, replace B with the canonical composition B-C and remove C.

The result of this process is a new string S' which is the compatibility composition of S.

Examples

In the examples, the following conventions are used for brevity:

Canonical Composition Examples:

     Original                             =>  Decomposed                           =>  Composed

  a  D-dot_above                          =>  D + dot_above                        =>  D-dot_above
  b  D + dot_above                        =>  D + dot_above                        =>  D-dot_above
  c  D-dot_below + dot_above              =>  D + dot_below + dot_above            =>  D-dot_below + dot_above
  d  D-dot_above + dot_below              =>  D + dot_below + dot_above            =>  D-dot_below + dot_above
  e  D + dot_above + dot_below            =>  D + dot_below + dot_above            =>  D-dot_below + dot_above
  f  D + dot_above + cedilla + dot_below  =>  D + cedilla + dot_below + dot_above  =>  D-dot_below + cedilla + dot_above
  g  E-macron-grave                       =>  E + macron + grave                   =>  E-macron-grave
  h  E-macron + grave                     =>  E + macron + grave                   =>  E-macron-grave
  i  E-grave + macron                     =>  E + grave + macron                   =>  E-grave + macron
  j  "Äffin"                              =>  "A\u0308ffin"                        =>  "Äffin"
  k  "Ä\uFB00in"                          =>  "A\u0308\uFB00in"                    =>  "Ä\uFB00in"
  l  "Henry IV"                           =>  "Henry IV"                           =>  "Henry IV"
  m  "Henry \u2163"                       =>  "Henry \u2163"                       =>  "Henry \u2163"

Notes:

  1. In examples (c, d, e, f), by the time we have gotten to dot_above, it cannot be combined with the base character.
  2. There may be intervening combining marks, as in (f), so long as the result of the combination is canonically equivalent.
  3. Multiple combining characters are combined with successive base characters, as in (g,h).
  4. Characters will not be combined if they would not be canonical equivalents. Thus (g,h) do not have the same result as (i).
  5. In examples (j,k), the ff ligature (U+FB00) is not decomposed, since it has a compatibility mapping, not a canonical mapping. Thus the resulting compositions are not identical. (See compatibility compositions below for comparison.)
  6. Similarly, the ROMAN NUMERAL IV (U+2163) is not decomposed in example (m).

Compatibility Composition Examples:

     Original               =>  Decomposed          =>  Composed

  n  hw_ka + hw_ten         =>  ka + combining_ten  =>  ga
  o  ka + hw_ten            =>  ka + combining_ten  =>  ga
  p  hw_ka + combining_ten  =>  ka + combining_ten  =>  ga
  q  ka + combining_ten     =>  ka + combining_ten  =>  ga
  r  ga                     =>  ka + combining_ten  =>  ga
  s  "Äffin"                =>  "A\u0308ffin"       =>  "Äffin"
  t  "Ä\uFB00in"            =>  "A\u0308ffin"       =>  "Äffin"
  u  "Henry IV"             =>  "Henry IV"          =>  "Henry IV"
  v  "Henry \u2163"         =>  "Henry IV"          =>  "Henry IV"

Notes:

  1. Cases (a-h) above are the same for canonical and compatibility examples, and are not repeated here.
  2. Cases (n,o) depend on a change incorporated in the Unicode Character Database 2.1.3, to make hw_ten have a compatibility mapping to combining_ten (instead of spacing_ten).
  3. In examples (s,t), the ff ligature (U+FB00) is decomposed, whereas it is not in a canonical composition. Thus the resulting compositions are identical.
  4. Similarly the strings in (u,v) have the same compatibility composition.
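The compatibility examples can be spot-checked with Python's standard unicodedata module. This sketch assumes that the module's "NFKC" and "NFC" forms, which come from later versions of Unicode, approximate the compatibility and canonical compositions of this document. The code points chosen for hw_ka, hw_ten, ka, combining_ten and ga are our reading of the abbreviations used in the table, and the function names are hypothetical.

```python
import unicodedata

def compat_compose(s):
    # Assumption: "NFKC" approximates the compatibility composition.
    return unicodedata.normalize("NFKC", s)

def canon_compose(s):
    # Assumption: "NFC" approximates the canonical composition.
    return unicodedata.normalize("NFC", s)

HW_KA = "\uFF76"          # HALFWIDTH KATAKANA LETTER KA
HW_TEN = "\uFF9E"         # HALFWIDTH KATAKANA VOICED SOUND MARK
KA = "\u30AB"             # KATAKANA LETTER KA
COMBINING_TEN = "\u3099"  # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
GA = "\u30AC"             # KATAKANA LETTER GA

# Inputs of examples (n) to (r); each is expected to compose to ga.
kana_inputs = [HW_KA + HW_TEN, KA + HW_TEN, HW_KA + COMBINING_TEN,
               KA + COMBINING_TEN, GA]
kana_results = [compat_compose(s) for s in kana_inputs]
print(all(result == GA for result in kana_results))

# The ff ligature and ROMAN NUMERAL FOUR survive canonical composition
# but are replaced by their letter equivalents in compatibility composition.
FF_LIGATURE = "\uFB00"
ROMAN_FOUR = "\u2163"
print(canon_compose(FF_LIGATURE) == FF_LIGATURE, compat_compose(FF_LIGATURE) == "ff")
print(canon_compose(ROMAN_FOUR) == ROMAN_FOUR, compat_compose(ROMAN_FOUR) == "IV")
```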

Copyright

Copyright © 1998 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained in or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.


Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/techreports.html