|Authors||Mark Davis (firstname.lastname@example.org)|
This document describes normalized composition formats for Unicode text. The design is in its initial phase. We welcome review feedback and corrigenda.
Status of this document
This document is a proposed draft, posted for review by the members of the Unicode Technical Committee (UTC). At its next meeting, the UTC may reject this document, review it for suitability to progress to draft status, and/or further amend it. Please mail any comments to the authors.
The Unicode Standard, Version 2.0 defines a canonical decomposition format, which can be used as a normalization for interchanging text. This format allows for binary comparison while maintaining canonical equivalence (see Section 3.9, Canonical Ordering Behavior).
The standard also defines a compatibility decomposition format, which allows for binary comparison while maintaining compatibility equivalence. The latter can also be useful in many circumstances, since it levels the differences between compatibility characters which are inappropriate in those circumstances. For example, the half-width and full-width katakana characters will have the same compatibility decomposition and are thus compatibility equivalents; however, they are not canonical equivalents.
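The difference between the two decompositions can be observed with a short sketch. It assumes that Python's standard `unicodedata` module, whose "NFD" and "NFKD" forms are the later standardized canonical and compatibility decompositions, is an acceptable stand-in for the formats described here; the helper names are illustrative only.

```python
import unicodedata

HALFWIDTH_KA = "\uFF76"  # HALFWIDTH KATAKANA LETTER KA
FULLWIDTH_KA = "\u30AB"  # KATAKANA LETTER KA

def canonical_decomposition(text: str) -> str:
    # Assumption: "NFD" stands in for the canonical decomposition format.
    return unicodedata.normalize("NFD", text)

def compatibility_decomposition(text: str) -> str:
    # Assumption: "NFKD" stands in for the compatibility decomposition format.
    return unicodedata.normalize("NFKD", text)

# The two characters stay distinct under canonical decomposition...
same_canonical = (canonical_decomposition(HALFWIDTH_KA)
                  == canonical_decomposition(FULLWIDTH_KA))
# ...but become identical under compatibility decomposition.
same_compatibility = (compatibility_decomposition(HALFWIDTH_KA)
                      == compatibility_decomposition(FULLWIDTH_KA))

print(same_canonical, same_compatibility)
```

A binary comparison after compatibility decomposition therefore treats the two characters as equal, while a comparison after canonical decomposition does not.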
In many environments, there is the additional requirement for a composition format, which can also be used as a normalization for interchanging text, but which generally maintains the precomposed characters. This format should also allow for binary comparison while maintaining Unicode equivalence. It will probably become required for much of the WWW content, and might also be adopted elsewhere, in particular by many IETF protocols. It is also useful in the case of programming language identifiers that can contain Unicode characters.
As with decomposition, there are two forms of composition, canonical composition and compatibility composition. The former maintains canonical equivalence, while the latter maintains compatibility equivalence.
The compatibility equivalence can be even more useful in normalization, since it levels the differences between compatibility characters which are inappropriately distinguished in many circumstances. For example, the half-width and full-width katakana characters will have the same compatibility composition, as will Roman Numerals and their letter equivalents. More complete examples are provided below.
Because additional precomposed characters may be added to future versions of the Unicode standard, composition is less stable than decomposition. Therefore, it is necessary to specify the version of the composition process, so that implementations can get the same result for normalization even if they upgrade to a new version of Unicode.
For example, suppose that Unicode version 3.0 adds the precomposed character H-caron. For an implementation that uses Unicode version 3.0 and composition version 2.1.3, canonically composed strings will continue to contain the sequence H + caron, and not the new character H-caron.
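The effect of fixing a composition version can be sketched as follows. The table, the version tuples, and the function name `compose_pair` are hypothetical and exist only for illustration; a real implementation would derive its pairs from the Unicode Character Database for the chosen composition version.

```python
# Hypothetical, hand-written table for illustration only:
# (base, combining mark) -> (precomposed character, version that added it).
COMPOSITES = {
    ("D", "\u0307"): ("\u1E0A", (1, 1)),  # D-dot_above, assumed to predate 2.1.3
    ("H", "\u030C"): ("\u021E", (3, 0)),  # H-caron, added in 3.0 as in the example
}

def compose_pair(base: str, mark: str, composition_version: tuple) -> str:
    """Compose base + mark only if the precomposed character already
    existed in the given composition version; otherwise keep the sequence."""
    entry = COMPOSITES.get((base, mark))
    if entry is not None and entry[1] <= composition_version:
        return entry[0]
    return base + mark

# Composition version 2.1.3: H + caron stays a two-character sequence.
frozen = compose_pair("H", "\u030C", (2, 1, 3))
# Composition version 3.0: the same input becomes the precomposed H-caron.
upgraded = compose_pair("H", "\u030C", (3, 0))

print(len(frozen), len(upgraded))
```

Two implementations that agree on the composition version produce the same string, whichever version of the character database they carry.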
Typical strings of precomposed accented Unicode characters are already in canonical composed format. However, there are circumstances with possible ambiguities, which require the precise specification in this document.
Logically, the process of forming a composition involves:
This is the logical description of the process--implementations are free to use more efficient algorithms as long as the result is the same.
In the following examples, we will use the "...\uXXXX..." notation to represent the Unicode character U+XXXX embedded within a string.
There is an asymmetry between canonical composition and compatibility composition. A compatibility composition is the equivalent of taking a compatibility decomposition, then applying a canonical composition. It does not attempt to map characters to precomposed compatibility forms. For example, a compatibility composition of "office" does not produce "o\uFB03ce", even though "\uFB03" is a compatibility character whose compatibility decomposition is the three characters "ffi".
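The two-step description of compatibility composition can be sketched as below. The sketch assumes that the "NFKD" and "NFC" forms in Python's standard `unicodedata` module are acceptable stand-ins for the compatibility decomposition and canonical composition of this document; `compatibility_composition` is an illustrative name.

```python
import unicodedata

def compatibility_composition(text: str) -> str:
    # Step 1: compatibility decomposition (assumed equivalent: "NFKD").
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: canonical composition of the result (assumed equivalent: "NFC").
    return unicodedata.normalize("NFC", decomposed)

ligature_input = "o\uFB03ce"  # contains LATIN SMALL LIGATURE FFI
plain_input = "office"

# The ligature is expanded, and plain letters are never turned into a ligature.
from_ligature = compatibility_composition(ligature_input)
from_plain = compatibility_composition(plain_input)

print(from_ligature == plain_input, from_plain == plain_input)
```

Because the second step applies only canonical mappings, nothing in the process can reintroduce a compatibility character such as "\uFB03".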
A process that produces Unicode text that purports to be canonically composed shall do so in accordance with the specifications in this document.
A process that produces Unicode text that purports to be compatibility composed shall do so in accordance with the specifications in this document.
This specification is written in terms of a process for producing a canonical composition from an arbitrary Unicode string. This is a logical description--particular implementations can have more efficient mechanisms as long as they produce the same result.
A process that tests Unicode text to determine whether it is a canonical composition shall do so in accordance with the specifications in this document.
A process that tests Unicode text to determine whether it is a compatibility composition shall do so in accordance with the specifications in this document.
As above, this specification provides a logical description of how to test a string to determine whether it is in a canonical composed format. Such a test can be implemented without applying this process at all, as long as the result is the same as if the process had been applied.
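One simple way to realize such a test is to apply the process and compare the result with the input. The sketch below takes that approach, assuming that the "NFC" and "NFKC" forms in Python's standard `unicodedata` module are acceptable stand-ins for canonical and compatibility composition; the function names are illustrative.

```python
import unicodedata

def is_canonically_composed(text: str) -> bool:
    # A string is in the format exactly when composing it changes nothing.
    return unicodedata.normalize("NFC", text) == text

def is_compatibility_composed(text: str) -> bool:
    return unicodedata.normalize("NFKC", text) == text

samples = {
    "precomposed": "\u1E0A",         # D-dot_above
    "sequence": "D\u0307",           # D + dot_above
    "roman_numeral": "Henry \u2163"  # contains ROMAN NUMERAL FOUR
}

for name, text in samples.items():
    print(name, is_canonically_composed(text), is_compatibility_composed(text))
```

A faster test that avoids building the composed string is permitted, provided it always returns the same answer as this comparison.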
The result of this process is a new string S' which is the canonical composition of S under A.
The result of this process is a new string S' which is the compatibility composition of S.
In the examples, the following conventions are used for brevity:
|a||D-dot_above||D + dot_above||D-dot_above|
|b||D + dot_above||D + dot_above||D-dot_above|
|c||D-dot_below + dot_above||D + dot_below + dot_above||D-dot_below + dot_above|
|d||D-dot_above + dot_below||D + dot_below + dot_above||D-dot_below + dot_above|
|e||D + dot_above + dot_below||D + dot_below + dot_above||D-dot_below + dot_above|
|f||D + dot_above + cedilla + dot_below||D + cedilla + dot_below + dot_above||D-dot_below + cedilla + dot_above|
|g||E-macron-grave||E + macron + grave||E-macron-grave|
|h||E-macron + grave||E + macron + grave||E-macron-grave|
|i||E-grave + macron||E + grave + macron||E-grave + macron|
|l||"Henry IV"||"Henry IV"||"Henry IV"|
|m||"Henry \u2163"||"Henry \u2163"||"Henry \u2163"|
|n||hw_ka + hw_ten||ka + combining_ten||ga|
|o||ka + hw_ten||ka + combining_ten||ga|
|p||hw_ka + combining_ten||ka + combining_ten||ga|
|q||ka + combining_ten||ka + combining_ten||ga|
|r||ga||ka + combining_ten||ga|
|u||"Henry IV"||"Henry IV"||"Henry IV"|
|v||"Henry \u2163"||"Henry IV"||"Henry IV"|
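Several of the rows above can be checked mechanically. The sketch below uses Python's standard `unicodedata` module and assumes its "NFD"/"NFC" and "NFKD"/"NFKC" forms as stand-ins for the decomposed and composed columns. Those forms follow the later standardized algorithm and a current character database, so rows not listed here may come out differently than in this draft.

```python
import unicodedata

DOT_ABOVE, DOT_BELOW = "\u0307", "\u0323"
MACRON, GRAVE = "\u0304", "\u0300"

def forms(text: str, compatibility: bool = False) -> tuple:
    """Return (decomposed, composed) for text."""
    d, c = ("NFKD", "NFKC") if compatibility else ("NFD", "NFC")
    return unicodedata.normalize(d, text), unicodedata.normalize(c, text)

# Row d: D-dot_above + dot_below
row_d = forms("\u1E0A" + DOT_BELOW)
# Row h: E-macron + grave
row_h = forms("\u0112" + GRAVE)
# Row i: E-grave + macron
row_i = forms("\u00C8" + MACRON)
# Row n: hw_ka + hw_ten (compatibility forms)
row_n = forms("\uFF76\uFF9E", compatibility=True)
# Row v: "Henry \u2163" (compatibility forms)
row_v = forms("Henry \u2163", compatibility=True)

for label, row in [("d", row_d), ("h", row_h), ("i", row_i),
                   ("n", row_n), ("v", row_v)]:
    # Print code points so that combining marks remain visible.
    print(label, [" ".join(f"U+{ord(ch):04X}" for ch in s) for s in row])
```

Rows d and i show that the order of the combining marks determines which precomposed character, if any, can be formed.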
Copyright © 1998-1998 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
Unicode Home Page: http://www.unicode.org
Unicode Technical Reports: http://www.unicode.org/unicode/techreports.html