|Authors||Ken Whistler, Asmus Freytag (email@example.com)|
|This Version||Working Draft|
|Latest Version||Working Draft|
This is the first working draft of a proposed Unicode Conformance Model.
Status
This document is a working draft for a proposed Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
The Unicode Standard [Unicode] is a very large and complex standard. Because of this, and because of the nature and role of the standard, it is often rather difficult to determine, in any particular case, just exactly what conformance to the Unicode Standard means.
The Unicode Standard is a fundamental low level component of potentially all software processes and protocols related to text. It forms the foundation which supports a large variety of operations on textual data, from data interchange to complex tasks like sorting, rendering or content analysis. All of these expose implementations to the complexities of human languages and writing systems.
Where existing character sets were either small, or had a well-limited field of application (for example, by geographical area), or both, the Unicode Standard aims to be universal. It can no longer rely on implicit agreements on the nature and behavior of the characters it encodes, but must provide explicit constraints. At the same time, the standard has to allow implementations the necessary flexibility to address the expectations of their users, while providing enough constraints to guarantee consistency between implementations and predictable interchange of data.
This Conformance Model has been developed to explain the issue of conformance relating to the Unicode Standard so that users are better able to understand in which contexts products are making claims for support of the Unicode Standard and implementers get a better understanding of how to meet the formal conformance requirements while satisfying the expectations of their users.
This model defines terminology regarding the topic of conformance, specifies different areas and levels of conformance, and describes what it means to make a claim of conformance or "support" of the standard. This model is not, in itself, a framework for compliance testing, although it could be used to develop such a framework, should that prove desirable.
This Conformance Model does not alter, augment or override the actual Unicode Conformance requirements found in the text of the Unicode Standard. Rather it attempts to provide a conceptual framework that makes it easier for users and implementers to identify and understand the specific conformance requirements contained in [Unicode].
Many of the concepts presented here are equally applicable to other standards developed by the Unicode Consortium, such as the Unicode Collation Algorithm [UCA] and the specifications for Unicode support in Regular Expressions [RegEx].
This section gives a basic introduction to the terminology that will be discussed in more detail in sections below.
In the context of a formal standard, conformance refers to a set of rules or criteria whereby a relevant entity (element of information interchange, device, application, piece of hardware, etc.) can be determined to either meet or not meet the specification in the standard. In general, a formal standard will have a conformance clause or clauses, which will be stated in terms of conditionals ("X is in conformance with Y specification of this standard if Z") or modals ("An X that conforms with Y specification of this standard SHALL Z"). The modal verbs that standards language generally associates with such statements may themselves be carefully defined, and typically involve specialized usage of "SHALL" and "MUST", to avoid any ambiguities of interpretation. If a standard is complex, the conformance clause or clauses themselves may also be complex. But on occasion, a conformance clause may simply be stated along the lines of "X is in conformance with this standard if it follows the specification in section W", where section W may consist of hundreds of pages and constitute most of the rest of the standard.
The term compliance is often used synonymously with the term conformance and will be used that way in this Model.
Formal standards often distinguish between normative and informative content. This distinction may be highly conventionalized, or even be subject to rules specified in other standards, as for ISO standards, or the distinction may be much less formally maintained.
Normative content of a standard is that which is required for all of the conformance requirements to be meaningful. Typically a standard will have normative definitions for terms used in the rest of the specification, will have normative references to other standards or sources whose content is referred to indirectly, and will have normative clauses, specifications, or sections, which actually define the content of the standard itself -- that which the conformance clauses apply to.
Informative content of a standard is that material which has been added for clarification, but which, in the judgment of the standard's maintainers, could in principle be omitted without materially affecting the specification which the conformance clauses refer to. If a standard is changed over time, the status of some particular content could change from informative to normative, or vice versa, depending on whether it was newly required for conformance or became no longer required for conformance.
In the context of the Unicode Conformance Model, conformance verification means an external (third party) determination that a particular relevant entity actually does meet one or more requirements of the conformance clauses of the standard. Thus while conformance is merely a logical statement of requirements, verified conformance is a state met when entity X is actually determined, under some specified set of circumstances, to meet the logical statement of requirements. While conformance clauses exist in the standard on their own, conformance verification implies the existence of conformance tests, applied to entities in order to make such determinations.
A standard may include tests or "benchmarks" as part of the text of the standard, or as external documents associated with the standard. Once again, while there is some overlap in general usage of the terms "conformance test" and "conformance verification tests", in the Unicode Conformance Model a systematic distinction is drawn between the two.
A conformance test for the Unicode Standard is a list of data certified by the Unicode Technical Committee [UTC] to be "correct" in regard to some particular requirement for conformance to the standard. In some instances, for example the implementation of the bidirectional algorithm, producing a definitive list of correct results is difficult or impossible. In such cases, a conformance test may itself consist of an implemented algorithm certified by the UTC to produce correct results for any pertinent input data. Conformance tests for the Unicode Standard are essentially benchmarks that someone can use to determine if their algorithm, API, etc., claiming to conform to some requirement of the standard, does in fact match the data that the UTC claims defines such conformance.
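As a rough illustration of how such a data list can be used, the sketch below checks an implementation against a few records written in the semicolon-separated layout of the normalization test file that accompanies the Unicode Character Database. The sample records, the helper names, and the choice of Python's `unicodedata` module as the implementation under test are assumptions made for this illustration only.

```python
import unicodedata

# Illustrative records in the layout "source;NFC;NFD;NFKC;NFKD;".
# Each field is a space-separated list of hexadecimal code points.
SAMPLE_RECORDS = [
    "1E0A;1E0A;0044 0307;1E0A;0044 0307;",
    "00C5;00C5;0041 030A;00C5;0041 030A;",
    "212B;00C5;0041 030A;00C5;0041 030A;",
    "FB01;FB01;FB01;0066 0069;0066 0069;",
]

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def parse_field(field):
    """Turn a field such as '0044 0307' into the string it denotes."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_record(record):
    """Return the normalization forms for which the implementation
    under test disagrees with the expected data in one record."""
    fields = [parse_field(f) for f in record.split(";")[:5]]
    source, expected = fields[0], fields[1:]
    return [
        form
        for form, want in zip(FORMS, expected)
        if unicodedata.normalize(form, source) != want
    ]


def run_conformance_test(records):
    """Map each failing record to the list of forms that failed."""
    failures = {}
    for record in records:
        failed = check_record(record)
        if failed:
            failures[record] = failed
    return failures
```

An empty result from `run_conformance_test(SAMPLE_RECORDS)` means only that the implementation matches these few records; a real check would run against the complete certified data.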
A conformance verification test for the Unicode Standard, on the other hand, is a test, usually designed and implemented by a third party not associated with the Unicode Standard or the UTC, intended to test a product which claims conformance to one or more aspects of the Unicode Standard, for actual conformance to the standard. Thus a conformance verification test is a test *of a product*. Such a test may, of course, make use of one or more of the Unicode conformance tests in order to determine the results of its verification of conformance.
The term support, in the context of the Unicode Conformance Model, refers to a more generalized claim of intent to conform to one or another requirement of the standard. A claim of Unicode support may in fact be difficult to verify, since it can be and often is vague in detail. But in principle, at least, it indicates that the developer or user of an entity intends conformance. More specifically, support often refers to a claim of particular repertoire coverage. For example, an application may claim support for Unicode Greek. That should be interpreted as meaning that Unicode Greek characters will be handled in conformance with the standard, and furthermore that all other relevant aspects of processing of those characters which that particular application is concerned with, will also be done in such a way as not to violate conformance clauses of the standard.
Some formal standards are developed once and then are essentially frozen and stable forever. For such standards, stability of content and the corresponding stability of conformance claims is not an issue.
For a large, complex standard aimed at the universal encoding of characters, such as the Unicode Standard, such stability is not possible. The standard is necessarily evolving and expanding over time, to extend its coverage of all the writing systems of the world. And as experience in its implementation accumulates, further aspects of character processing also accrue to the formal content of the standard. This fundamentally dynamic quality of the Unicode Standard complicates issues of conformance, since the content to which conformance requirements pertain continually expands, both horizontally to more characters and scripts, and vertically to more aspects of character processing.
Invariance refers to those aspects of the content of the Unicode Standard that have been determined to be unchangeable, even as the standard continues its dynamic development. A fairly trivial example can be seen in the guarantee of the stability of the formal Unicode character names. While in principle such names could be changed, and in very early versions of the standard were changed (between Version 1.0 and Version 1.1, for example), the [UTC] has determined that such changes are too disruptive and have too little benefit to be tolerated. Accordingly, the stability of character names has been promoted to the status of an invariant in the standard.
A further discussion of invariance and invariants can be found in [Property Model]. Invariants guard against change for the sake of change, or technological drift, but they also prevent the correction of clerical errors. In a standard as large and complex as the Unicode Standard, that is not a negligible issue. For a discussion of the tradeoffs and the current list of invariants see [Stability].
Conformance claims need to be distinguished in terms of their relationship to invariants and non-invariants in the standard, because of their different risk levels for stability.
The Unicode Standard is regularly versioned, as new characters are added. A formal system of versioning is in place, involving three levels of versions (major, minor, and update), all with carefully controlled rules for the type of documentation required, handling of the associated data files, and allowable types of change between versions. For more information about the details of Unicode versioning see [Versions]. Other standards developed by the Unicode Consortium may use a single-level versioning scheme.
Conformance claims clearly must be specific to versions of the Unicode Standard, but the level of specificity needed for a claim may vary according to the nature of the particular conformance claim being made.
If a technical deficiency in the specifications of the Unicode Standard is identified, it may be corrected by a change in the next version, or, if sufficiently important, by a formal corrigendum. A corrigendum often applies to several earlier versions. Implementations can claim conformance to any of these versions with the given corrigendum applied. For more on corrigenda see [Versions].
Errata are used to describe other known defects in the text. Unlike corrigenda they cannot be referenced in a conformance claim. For more information on errata see [Errata].
This section will serve as a guide to unraveling the particular way that the Unicode Standard expresses conformance requirements, both in terms of where they are located and how they are expressed. It also explores the peculiar aspects of conformance related to the synchronized status of the Unicode Standard and the independent but closely aligned International Standard ISO/IEC 10646, which has its own conformance clauses expressed using ISO conventions.
Chapter 3 of [Unicode] contains formal definitions of terms referenced in the conformance clauses. While modifications of these definitions between versions of the Unicode Standard have been, and will continue to be necessary, the numbering of the definitions is kept reasonably stable.
The conformance clauses in chapter 3 of [Unicode] define the requirements for a conformant implementation. They are expressed in terms of the definitions, but also refer to additional specifications contained in Unicode Standard Annexes. While modifications of these clauses between versions of the Unicode Standard have been, and will continue to be necessary, the numbering of the clauses is kept reasonably stable.
A Unicode Standard Annex (UAX) contains part of the standard, published as a standalone document. The relation between conformance to the Unicode Standard and Unicode Standard Annexes is spelled out in detail in <Section 3.2> of [Unicode].
Unicode algorithms are specified as a series of logical steps. In many cases, the input to the algorithm is a string of character properties. In other words, the results of the algorithm are identical for different input strings, as long as each input string maps to the same string of character property values. Conformance to a Unicode algorithm does not require repeating the steps as described, but rather achieving the same outputs for the same inputs. This provides the necessary flexibility for implementations to pursue optimizations. Whether or not conformance to a given algorithm is required by Unicode Conformance, implementations claiming to implement one of these algorithms must do so in conformance with its specifications.
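The idea that only the outputs matter, not the steps, can be sketched with normalization. In the example below, Python's `unicodedata.normalize` stands in for a reference implementation, and `fast_nfc` is a hypothetical optimized implementation that skips the algorithm entirely for ASCII-only text, which is always already in NFC. Both function names and the sample list are assumptions of this sketch.

```python
import unicodedata


def reference_nfc(text):
    """Stand-in for a reference implementation of NFC."""
    return unicodedata.normalize("NFC", text)


def fast_nfc(text):
    """Hypothetical optimized implementation.

    ASCII-only text contains no characters that decompose or compose,
    so it is returned unchanged without running the general algorithm.
    """
    if text.isascii():
        return text
    return unicodedata.normalize("NFC", text)


def same_outputs(samples):
    """True if both implementations agree on every sample input."""
    return all(fast_nfc(s) == reference_nfc(s) for s in samples)


SAMPLES = ["", "plain ASCII", "A\u030A", "\u212B", "cafe\u0301"]
```

Agreement on a handful of samples illustrates the principle but does not establish conformance, which requires matching outputs for all inputs.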
Some algorithms provide explicit methods for tailoring, or customizing a general algorithm to the needs of a specific language, locality or application. Other algorithms simply describe the best default practice, and customization is assumed for any practical application (an example of this is the line breaking algorithm in [LineBreak]). Whether or not conformance to a given algorithm is required by Unicode Conformance, implementations claiming to implement one of these algorithms must disclose the use of tailoring or customization.
For a detailed description see <Appendix D> of [Unicode].
There are several broad areas of application where Unicode Conformance makes specific types of requirements. Since not all applications and implementations cover all these areas, some aspects of Unicode Conformance may not be applicable to them from the start.
Representation covers all aspects of being able to express and transmit Unicode data. It is a requirement applicable to certain protocols (e.g. XML), but might apply to the storage aspects of databases and other file formats as well. Conformant representation applies to correct use of encoding forms and encoding schemes as well as the ability to represent all Unicode code points. In addition, issues related to Normalization [Normal] are important.
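As a small sketch of what correct use of encoding forms involves, the following example expresses the same text in UTF-8, UTF-16 and UTF-32 using Python's built-in codecs, checks that each decodes back to the same code points, and shows that an isolated surrogate code point is rejected. The function names are assumptions made for this illustration.

```python
CODECS = ("utf-8", "utf-16-be", "utf-32-be")


def encoding_forms(text):
    """Encode the same text in each codec (big-endian byte order)."""
    return {codec: text.encode(codec) for codec in CODECS}


def round_trips(text):
    """True if every encoding decodes back to the original text."""
    return all(
        data.decode(codec) == text
        for codec, data in encoding_forms(text).items()
    )


def is_encodable(text):
    """False if the text contains something, such as an isolated
    surrogate code point, that the UTF-8 codec refuses to encode."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# U+10400 lies outside the Basic Multilingual Plane, so it needs four
# bytes in UTF-8 and a surrogate pair in UTF-16.
forms = encoding_forms("\U00010400")
```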
Conformant transcoding between Unicode and all other, so-called legacy, character encodings retains the identity of the transcoded characters. In addition, concerns related to Normalization are important.
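A minimal sketch of both points, using ISO 8859-7 (Greek) as the legacy encoding through Python's `iso8859_7` codec. The helper names are assumptions of this example. The last function shows why normalization matters: a decomposed sequence cannot be encoded in the legacy set directly, but its precomposed NFC form can.

```python
import unicodedata

LEGACY = "iso8859_7"  # ISO 8859-7, chosen only as an example


def to_unicode(data):
    """Transcode legacy bytes to a Unicode string."""
    return data.decode(LEGACY)


def to_legacy(text):
    """Transcode a Unicode string to legacy bytes."""
    return text.encode(LEGACY)


def identity_preserved(data):
    """True if legacy -> Unicode -> legacy returns the same bytes."""
    return to_legacy(to_unicode(data)) == data


def to_legacy_normalized(text):
    """Normalize to NFC first, so that a sequence such as
    U+03B1 U+0301 becomes the precomposed U+03AC, which the legacy
    set can represent."""
    return to_legacy(unicodedata.normalize("NFC", text))
```

For instance, `to_legacy("\u03b1\u0301")` raises `UnicodeEncodeError` because U+0301 has no counterpart in ISO 8859-7, while `to_legacy_normalized` succeeds on the same input.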
String processing generically covers all operations on Unicode texts that can be carried out without considering layout and specifically not considering fonts. String processing encompasses a large variety of operations, including, but not limited to, text segmentation, text parsing, handling regular expressions, searching, sorting, and creating formatted text representations of data types. For a number of these operations, model algorithms and other specifications exist to which an implementation may claim conformance, such as [UCA], [RegEx], [Boundaries], and [LineBreak].
Layout comprises all operations that go from backing store to displayed text (and the reverse, for selection). These operations are dependent on font data, but are considered separately since the same implementation typically can work with a range of different fonts.
The Unicode Standard does not standardize the actual appearance of characters, but instead intends that they should be depicted within a customary range of design interpretations. Conformance to the Unicode Standard then primarily refers to those tables in the fonts that correlate character codes with the glyphs in the font, for example CMAP tables, and to claims of "coverage" of Unicode repertoire by fonts.
Input raises issues of coverage of Unicode repertoire, conversion of input to Unicode character values for storage, and consistency with the text models required for particular scripts and text layout. The entities here are mostly IMEs and keyboards (drivers).
[ The following content is just sketched out in outline form. ]
This subsection will discuss the implications of the lack of a conformance requirement to support a minimum subset of the Unicode repertoire.
Full conformance is not necessarily the same as full support, as conformance requirements in many cases are minimal requirements. Exceptions are certain well-defined areas, such as encoding forms or normalization, that have few or no options and few or no levels.
This section will provide both a typology for levels of conformance (i.e., an alternative to the notion that all aspects of Unicode conformance are either/or issues), and specific lists of levels of conformance and support where they can be pulled out of the standard. For example, the standard explicitly talks about levels of surrogate support -- that should be abstracted, along with others, to provide the basis for determining how to make various claims of conformance.
This section could describe best practices of deciding levels of conformance or it could describe how conformance requirements relate to best practices in a given area.
[ The following content is just sketched out in outline form. ]
Conformant implementations will have to interact with both downlevel and uplevel implementations. This creates particular issues. The Unicode conformance requirements are structured to encourage implementations to passively support data containing characters assigned in future versions of the standard.
It is generally not helpful to tag data created by an implementation with the version level of Unicode supported by that implementation. This is because the repertoire of that version of Unicode is far larger than the actual set of characters used in the data. In fact, a large part of text data created and interchanged worldwide can be represented in all versions of Unicode. Therefore, the version level of the implementation bears little relation to the repertoire needed to cover the data.
Most implementations will not equally support the entire repertoire of Unicode characters for a given version. In fact, there is no conformance requirement to support any specific part of the repertoire. Therefore, even if the version level of a receiving implementation is higher than that of the creating implementation there is no guarantee that both support the repertoire covered by the data, or support it equally well.
[Unicode] defines no method for enumerating or identifying common sub-repertoires of the standard, but ISO/IEC 10646 does so. Implementations can use the [DerivedAge] for each character code to avoid sending a downlevel system character codes that it cannot know. Since character coding is strictly additive, implementations receiving data can easily identify characters that are not defined in the version of the standard to which they conform and take appropriate action. In many cases, appropriate action consists of passing through such data, or treating them as characters possessing default properties. (See UTR #23: Unicode Character Property Model [Property Model] for more details on default properties.)
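The receiving side of this can be sketched as follows. The example treats a code point as unknown to the implementation when Python's `unicodedata` module, which is tied to one specific version of the Unicode Character Database (`unicodedata.unidata_version`), reports its General_Category as `Cn`. That test is an assumption of this sketch: `Cn` also covers noncharacters, and the function names are illustrative. The sample code point U+0378 is assumed to be unassigned in the bundled database version.

```python
import unicodedata


def unknown_code_points(text):
    """List code points the bundled database has no assignment for."""
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) == "Cn"
    ]


def receive(text):
    """Pass the data through unchanged and report which code points
    were not recognized, rather than rejecting or altering them."""
    return text, unknown_code_points(text)


incoming = "a\u0378b"  # U+0378 assumed unassigned in this version
passed_through, unknown = receive(incoming)
```

The sending side, which would consult the age of each character before transmitting to a downlevel system, is not shown because the standard library does not expose that property.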
A mere matching of version numbers between an implementation and components it relies on will not be sufficient, because components may subset the repertoire they support or choose a different level of conformance, where available.
|[Boundaries]||UAX #29: Text Boundaries|
|[Charts]||The online code charts can be found at http://www.unicode.org/charts/ An index to character names with links to the corresponding chart is found at http://www.unicode.org/charts/charindex.html|
|[Errata]||Updates and errata to the Unicode Standard, as well as other technical standards developed by the Unicode Consortium can be found at http://www.unicode.org/errata|
|[Feedback]||Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html|
|[FAQ]||Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues.|
|[Glossary]||Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.|
|[LineBreak]||UAX #14: Line Breaking Properties|
|[Normal]||UAX #15: Normalization Forms|
|[Property Model]||Unicode Technical Report #23, The Unicode Character Property Model, http://www.unicode.org/reports/tr23/|
|[RegEx]||UTS #18: Unicode Regular Expressions|
|[Reports]||Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.|
|[Stability]||Unicode Stability Policy for Character Encoding and Character Properties http://www.unicode.org/standard/stability_policy.html|
|[UCA]||UTS #10: Unicode Collation Algorithm|
|[UCD]||Unicode Character Database. http://www.unicode.org/ucd For an overview of the Unicode Character Database and a list of its associated files.|
|[Unicode]||The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1.|
|[UTC]||The Unicode Technical Committee, see http://www.unicode.org/consortium/utc.html for more information on procedures etc.|
|[Versions]||Versions of the Unicode Standard http://www.unicode.org/standard/versions For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.|
|Working Draft 3||Third Working Draft, adding material to sections 5 and 6 [AF]|
|Working Draft 2||Second Working Draft, incorporating feedback from the editorial committee meeting on 4/20/04 [AF]|
|Working Draft 1||Initial Working Draft, based on Document L2/165 [KW], with added boilerplate and TR formatting and some additional edits, including a number of new or expanded subsections [AF]|
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.