
Corrigendum #9: Clarification About Noncharacters

 

Corrigendum: Corrigendum #9: Clarification About Noncharacters
Effective Date: 2013-Jan-30 [134-C15]
Applicable Versions: 3.1.0 to 6.3.0
Fixed Version: 7.0.0, 2014-June
Result Documented In: Chapter 3, Conformance

Background

The formal wording of the definition of noncharacter in the standard has led some implementers to interpret any presence of a noncharacter code point in a Unicode string as causing that string to be ill-formed, and thereby has led to inappropriate over-rejection of some Unicode strings in APIs, components, or applications that should handle (i.e., either process or pass through) all well-formed Unicode strings.

Noncharacters in the Unicode Standard are intended for internal use and have no standard interpretation when exchanged outside the context of internal use. However, they are not illegal in interchange nor do they cause ill-formed Unicode text. This has always been the intent of the standard, as expressed by the Unicode Technical Committee. This is necessary for the effective use of noncharacters, because anytime a Unicode string crosses an API boundary, it is in effect being "interchanged". Furthermore, for distributed software, it is often very difficult to determine what constitutes an "internal" versus an "external" context for any particular software process. The real intent of noncharacters is that they are permanently prohibited from being assigned standard, interchangeable meanings, rather than that they are prohibited from occurring in Unicode strings which happen to be interchanged.
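As an illustration of the point that noncharacters do not make text ill-formed, the following minimal sketch checks whether a string survives an encode/decode cycle through Python's built-in codecs. The helper name `round_trips` and the sample strings are assumptions made for this example and are not part of the corrigendum. A lone surrogate is included only as a contrast, since it cannot be encoded in a well-formed UTF-8 sequence.

```python
def round_trips(text: str, encoding: str = "utf-8") -> bool:
    """Return True if text survives an encode/decode cycle unchanged.

    Hypothetical helper for illustration only.
    """
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # Raised when the string cannot be encoded or decoded,
        # for example when it contains a lone surrogate.
        return False


# Sample string containing two noncharacters: U+FFFF and U+10FFFE.
with_noncharacters = "abc\uFFFFdef\U0010FFFE"

# Contrast case: a lone surrogate code point, which the strict codec rejects.
with_lone_surrogate = "abc\ud800def"

if __name__ == "__main__":
    print(round_trips(with_noncharacters))   # noncharacters are encodable
    print(round_trips(with_lone_surrogate))  # lone surrogate is rejected
```

A component that merely passes strings through can therefore leave noncharacters in place and rely on the codec to reject genuinely ill-formed input.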

Corrigendum #9 provides a means for implementations that openly interchange noncharacters to claim conformance to versions of the standard in which Definition D14 nominally prohibits such interchange. This corrigendum does not affect the fact that when so interchanged, the intended semantics of noncharacters may not be interpretable.

Changes to the Content of the Core Specification

Change D14 in Section 3.4, Characters and Encoding, as indicated:

Noncharacter: A code point that is permanently reserved for internal use [deleted: and that should never be interchanged]. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.

Note that in Unicode 3.1.0 through Unicode 4.1.0, the definition in question was labeled D7b, instead of D14.
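The set of code points named in this definition can be tested with a short predicate. The sketch below is one possible formulation, not text from the standard: the names `is_noncharacter` and `NONCHARACTERS` are assumptions for this example. It treats n as ranging over the 17 planes (0 to 10 hexadecimal) and checks the last two code points of each plane together with the contiguous range U+FDD0..U+FDEF.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the noncharacter code points.

    Hypothetical helper: covers U+FDD0..U+FDEF plus U+nFFFE and U+nFFFF
    for every plane n from 0 to 0x10.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    # The low 16 bits are FFFE or FFFF exactly when bits 1..15 are all set.
    ends_plane = (cp & 0xFFFE) == 0xFFFE
    return ends_plane or 0xFDD0 <= cp <= 0xFDEF


# Enumerate every noncharacter in the code space.
NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]

if __name__ == "__main__":
    # 32 code points in U+FDD0..U+FDEF plus 2 per plane across 17 planes.
    print(len(NONCHARACTERS))
    print(", ".join(f"U+{cp:04X}" for cp in NONCHARACTERS[:4]))
```

Under the clarified definition, a check like this is suited to deciding how an application interprets such code points internally, rather than to rejecting strings that contain them.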

There is associated informative text in the Core Specification concerning noncharacters. That text will also be clarified when the text of this corrigendum is applied in a future revision of the Core Specification.

