
Unicode 3.0.1

Version 3.0.1 has been superseded by the latest version of the Unicode Standard.

Version 3.0.1 of the Unicode Standard consists of the core specification, The Unicode Standard Version 3.0, as modified by the textual updates specified on this page, the code charts for this version (currently only available in hard copy), the Unicode Standard Annexes, and the 3.0.1 Update of the Unicode Character Database (UCD).

A complete specification of the contributory files for Unicode 3.0.1 is found on the page Components for 3.0.1. That page also provides the recommended reference format for this version of the Unicode Standard.


Overview

Unicode 3.0.1 is an update version. It does not contain character additions or major normative changes. See also Corrigendum #1 for an important corrigendum applicable to Unicode 3.0.1 which modifies the conformance requirements for UTF-8.

Unicode Standard Annexes and Unicode Technical Standards

There are two new categories of approved Unicode Technical Reports that have been given more authoritative status by the Unicode Technical Committee: Unicode Standard Annex (UAX) and Unicode Technical Standard (UTS). For more information, see About Unicode Technical Reports.

Several of the Unicode Standard Annexes have also been updated in this version of Unicode. Of particular interest is the conformance test for normalization.

Unicode Character Database

Three new data files have been added to the Unicode 3.0.1 release:

BidiMirroring.txt (UAX #9: The Bidirectional Algorithm)

  • Informative properties for substituting characters in an implementation of bidirectional mirroring.

CaseFolding.txt (UTR #21: Case Mappings)

  • Informative file mapping characters to their case-folded form.

NormalizationTest.txt (UAX #15: Unicode Normalization Forms)

  • Normative test file for conformance to Unicode Normalization Forms.
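As an illustration of how the normalization test file might be used, the following Python sketch checks a single data line. It assumes the five-column layout (c1 through c5, semicolon-separated, with a trailing `#` comment) and the invariants documented in the header of NormalizationTest.txt; the helper names `parse_field` and `check_line` are hypothetical, and the results depend on the Unicode version bundled with the Python `unicodedata` module.

```python
import unicodedata


def parse_field(field: str) -> str:
    """Turn a field such as '0044 0307' into the string it denotes."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_line(line: str) -> bool:
    """Check the normalization invariants for one data line (assumed layout)."""
    data = line.split("#", 1)[0]  # drop the trailing comment, if any
    columns = [parse_field(f) for f in data.split(";")[:5]]
    c1, c2, c3, c4, c5 = columns

    def n(form: str, s: str) -> str:
        return unicodedata.normalize(form, s)

    return (
        all(n("NFC", c) == c2 for c in (c1, c2, c3))
        and all(n("NFC", c) == c4 for c in (c4, c5))
        and all(n("NFD", c) == c3 for c in (c1, c2, c3))
        and all(n("NFD", c) == c5 for c in (c4, c5))
        and all(n("NFKC", c) == c4 for c in columns)
        and all(n("NFKD", c) == c5 for c in columns)
    )


# A sample line in the assumed format (LATIN CAPITAL LETTER D WITH DOT ABOVE).
sample = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # sample line"
print(check_line(sample))
```

A full conformance run would apply the same check to every data line of the file and skip comment lines and part headers.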

Stability Policies

In each new version of the Unicode Standard, the Unicode Consortium may add characters or make certain changes in characters that were encoded in a previous version of the standard. To minimize the impact on existing implementations, however, there are limitations imposed by the consortium on the types of changes that can be made. See Unicode Character Encoding Stability Policy for more information.

Changes to the Text of Version 3.0

The following describes the textual updates that have been made. All references to sections and page numbers are to The Unicode Standard, Version 3.0. For detailed changes, struck-through indicates deleted text; underline indicates added text.

Notation

Section 0.2, Notational Conventions, page xxviii: change the description of the U+ notation to read:

In running text, an individual Unicode code point can be expressed as U+n, where n is from four to six hexadecimal digits, using the digits 0-9 and A-F (for 10 through 15, respectively). There should be no leading zeros, unless the codepoint would have fewer than four hexadecimal digits; for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
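As a rough illustration of this notation, the following Python sketch formats an integer code point in the U+n form. The function name `format_code_point` is hypothetical and chosen only for this example.

```python
def format_code_point(cp: int) -> str:
    """Format a code point as U+n with four to six hexadecimal digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    # Pad to at least four digits; larger values get no leading zeros.
    return f"U+{cp:04X}"


# The example values given in the text above.
for value in (0x1, 0x12, 0x123, 0x1234, 0x12345, 0x102345):
    print(format_code_point(value))
```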

Noncharacters

There are 34 distinguished code points in Unicode that are characterized as noncharacters. To clarify the status of all 34, a definition (page 41) is added, and conformance rules C5 and C10 (pages 38, 39) are amended as follows:

D7b Noncharacter: a code point that is permanently reserved for internal use, and that should never be interchanged. In Unicode 3.0, these consist of the values U+nFFFE and U+nFFFF, where n is from 0 to 10 hexadecimal (0 to 16 decimal).
  • For more information, see the discussions under "Special Noncharacter Values" in Section 2.7 Special Character and Noncharacter Values, and under "Noncharacters" in Section 13.6 Specials.
  • These code points are permanently reserved as noncharacters. In the future, it is possible that additional code points may be specified to represent noncharacters.
C5 A process shall not interpret a noncharacter code point as an abstract character. (The amended rule replaces "either U+FFFE or U+FFFF" with "a noncharacter code point".)
  • The code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.
C10 A process shall make no change in a valid coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points, if that process purports not to modify the interpretation of that coded character sequence.
  • If a noncharacter which does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or delete or ignore the noncharacter. If these options are not taken, the noncharacter should be treated as an unassigned code point.
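The definition and rules above can be sketched in Python as follows. This is a minimal illustration covering only the 34 noncharacters described here (U+nFFFE and U+nFFFF in each of the 17 planes); it does not account for any noncharacters that other versions of the standard may designate. The function names are hypothetical.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+nFFFE and U+nFFFF, where n ranges over planes 0 to 0x10."""
    # The low 16 bits are FFFE or FFFF exactly when bits 1..15 are all set.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def strip_noncharacters(text: str) -> str:
    """One of the options C10 permits: delete noncharacter code points."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 34
```

Deleting is only one of the permitted reactions; an implementation could equally signal an error or treat the code point as unassigned.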

Unassigned Code Points

Section 5.3, Unknown and Missing Characters: Unassigned and Private Use Character Codes, pages 108-109: add the following to the end of the subsection:

In practice, applications must deal with unassigned code points or unknown private use characters. This may occur, for example, when the application is handling text that originated on a system implementing a later release of Unicode, with additional assigned characters. To work properly in implementations, unassigned code points must be given default properties as if they were characters, since various algorithms require properties to be assigned to every character in order to function at all. These properties are not uniform across all unassigned code points, since certain ranges of code points need different properties to maximize compatibility.

Normally, code points outside the repertoire of supported characters would be displayed with a fall-back glyph, such as a black box. However, format and control characters must not have visible glyphs (although they may have an effect on other characters in display). These characters are also ignored except with respect to specific, defined processes: for example, ZERO WIDTH NON-JOINER is ignored in collation. To allow a greater degree of compatibility across versions of the standard, the ranges U+2060..U+2069 and U+E0000..U+E1000 are reserved for future format and control characters (General Category = Cf). Unassigned code points in these ranges should be ignored in processing and display.

The Unicode Bidirectional Algorithm assigns a Bidirectional Category to unassigned code points based on the expected direction of characters to be added in the future. For more information, see Bidirectional Character Types in UAX #9: The Bidirectional Algorithm.

UAX #14: Line Breaking Properties supplies the property "XX" for all unassigned code points in Definitions.

In determining character widths for East Asian display, UAX #11: East Asian Width includes a section on Unassigned and Private Use characters.

In normalization, unassigned code points are given the Canonical Combining Class = 0, and no decomposition mapping.
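The normalization defaults can be observed with Python's standard `unicodedata` module, as in the sketch below. Which code points count as unassigned depends on the Unicode version bundled with the interpreter (see `unicodedata.unidata_version`), so the example searches for one rather than assuming a fixed value; the helper name `first_unassigned` is hypothetical.

```python
import unicodedata


def first_unassigned() -> int:
    """Return the lowest code point with General Category Cn in this build."""
    for cp in range(0x110000):
        if unicodedata.category(chr(cp)) == "Cn":
            return cp
    raise LookupError("no unassigned code point found")


cp = first_unassigned()
ch = chr(cp)

# Canonical Combining Class defaults to 0 and there is no decomposition,
# so every normalization form leaves the code point unchanged.
print(f"U+{cp:04X}", unicodedata.combining(ch), repr(unicodedata.decomposition(ch)))
print(all(unicodedata.normalize(form, ch) == ch
          for form in ("NFC", "NFD", "NFKC", "NFKD")))
```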

Identifiers

Section 5.16, Identifiers: Specific Character Additions, page 134: the subsection name is changed to Specific Character Adjustments, and the following note is added:

Note: a useful set of characters to consider for exclusion from identifiers consists of all characters whose compatibility mappings have a <font> tag.
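One way to find such characters, sketched here with Python's standard `unicodedata` module, is to inspect the decomposition field for the `<font>` tag. The function name `has_font_mapping` is hypothetical, and the result reflects the Unicode version bundled with the interpreter.

```python
import unicodedata


def has_font_mapping(ch: str) -> bool:
    """True if the character's compatibility mapping carries a <font> tag."""
    return unicodedata.decomposition(ch).startswith("<font>")


print(has_font_mapping("\u2102"))  # DOUBLE-STRUCK CAPITAL C
print(has_font_mapping("C"))       # plain LATIN CAPITAL LETTER C
```

An identifier parser could use a predicate like this as an exclusion filter on top of its usual start and continue rules.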

Numeric Separators

Section 6.1, General Punctuation, Punctuation: U+0020-U+00BF, page 149: the following note is added:

Note: any of the characters U+002E, U+002C, U+060C, U+066B, or U+066C (and possibly others) can be used as numeric separator characters, depending on the locale and user customizations.

Controlling Ligatures

In some orthographies the same letters may either ligate or not, depending on the intended reading. To account for this, the semantics of the ZWNJ and ZWJ have been extended.

Section 13.2 Controlling Ligatures, page 318: the text is superseded by the following:

To allow for finer control over ligature formation, in Unicode 3.0.1 the definitions of the following characters have been broadened to cover ligatures as well as cursive connection:

zero width non-joiner U+200C ZERO WIDTH NON-JOINER

  • The intended semantic is to break both cursive connections and ligatures in rendering.

zero width joiner U+200D ZERO WIDTH JOINER

  • The intended semantic is to produce a more connected rendering of adjacent characters than would otherwise be the case, if possible. In particular:
    1. If the two characters could form a ligature, but do not normally, ZWJ requests that the ligature be used.
    2. Otherwise, if either of the characters could cursively connect, but do not normally, ZWJ requests that each of the characters take a cursive-connection form where possible.
      • In particular, if a character X on one side has a cursive form, and the other character Y does not, ZWJ requests that X take a cursive form.
    3. Otherwise, where neither a ligature nor a cursive connection is available, the ZWJ has no effect.

In other words, given the three broad categories below, ZWJ requests that glyphs in the highest available category (for the given font) be used; ZWNJ requests that glyphs in the lowest available category (for the given font) be used:

  1. unconnected
  2. cursively connected
  3. ligated
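The highest/lowest rule above can be modeled in a few lines of Python. This is a simplified sketch, not a rendering algorithm: it assumes the caller already knows which categories the font offers for a given pair of characters, and the names `Joining` and `requested_joining` are hypothetical.

```python
from enum import IntEnum
from typing import Iterable

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


class Joining(IntEnum):
    """The three broad categories, ordered from least to most connected."""
    UNCONNECTED = 1
    CURSIVE = 2
    LIGATED = 3


def requested_joining(available: Iterable[Joining], control: str) -> Joining:
    """Return the category that ZWJ or ZWNJ requests for a character pair."""
    options = set(available)
    if not options:
        raise ValueError("at least one category must be available")
    if control == ZWJ:
        return max(options)  # highest available category
    if control == ZWNJ:
        return min(options)  # lowest available category
    raise ValueError("control must be ZWJ or ZWNJ")


everything = {Joining.UNCONNECTED, Joining.CURSIVE, Joining.LIGATED}
print(requested_joining(everything, ZWJ).name)   # LIGATED
print(requested_joining(everything, ZWNJ).name)  # UNCONNECTED
```

When only the unconnected form is available, both controls return it, which matches the case where ZWJ has no effect.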

For those unusual circumstances where someone wants to forbid ligatures in a sequence XY, but promote cursive connection, the sequence X<zwj><zwnj><zwj>Y can be used. The <zwnj> breaks ligatures, while the two adjacent joiners cause the X and Y to take adjacent cursive forms (where they exist). Similarly, if someone wanted to have X take a cursive form but Y be isolated, then the sequence X<zwj><zwnj>Y could be used (as in previous versions of Unicode).
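The two sequences just described can be built directly from the control characters. The following Python sketch uses ARABIC LETTER LAM (U+0644) and ARABIC LETTER ALEF (U+0627), a pair that normally ligates, purely as sample input; the function names are hypothetical.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def cursive_without_ligature(x: str, y: str) -> str:
    """X<zwj><zwnj><zwj>Y: forbid the ligature but promote cursive forms."""
    return x + ZWJ + ZWNJ + ZWJ + y


def cursive_then_isolated(x: str, y: str) -> str:
    """X<zwj><zwnj>Y: X takes a cursive form while Y stays isolated."""
    return x + ZWJ + ZWNJ + y


def show(text: str) -> str:
    """List the code points of a string in U+ notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


lam, alef = "\u0644", "\u0627"
print(show(cursive_without_ligature(lam, alef)))
print(show(cursive_then_isolated(lam, alef)))
```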

Examples

The following provides samples of desired renderings when the joiner or non-joiner is inserted between two characters. In the Arabic examples, the characters on the left side are in visual order already, but have not yet been shaped.

[Figure: Sample Display Actions, showing Latin and Arabic ligatures]

Implementation Notes

For modern font technologies, such as OpenType or AAT, font vendors should add ZWJ to their ligature mapping tables as appropriate. Thus, where a font had a mapping from "f" + "i" to the fi ligature, the font designer should add the additional mapping from "f" + ZWJ + "i" to the same ligature. On the other hand, ZWNJ will normally have the desired effect naturally for most fonts without any change, since it simply obstructs the normal ligature/cursive connection behavior. As with all other alternate format characters, fonts should use an invisible zero-width glyph for representation of both ZWJ and ZWNJ.
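The table extension described above can be pictured with a toy ligature table in Python. This is only a conceptual sketch: real OpenType and AAT tables are structured quite differently, and the dictionary layout, the glyph name `fi_ligature`, and the function `add_zwj_ligatures` are all assumptions made for illustration.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER


def add_zwj_ligatures(table: dict) -> dict:
    """Return a copy of the table with a ZWJ-joined entry for each ligature."""
    extended = dict(table)
    for sequence, glyph in table.items():
        # "fi" becomes "f" + ZWJ + "i", mapped to the same ligature glyph.
        extended.setdefault(ZWJ.join(sequence), glyph)
    return extended


ligatures = {"fi": "fi_ligature"}  # hypothetical input sequence -> glyph name
extended = add_zwj_ligatures(ligatures)
print(sorted(extended.items()))
```

No entries are needed for ZWNJ in this model, since an intervening ZWNJ simply prevents the existing "f" + "i" entry from matching.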

Current Arabic shaping algorithms should need no change; optional ligatures just would not be promoted by ZWJ, but current text should not be affected. The reason is that the current use of ZWJ between characters that normally cursively connect was redundant in previous versions of Unicode and should occur in very few instances. (As a matter of fact, with bad implementations of ZWJ or with unsupported ZWJ, the cursive connection would actually be broken.)