L2/99-191

1999-06-11

Author: Mark Davis

For addition to the Unicode Standard:

Unassigned Characters

In practice, applications must deal with unassigned code points. This may occur, for example, when the application is handling text that originated on a system implementing a later release of Unicode with additional assigned characters. To work properly in implementations, unassigned code points must be given default properties as if they were characters, since various algorithms require properties to be assigned to every character in order to function at all. These properties are not uniform across all unassigned code points, since certain ranges of code points need different properties to maximize compatibility.

The Unicode Bidirectional Algorithm assigns directional properties based on the expected direction of characters to be added in the future. All unassigned code points in Hebrew, Arabic, Thaana, and Syriac blocks are given the bidirectional property R (right-to-left). These are the ranges 0590-05FF, FB1D-FB4F, 0600-07BF, FB50-FDFF, and FE70-FEFF. All other unassigned code points are given the bidirectional property L (left-to-right).
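
As an illustration only, the default assignment described above could be sketched as a simple range lookup. The names RTL_DEFAULT_RANGES and default_bidi_class are hypothetical and not part of the proposal, and the sketch assumes the caller has already determined that the code point is unassigned.

```python
# Ranges copied from the paragraph above (inclusive, hexadecimal).
# The table and function names are illustrative assumptions.
RTL_DEFAULT_RANGES = (
    (0x0590, 0x05FF),
    (0xFB1D, 0xFB4F),
    (0x0600, 0x07BF),
    (0xFB50, 0xFDFF),
    (0xFE70, 0xFEFF),
)


def default_bidi_class(code_point: int) -> str:
    """Return the default bidirectional property for an unassigned code point.

    Assumes the caller has already checked that the code point is unassigned;
    assigned characters take their property from the character database.
    """
    for first, last in RTL_DEFAULT_RANGES:
        if first <= code_point <= last:
            return "R"  # right-to-left default inside the listed ranges
    return "L"  # left-to-right default everywhere else
```

An implementation would consult its character property data first and fall back to a lookup of this kind only for code points that data does not cover.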

Normally, code points outside the repertoire of supported characters would be displayed with a fall-back glyph, such as a black box. However, format and control characters must not have visible glyphs (although they may have an effect on other characters in display). These characters are also ignored except with respect to specific, defined processes: for example, ZERO WIDTH NON-JOINER is ignored in collation. To allow a greater degree of compatibility across versions of the standard, the ranges 2060-2069 and 000E0000-000E1000 are reserved for future format and control characters. Unassigned code points in these ranges should be ignored in processing and display.
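
A minimal sketch of how a display or processing step might apply this rule follows. The names RESERVED_FORMAT_RANGES, in_reserved_format_range, and drop_ignorable are hypothetical, and the is_assigned predicate is an assumed stand-in for whatever character data the implementation supports.

```python
from typing import Callable, Iterable, List

# Reserved ranges copied from the paragraph above (inclusive).
# The table and function names are illustrative assumptions.
RESERVED_FORMAT_RANGES = (
    (0x2060, 0x2069),
    (0xE0000, 0xE1000),
)


def in_reserved_format_range(code_point: int) -> bool:
    """True if the code point lies in a range reserved for future format/control characters."""
    return any(first <= code_point <= last for first, last in RESERVED_FORMAT_RANGES)


def drop_ignorable(
    code_points: Iterable[int],
    is_assigned: Callable[[int], bool],
) -> List[int]:
    """Remove unassigned code points that fall in the reserved ranges.

    Assigned characters are always kept. Unassigned code points outside the
    reserved ranges are also kept, so that a later stage can show them with
    a fall-back glyph.
    """
    return [
        cp
        for cp in code_points
        if is_assigned(cp) or not in_reserved_format_range(cp)
    ]
```

In this sketch an unassigned code point such as 2065 is dropped, while an unassigned code point outside the reserved ranges passes through and would receive the fall-back glyph.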

For addition to ISO/IEC 10646:

Part 1, end of clause 8 (Basic Multilingual Plane)

To allow a greater degree of compatibility across versions of the standard, the range U-00002060..U-00002069 is reserved for future format and control characters. Unassigned code points in this range should be ignored in processing and display.

Part 2, end of clause 8 (General Purpose Plane)

To allow a greater degree of compatibility across versions of the standard, the range U-000E0000..U-000E1000 is reserved for future format and control characters. Unassigned code points in this range should be ignored in processing and display.