ISO
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION

ISO/IEC JTC 1/SC 2/WG 2

Universal Multiple-Octet Coded Character Set
(UCS)

ISO/IEC JTC1/SC2/WG2 N 2175
Date: 2000-03-07

Title: Proposal to restrict the range of code positions to the values up to U-0010FFFF
Source: Unicode Technical Committee
Status: Liaison contribution
Action: Request for Technical Corrigendum to ISO/IEC 10646-1:2000

The design of UTF-16 permits addressing code positions up to 10FFFF (hexadecimal), which provides over 1,000,000 code positions. It has become clear that this range of code positions is sufficient for all foreseeable character allocations. Yet the difference in representable range between UTF-16 and the other encoding forms, UTF-8 and UCS-4, causes continued confusion and pointless disagreements among developers and users, and an unnecessary schism between Unicode and 10646. The Unicode Consortium requests the publication of a technical corrigendum to ISO/IEC 10646-1:2000 that addresses this issue.

The design of UTF-16 permits addresses up to 10FFFF (hexadecimal), which allows for over 1,000,000 characters. When SC2/WG2 decided on this size, it was four times the estimated number of characters to be encoded in ISO/IEC 10646. Current experience bears this out; no expert expects the number of encoded characters to come anywhere near 1,000,000.
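
The arithmetic behind this limit can be illustrated with a minimal Python sketch. It assumes the usual UTF-16 surrogate ranges (D800 to DBFF for the high half, DC00 to DFFF for the low half); the function name decode_surrogate_pair is hypothetical and chosen only for this illustration.

```python
# Surrogate ranges used by UTF-16: 1,024 high values and 1,024 low values.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one code position."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError("not a high surrogate")
    if not LOW_START <= low <= LOW_END:
        raise ValueError("not a low surrogate")
    # Each half contributes 10 bits; the result is offset past the BMP.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# The largest possible pair yields the largest UTF-16 addressable position.
MAX_UTF16 = decode_surrogate_pair(HIGH_END, LOW_END)

# Number of code positions from 0 up to and including that maximum:
# 17 planes of 65,536 positions each.
TOTAL_CODE_POSITIONS = MAX_UTF16 + 1
```

Under these assumptions the maximum works out to 10FFFF (hexadecimal), that is, 17 planes of 65,536 code positions each, or 1,114,112 code positions in total.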

In fact, WG2 itself decided that the encoding space would not go beyond 17 planes. Refer to the minutes of WG2 meeting #24, held in Washington DC in 1993. The minutes are in document 955; the documents discussed were 916 [US requirements for 10646] and 918 [proposing UTF-16 under the name UCS-2E]. However, this WG2 decision was not reflected in the standard.

This situation presents unnecessary interoperability problems for implementers. In particular, there are allocated ISO 10646 characters that cannot be valid Unicode characters: the private use characters in Groups 60 to 7F, from U-60000000 to U-7FFFFFFF. Although these characters are not used in practice, they represent a real, practical problem when defining conformance requirements in a host of products and standards. If this situation continues, there is little recourse but to define special versions of UTF-8 and UCS-4 that are restricted to the same domain as UTF-16 for interoperability with UTF-16. (See UTR #19: http://www.unicode.org/unicode/reports/tr19/.)
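
The mismatch described above can be made concrete with a small Python sketch. It is an illustration only: the helper names is_utf16_representable and group_of are hypothetical, and the check looks only at the upper bound of the range (it deliberately ignores the surrogate code positions).

```python
MAX_UTF16 = 0x10FFFF  # highest code position UTF-16 can address


def is_utf16_representable(code_position: int) -> bool:
    """True if a UCS-4 value lies within the range UTF-16 can address."""
    return 0 <= code_position <= MAX_UTF16


def group_of(code_position: int) -> int:
    """Group octet (most significant octet) of a 31-bit UCS-4 value."""
    return (code_position >> 24) & 0x7F


# A code position in the first private use group (Group 60) is a legal
# UCS-4 value but has no UTF-16 representation.
private_use_example = 0x60000000
```

Here group_of(private_use_example) is 60 (hexadecimal), and is_utf16_representable(private_use_example) is False, which is exactly the conformance gap the proposal aims to close.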

The Unicode Consortium requests a remedy to this situation: the publication of a technical corrigendum to ISO/IEC 10646-1:2000 which excludes values above U-0010FFFF. In this corrigendum,

The proposed corrigendum does not change the overall architecture of 10646. For example, UTF-8 is left with a possible 6-byte form, even though in practice such forms would always be reserved and never used. Suggested text for the technical corrigendum is provided below.
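
To illustrate the point about UTF-8, the following Python sketch computes how many bytes the original 31-bit UTF-8 definition needs for a given code position. The thresholds are those of the 1- to 6-byte forms; the names UTF8_LIMITS and utf8_length are hypothetical and used only for this example.

```python
# Upper limit of each UTF-8 form in the original 31-bit definition,
# paired with the number of bytes that form uses.
UTF8_LIMITS = [
    (0x7F, 1),
    (0x7FF, 2),
    (0xFFFF, 3),
    (0x1FFFFF, 4),
    (0x3FFFFFF, 5),
    (0x7FFFFFFF, 6),
]


def utf8_length(code_position: int) -> int:
    """Number of bytes the 31-bit UTF-8 definition uses for a value."""
    if code_position < 0:
        raise ValueError("code positions are non-negative")
    for limit, length in UTF8_LIMITS:
        if code_position <= limit:
            return length
    raise ValueError("outside the 31-bit UCS-4 code space")
```

With this sketch, utf8_length(0x10FFFF) is 4, so every code position that remains available under the proposal fits in at most four bytes; the 5- and 6-byte forms remain part of the architecture but would only ever be reached by reserved values.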

If this situation is addressed, developers will no longer have to worry about interoperability between the different encoding forms, and terms such as UTF-32 become simple aliases for UCS-4. This will remove a needless barrier in the path to making ISO/IEC 10646 / Unicode the truly universal character set.


Suggested text for the technical corrigendum

Clause 9.1 Planes reserved for future standardization
Current: Plane 11 to DF in Group 00 and Planes 00 to FF in Groups 01 to 5F are reserved for future standardization, and thus those code positions shall not be used for any other purpose.
New: Plane 11 to FF in Group 00 and Planes 00 to FF in Groups 01 to 7F are reserved for future standardization, and thus those code positions shall not be used for any other purpose.

NOTE - For interoperability between UTF-8, UTF-16 and UCS-4, it is not expected that any code positions will ever be allocated above U-0010FFFF.

Clause 10.2 Code positions for private use characters
Current: The code positions of the 32 groups from Group 60 to Group 7F shall be for private use.
New: <deleted>
Current: The code positions of Plane 0F and Plane 10, and of the 32 planes from Plane E0 to Plane FF, of Group 00 shall be for private use.
New: The code positions of Plane 0F and Plane 10 of Group 00 shall be for private use.
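
As an informal illustration (not part of the suggested corrigendum text), the effect of the new wording of clauses 9.1 and 10.2 can be sketched in Python. The function name classify and its return strings are hypothetical. The sketch covers only the planes and groups named in the text quoted here; other provisions of the standard, such as the private use zone in the Basic Multilingual Plane, are outside its scope.

```python
def classify(code_position: int) -> str:
    """Classify a UCS-4 value under the proposed clauses 9.1 and 10.2."""
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 code space")
    group = code_position >> 24           # most significant octet
    plane = (code_position >> 16) & 0xFF  # second octet
    if group == 0x00 and plane in (0x0F, 0x10):
        # Proposed 10.2: Planes 0F and 10 of Group 00 are private use.
        return "private use"
    if group == 0x00 and plane <= 0x0E:
        # Planes 00 to 0E of Group 00 are not addressed by the quoted text.
        return "not covered by the quoted clauses"
    # Proposed 9.1: Planes 11 to FF of Group 00 and all of Groups 01 to 7F.
    return "reserved for future standardization"
```

For example, classify(0x10FFFF) gives "private use", while classify(0x110000), classify(0x00E00000) and classify(0x60000000) all give "reserved for future standardization", so nothing above U-0010FFFF remains allocated.
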
Figure 2 - Group 00 of the Universal Multiple-Octet Coded Character Set
Current: Private use planes 0F, 10, E0 - FF
New: Private use planes 0F, 10.
<and delete arrows pointing to planes E0 - FF>
Annex A
Current: 400 PRIVATE USE PLANES G=00, P=0F, 10, & E0 - FF
New: 400 PRIVATE USE PLANES G=00, P=0F, 10

Current: 500 PRIVATE USE GROUPS G=60 - 7F
New: 500 GROUPS G=60 - 7F
<or delete 500 and leave a note regarding use of collection id 500 in earlier versions of the standard>