L2/02-398

Date/Time:    Sun Nov  3 09:38:26 EST 2002
Contact:      verdy_p@wanadoo.fr

Proposal to support the full UCS-4 repertoire by extending the UTF-16 transformation format:

The UTF-16 transformation format can map only UCS-4 characters in the range 0 to 0x10FFFF, using either a single 16-bit Unicode code-point, in the range 0 to 0xD7FF or 0xE000 to 0xFFFF, or a pair of surrogate code-points:
- a leading high surrogate in the range 0xD800 to 0xDBFF, encoding planes 1 to 0x10 of group 0 together with the high 6 bits of the position within the plane, followed by
- a trailing low surrogate in the range 0xDC00 to 0xDFFF, encoding the lowest 10 bits of the UCS-4 code-point.
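
As a reminder of the existing mechanism, here is a minimal Python sketch of the current UTF-16 mapping described above. The function name utf16_encode is chosen for illustration only and is not taken from any standard.

```python
def utf16_encode(cp):
    """Encode one code-point (0..0x10FFFF) as a list of 16-bit code units."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the range that UTF-16 can represent")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code-points cannot be encoded")
    if cp < 0x10000:
        # BMP: a single 16-bit code unit.
        return [cp]
    # Planes 1 to 0x10: subtract 0x10000, leaving 20 bits to split 10 + 10.
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
```

For example, utf16_encode(0x10FFFF) gives the pair 0xDBFF, 0xDFFF, while any value above 0x10FFFF is rejected, which is the limitation this proposal addresses.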

ISO 10646-1 defines standard code-points for user-defined codes in groups 0x60 to 0x7F. These cannot currently be mapped to UTF-16, which is already an incompatibility with Unicode.

Font designers need a way to define glyph decompositions for many languages, notably for South Asian text, in order to implement the ordering rules. Even though glyphs are not encoded in Unicode, because they are not characters, they are already described by Unicode reference texts, and there will be a need to specify glyph code positions in the future to allow font interoperability, using standard glyph composition rules to map characters. This normalization will likely be performed by ISO 10646, which will probably assign code-points in the UCS-4 repertoire outside the first 17 planes currently covered by Unicode, and Unicode should prepare for this.

There should be a standard way to extend UTF-16 to cover the full UCS-4 repertoire, by encoding the 7 bits of the UCS-4 group number and the 8 bits of the UCS-4 plane number before a surrogate pair that encodes the lowest 16 bits of the UCS-4 code-point. Each such code-point would then need a total of 4 surrogates, taken from reserved code-point ranges.

Other applications would be to assign code-points in UCS-4 for break rules and collation rules, by creating new transcoding tables of Unicode code-points (of the first 15 standardized planes for text character encoding).

Usage of this new extended UTF-16 encoding would not be allowed for text encoding, but purely for interoperability of code that manages the ordering, collation, breaking or rendering of text encoded within the existing Unicode set. This encoding would be used by applications that cannot use the UTF-8 serialization because of storage issues (when the minimum code-unit storage is 16 bits), memory alignment constraints or performance issues.

I propose to define 3 or 4 new subranges in Unicode to allow complete transformation of UCS-4:

1) First proposal with 3 subranges:
It reuses the 0xDC00 to 0xDFFF subrange for the trailing low surrogates, encoding 10 bits.
The remaining 21 bits are split and encoded into three surrogates for:
- the 7-bit UCS-4 group number, followed by
- the 8-bit UCS-4 plane number, followed by
- the high 6 bits of the UCS-4 row number.
The sequence is terminated by the 10 bits of the trailing low surrogate in the range 0xDC00 to 0xDFFF.
The encoding sequence is then:
- A Lead Group Surrogate in the new range 0xAA80 to 0xAAFF, encoding group numbers 0x00 to 0x7F
- A Medium Plane Surrogate in the new range 0xAB00 to 0xABFF, encoding plane numbers 0x00 to 0xFF in that group
- A Medium Row Surrogate in the new range 0xD7C0 to 0xD7FF, encoding the high 6 bits of the UCS-4 row number in that plane
- The standard Trailing Low Surrogate in the existing range 0xDC00 to 0xDFFF
Strict conformance would require that the leading surrogate pairs (0xAA80,0xAB00) to (0xAA80,0xAB10) be made invalid, as they would not be the shortest form to serialize planes 0 to 0x10 of group 0, i.e. the existing Unicode planes.
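
The four-surrogate sequence above can be sketched in Python as follows. This is only an illustration of the bit layout under the ranges proposed here; the names encode_extended and decode_extended are hypothetical, and the shortest-form rule is implemented by rejecting anything below 0x110000.

```python
def encode_extended(cp):
    """Encode a UCS-4 code-point beyond 0x10FFFF as 4 surrogates (first proposal)."""
    if not 0x110000 <= cp <= 0x7FFFFFFF:
        raise ValueError("use plain UTF-16 below 0x110000; UCS-4 has only 31 bits")
    group = cp >> 24             # 7-bit group number
    plane = (cp >> 16) & 0xFF    # 8-bit plane number
    row_hi = (cp >> 10) & 0x3F   # high 6 bits of the row number
    low10 = cp & 0x3FF           # lowest 10 bits
    return [0xAA80 | group, 0xAB00 | plane, 0xD7C0 | row_hi, 0xDC00 | low10]


def decode_extended(units):
    """Decode exactly 4 surrogates back to a UCS-4 code-point."""
    g, p, r, t = units
    if not (0xAA80 <= g <= 0xAAFF and 0xAB00 <= p <= 0xABFF
            and 0xD7C0 <= r <= 0xD7FF and 0xDC00 <= t <= 0xDFFF):
        raise ValueError("ill-formed extended sequence")
    cp = ((g - 0xAA80) << 24) | ((p - 0xAB00) << 16) | ((r - 0xD7C0) << 10) | (t - 0xDC00)
    if cp < 0x110000:
        # Lead pairs (0xAA80,0xAB00) to (0xAA80,0xAB10) are not the shortest form.
        raise ValueError("non-shortest form for an existing Unicode plane")
    return cp
```

For instance, the first private-use code-point of group 0x60, 0x60000000, would be serialized as 0xAAE0, 0xAB00, 0xD7C0, 0xDC00.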

This encoding may create an ambiguity for the trailing low surrogate (in range 0xDC00 to 0xDFFF), as the code-units preceding it would be either 1 leading high surrogate in (0xD800-0xDBFF), or a sequence of 3 leading surrogates in (0xAA80-0xAAFF,0xAB00-0xABFF,0xD7C0-0xD7FF).

However, given that a trailing surrogate in 0xDC00-0xDFFF must already be tested to check that it is preceded by a high surrogate in range 0xD800-0xDBFF, this does not create a security issue if the test is extended to allow the range 0xD7C0-0xDBFF for the previous code-unit. In that case, if the previous code-unit is in 0xD800-0xDBFF, we know that the sequence encodes a Unicode character. Otherwise it encodes a UCS-4 code-point outside Unicode, and we know that we have 3 leading surrogates.
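
The extended test on the previous code-unit can be sketched as below. This is a minimal Python illustration, assuming the ranges of the first proposal; the function name sequence_start is hypothetical.

```python
def sequence_start(units, i):
    """Given the index i of a trailing low surrogate, return the index of the
    first code-unit of the sequence it terminates."""
    if not 0xDC00 <= units[i] <= 0xDFFF:
        raise ValueError("not a trailing low surrogate")
    if i == 0:
        raise ValueError("isolated trailing surrogate")
    prev = units[i - 1]
    if 0xD800 <= prev <= 0xDBFF:
        # Existing UTF-16 pair: a Unicode character in planes 1 to 0x10.
        return i - 1
    if 0xD7C0 <= prev <= 0xD7FF:
        # Medium Row Surrogate: two more leading surrogates must precede it.
        if (i >= 3 and 0xAB00 <= units[i - 2] <= 0xABFF
                and 0xAA80 <= units[i - 3] <= 0xAAFF):
            return i - 3
    raise ValueError("ill-formed sequence before trailing surrogate")
```

A single comparison of the previous code-unit against 0xD7C0-0xDBFF is enough to reject an isolated trailing surrogate; the second comparison against 0xD800 then selects between the 2-unit and the 4-unit form.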

Each range of this proposal is distinct so that we always know how many surrogates must follow or can precede any code-position.
This proposal requires allocating 3 ranges: a 6-bit range, an 8-bit range, and a 7-bit range, i.e. 64+256+128=448 code-points.
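
To illustrate why distinct ranges make the sequence length predictable, here is a small Python sketch that classifies a single code-unit and recomputes the allocation cost. The table layout and the names RANGES and classify are assumptions made for this example.

```python
# (role, first, last, number of code-units that must follow)
RANGES = [
    ("lead group surrogate",   0xAA80, 0xAAFF, 3),  # new, 7 bits
    ("medium plane surrogate", 0xAB00, 0xABFF, 2),  # new, 8 bits
    ("medium row surrogate",   0xD7C0, 0xD7FF, 1),  # new, 6 bits
    ("high surrogate",         0xD800, 0xDBFF, 1),  # existing
    ("trailing low surrogate", 0xDC00, 0xDFFF, 0),  # existing
]


def classify(unit):
    """Return (role, units_that_must_follow) for one 16-bit code-unit."""
    for role, first, last, following in RANGES:
        if first <= unit <= last:
            return role, following
    return "single code-unit", 0


# Code-points to reserve: only the three new ranges count.
new_code_points = sum(last - first + 1 for _, first, last, _ in RANGES[:3])
```

Here new_code_points evaluates to 448, matching the 64+256+128 count given above.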

2) Second Proposal with 4 subranges:
If the ambiguity about the trailing surrogate is considered dangerous, we will need to allocate a separate range for the trailing surrogates. We could still avoid needing more than 4 surrogates, but allocating a large 10-bit range is not feasible. Instead we could allocate an 8-bit range for the new trailing low surrogates (that encode the UCS-4 cell numbers in a row), two 8-bit ranges for the row number and plane number, and one 7-bit range for the group number, for a total of 128+256*3=896 code-points.
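
The bit split of this second proposal can be sketched in Python as follows. No range positions are given for it, so the base values are passed in by the caller; the values in PLACEHOLDER_BASES are arbitrary placeholders for illustration, not proposed allocations, and the name encode_four_range is hypothetical.

```python
# Arbitrary placeholder bases (group, plane, row, cell); NOT part of the proposal.
PLACEHOLDER_BASES = (0xA880, 0xA900, 0xAA00, 0xAB00)


def encode_four_range(cp, bases):
    """Split a UCS-4 code-point into group (7 bits), plane, row and cell
    (8 bits each) and offset each field into its own range."""
    if not 0x110000 <= cp <= 0x7FFFFFFF:
        raise ValueError("use plain UTF-16 below 0x110000; UCS-4 has only 31 bits")
    group_base, plane_base, row_base, cell_base = bases
    return [
        group_base + (cp >> 24),
        plane_base + ((cp >> 16) & 0xFF),
        row_base + ((cp >> 8) & 0xFF),
        cell_base + (cp & 0xFF),
    ]


# One 7-bit range plus three 8-bit ranges.
code_points_needed = 128 + 3 * 256
```

With the placeholder bases, 0x60123456 would become 0xA8E0, 0xA912, 0xAA34, 0xAB56, and no existing surrogate range is reused, at the cost of 896 reserved code-points.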

However, this alternate proposal requires reserving twice as many code-points in the BMP. We cannot allocate any of these ranges in the currently unallocated area between the existing Hangul Syllables and the Surrogates Area, so all of them would have to be placed before the Hangul Syllables area.

I think that the first proposal is more efficient, and the ambiguity of the final surrogate is not critical, as the validity constraints fix the behavior.