Timetables and conventions (was RE: Chapter on character sets)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jun 16 2000 - 13:17:16 EDT


Doug Ewell asked:

> Two questions:
>
> 1. What is the projected timetable for the first version of Unicode that
> contains character assignments beyond Plane 0? I'm just wondering,
> not trying to seem impatient. (Really.)

Unicode 3.1 is still tentatively scheduled to appear as a technical
report late this fall. The key event is the progress of the international
balloting on ISO/IEC 10646-2. Resolution of comments on the current ballot
is scheduled for the WG2 meeting in Athens, Greece, late in September.
After that meeting, assuming all goes well, the technical issues will have
been resolved, and 10646-2 will move on to its last (DIS) stage of balloting.

Again, assuming all goes well in Athens, we will then know the code points
and names that will be published eventually (early 2001?) in 10646-2,
and can proceed to produce the requisite additions to UnicodeData.txt
and Unihan.txt, to roll out Unicode 3.1 as a technical report.

>
> 2. How will UnicodeData.txt in particular be modified to represent the
> scalar values of characters beyond Plane 0? Will the first column
> use 5, 6, or 8 hex digits?

For characters from Planes 1..15, the first column will use 5 hex digits.
For characters from Plane 16, the first column will use 6 hex digits.

(Note that Planes 15 and 16 are completely devoted to private use
characters, so no standard characters will ever be assigned in Planes
15 or 16 anyway.)

The same conventions will be used for citation of characters in Planes
above Plane 0 in Unicode Technical Reports and in the eventual republication
of the standard itself. In textual citations, the normal usage will
include the "U+" prefix: U+1D141, etc.
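
A minimal sketch of that digit convention in Python, for illustration only.
The function names format_code_point and cite are hypothetical and do not
come from any Unicode data file or tool; the rule is simply "at least 4 hex
digits, more only when the value needs them."

```python
MAX_SCALAR = 0x10FFFF  # top of Plane 16


def format_code_point(value: int) -> str:
    """Format a code point as it would appear in the first column.

    Plane 0 values are zero-padded to 4 hex digits; Planes 1..15 come
    out as 5 digits and Plane 16 as 6 digits, with no extra padding.
    """
    if not 0 <= value <= MAX_SCALAR:
        raise ValueError(f"not in range 0..0x10FFFF: {value!r}")
    return f"{value:04X}"


def cite(value: int) -> str:
    """Textual citation form with the "U+" prefix."""
    return "U+" + format_code_point(value)


print(format_code_point(0x0041))    # 4 digits, Plane 0
print(format_code_point(0x1D141))   # 5 digits, Plane 1
print(format_code_point(0x10FFFD))  # 6 digits, Plane 16
print(cite(0x1D141))
```

Padding to a minimum of 4 digits rather than a fixed 6 keeps existing
Plane 0 entries byte-for-byte unchanged.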

> Will the scalar values of Plane 0
> characters continue to use only 4 hex digits?

Yes.

> What compatibility
> problems might be introduced?

Well, the obvious ones. Parsers of the Unicode Character Database files
will have to be modified if they have built-in assumptions that
character values are always 4-digit hex values. They should now be
extended to accept values of 4, 5, or 6 hex digits in the data files, and
they should be prepared to cope with integers in the range 0..0x10FFFF,
rather than just integers in the range 0..0xFFFF.
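
As a rough illustration of the kind of change involved, here is a sketch of
a first-column parser in Python. The function name parse_first_field is
hypothetical, and the sketch assumes only that fields are separated by
semicolons, as in the existing UnicodeData.txt. The Plane 1 sample line uses
a placeholder name, not a real character name.

```python
import re

# Accept 4, 5, or 6 hex digits instead of exactly 4.
_CODE_POINT = re.compile(r"[0-9A-Fa-f]{4,6}")
MAX_SCALAR = 0x10FFFF


def parse_first_field(line: str) -> int:
    """Return the code point in the first semicolon-separated field."""
    field = line.split(";", 1)[0].strip()
    if not _CODE_POINT.fullmatch(field):
        raise ValueError(f"bad code point field: {field!r}")
    value = int(field, 16)
    # Six hex digits can express values above 0x10FFFF, so check the range.
    if value > MAX_SCALAR:
        raise ValueError(f"beyond 0x10FFFF: {field!r}")
    return value


plane0 = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
plane1 = "1D141;PLACEHOLDER NAME;So;0;L;;;;;N;;;;;"  # illustrative record only

print(hex(parse_first_field(plane0)))
print(hex(parse_first_field(plane1)))
```

Anything that stores the result in a 16-bit integer needs widening as well;
the parsing change alone is not enough.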

The standard has been warning people about UTF-16 for a long time now,
but now it is time for people to really bite the bullet, and be
prepared for the extended scalar range for assigned characters,
and the surrogate pairs needed for their expression in UTF-16.
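
For reference, the arithmetic that maps a scalar value above 0xFFFF to its
UTF-16 surrogate pair can be sketched as follows. The function name
to_surrogate_pair is hypothetical; the constants are the ones defined for
UTF-16 (high surrogates from 0xD800, low surrogates from 0xDC00, ten bits
carried by each).

```python
def to_surrogate_pair(value: int) -> tuple[int, int]:
    """Split a scalar value in 0x10000..0x10FFFF into (high, low)."""
    if not 0x10000 <= value <= 0x10FFFF:
        raise ValueError(f"no surrogate pair needed or possible: {value!r}")
    offset = value - 0x10000           # 20-bit offset above Plane 0
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D141)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(0x1D141).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```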

> > You can see this trend already on the list when people are discussing
> > characters under ballot for 10646-2. They are referred to by their
> > scalar values, and not by surrogate pairs, except when something about
> > the UTF-16 encoding form is what is at issue.
>
> Maybe this refers to the unicore list, because (regrettably) I haven't
> seen any discussion on this list of proposed characters beyond Plane 0
> where actual code points are specified.

Yes, I expect that most of the discussion is occurring on the unicore
list, since that is where the UTC issues regarding ballot positions,
etc., are discussed.

--Ken


