Re: Timetables and conventions (was RE: Chapter on character sets)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jun 16 2000 - 14:16:10 EDT


Antoine asked:

>
> Kenneth Whistler wrote:
> >
> > The same conventions will be used for citation of characters in Planes
> > above Plane 0 in Unicode Technical Reports and in the eventual republication
> > of the standard itself. In textual citations, the normal usage will
> > include the "U+" prefix: U+1D141, etc.
>
> Ah, that is new!

Yes, it is new. The UTC took this decision recently, as it had to decide
how to proceed with extensions for the data files -- and no one really
wanted to jump to 8-digit character representations.

>
> It was my understanding that we do not use U+5F (_), U+410 (Cyrillic A), etc.

Correct. The shortest U+ representation is 4 digits. For Planes 1..15,
it will be 5 digits, and for Plane 16, 6 digits.
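
As a rough illustration of that digit rule (a sketch only, not anything
the UTC has published; the function name format_u_plus is made up for
this example), zero-padding to a minimum of 4 hex digits gives the
4-, 5- and 6-digit forms automatically:

```python
def format_u_plus(code_point: int) -> str:
    """Return the U+ citation form for a code point (hypothetical helper)."""
    # Range 0..0x10FFFF is the integer range quoted in this thread.
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"code point out of range: {code_point:#x}")
    # Pad to at least 4 hex digits; larger values naturally take 5 or 6.
    return f"U+{code_point:04X}"


print(format_u_plus(0x5F))      # a BMP character, padded to 4 digits
print(format_u_plus(0x1D141))   # a Plane 1 character, 5 digits
print(format_u_plus(0x10FFFD))  # a Plane 16 character, 6 digits
```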

> The U+ notation is carefully described in ISO 10646 (so I think), and
> I remember reading that U+xxxx is the same as U-0000xxxx (which means
> that there is a relationship between UCS-2 and UCS-4), so I expected
> U-0001D141 instead.

The U- notation will be unchanged, and is an alternative that people
can use if they wish.

However, it was the considered opinion of the UTC and of the
editors of the Unicode Standard that 1D141 (instead of 0001D141)
in the data files, and U+1D141 (instead of U-0001D141) in textual
citations, will simply be more convenient, easier to understand,
and less error-prone.
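
To make the equivalence of the two notations concrete, here is a
minimal sketch that reads either form into the same integer. The name
parse_citation and the exact pattern are assumptions for illustration:
it accepts U+ followed by 4 to 6 hex digits or U- followed by exactly
8, and it limits both to the 0..0x10FFFF range discussed in this
thread.

```python
import re

# U+ with 4-6 hex digits, or U- with exactly 8 hex digits (assumed pattern).
_CITATION = re.compile(r"U\+([0-9A-Fa-f]{4,6})|U-([0-9A-Fa-f]{8})")


def parse_citation(text: str) -> int:
    """Convert a U+ or U- citation to an integer (hypothetical helper)."""
    match = _CITATION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a U+ or U- citation: {text!r}")
    # Exactly one of the two groups matched; the other is None.
    value = int(match.group(1) or match.group(2), 16)
    if value > 0x10FFFF:
        raise ValueError(f"value out of range: {text!r}")
    return value


# Both spellings of the same character yield the same integer.
print(parse_citation("U+1D141") == parse_citation("U-0001D141"))
```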

The UTC intends to take up this notational issue with WG2 through
a technical corrigendum to 10646, so that 10646 will also allow
the 5- and 6-digit U+ forms as official 10646 short names for
characters.

> > Parsers of the Unicode Character Database files
> > will have to be modified if they have built-in assumptions that
> > character values are always 4-digit hex values. Now they should be
> > extended to allow for 6-digit hex values in the data files, and they
> > should be prepared to cope with integers in the range 0..0x10FFFF,
> > rather than just integers in the range 0..0xFFFF.
>
> Since they are already assigned, I believe that Ken can very easily
> create a 3.1-alpha release, with material identical to 3.0 except two
> added lines, one with 0xF000 <First extended PU>, and a second
> with 0x10FFFF <Last extended PU>. So certainly people could adjust
> their parsers "live".
> Prospective ;-)

Actually, it is currently under discussion for the Unicode 3.0.1
update version release, which is imminent. (No new characters, just some
minor fixes for some data files, etc.)

UnicodeData.txt, which currently contains entries like:

E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;

may be extended to contain corresponding entries:

F0000;<Plane 15 Private Use, First>;Co;0;L;;;;;N;;;;;
FFFFD;<Plane 15 Private Use, Last>;Co;0;L;;;;;N;;;;;
100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;

which would give everybody an easy test to see if their parsers
croak!
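
For anyone who wants to try that test, here is a minimal sketch of a
line parser that does not assume 4-digit values. It is only an
illustration: the name parse_unicode_data_line is made up, and it
looks at just the first three semicolon-separated fields (code value,
name, general category) of entries shaped like the ones above.

```python
import re

# Code values may be 4, 5 or 6 hex digits.
_HEX_CODE = re.compile(r"[0-9A-Fa-f]{4,6}")


def parse_unicode_data_line(line: str) -> tuple[int, str, str]:
    """Return (code point, name, general category) for one entry.

    Hypothetical helper; the remaining fields are ignored.
    """
    fields = line.rstrip("\r\n").split(";")
    if len(fields) < 3:
        raise ValueError(f"too few fields: {line!r}")
    code_text, name, category = fields[0], fields[1], fields[2]
    if _HEX_CODE.fullmatch(code_text) is None:
        raise ValueError(f"bad code value: {code_text!r}")
    code_point = int(code_text, 16)
    # Be prepared for integers in 0..0x10FFFF, not just 0..0xFFFF.
    if code_point > 0x10FFFF:
        raise ValueError(f"code value out of range: {code_text!r}")
    return code_point, name, category


sample = [
    "F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;",
    "F0000;<Plane 15 Private Use, First>;Co;0;L;;;;;N;;;;;",
    "10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;",
]
for entry in sample:
    print(parse_unicode_data_line(entry))
```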

--Ken

>
> Antoine
>


