UTF-16 Beyond U+10FFFF (was: Java char and Unicode 3.0+)

From: Jill Ramonsky (Jill.Ramonsky@aculab.com)
Date: Thu Oct 16 2003 - 08:35:02 CST

Here's an alternative idea.

In UTF-16, as it's currently defined, codepoints in the range U+010000
to U+10FFFF are represented as some High Surrogate (HS) followed by some
Low Surrogate (LS). Also, as currently defined, any HS not followed by
an LS, or an LS not preceded by an HS, is illegal.

So, to create even higher codepoints still, all you have to do is use
some currently illegal sequences. For example:

HS + LS => 10 bits from HS plus 10 bits from LS (as now)
[This gives a range of 0x00000 to 0xFFFFF, to which we add 0x10000
giving an actual range of U+10000 to U+10FFFF]

HS + HS + LS => 10 bits from first HS plus 10 bits from second HS plus
10 bits from LS
[This gives a range of 0x00000000 to 0x3FFFFFFF, to which we can add
0x110000 giving an actual range of U+110000 to U+4010FFFF]

HS + HS + HS + LS => 10 bits from first HS plus 10 bits from second HS
plus 10 bits from third HS plus 10 bits from LS
[This gives a range of 0x0000000000 to 0xFFFFFFFFFF, to which we can add
0x40110000 giving an actual range of U+40110000 to U+1004010FFFF]
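
To make the arithmetic above concrete, here is a minimal Python sketch of
the proposed scheme. It is an illustration only, not anything defined by
Unicode or by standard UTF-16: the names offset, encode and decode are
hypothetical, and it assumes that the first HS carries the most
significant 10 bits (as in ordinary surrogate pairs).

```python
HS_MIN, HS_MAX = 0xD800, 0xDBFF   # high surrogate range
LS_MIN, LS_MAX = 0xDC00, 0xDFFF   # low surrogate range


def offset(n_high):
    """Value added to the raw bits of a sequence with n_high high surrogates.

    offset(1) = 0x10000, offset(2) = 0x110000, offset(3) = 0x40110000:
    each longer form starts right after the previous form's range ends.
    """
    off = 0x10000
    for k in range(1, n_high):
        off += 1 << (10 * (k + 1))
    return off


def encode(cp):
    """Encode one codepoint as a list of 16-bit code units."""
    if cp < 0:
        raise ValueError("negative codepoint")
    if cp < 0x10000:
        if HS_MIN <= cp <= LS_MAX:
            raise ValueError("surrogate values are not codepoints")
        return [cp]
    # Find the smallest number of high surrogates whose range contains cp.
    n = 1
    while cp >= offset(n) + (1 << (10 * (n + 1))):
        n += 1
    value = cp - offset(n)
    low = LS_MIN | (value & 0x3FF)      # lowest 10 bits go in the LS
    value >>= 10
    highs = []
    for _ in range(n):                  # remaining bits, 10 per HS
        highs.append(HS_MIN | (value & 0x3FF))
        value >>= 10
    return highs[::-1] + [low]          # most significant HS first (assumed)


def decode(units):
    """Decode a list of code units into a list of codepoints."""
    out, highs = [], []
    for u in units:
        if HS_MIN <= u <= HS_MAX:
            highs.append(u)             # start-or-middle unit: keep collecting
        elif LS_MIN <= u <= LS_MAX:
            if not highs:
                raise ValueError("LS not preceded by an HS")
            value = 0
            for h in highs:
                value = (value << 10) | (h & 0x3FF)
            value = (value << 10) | (u & 0x3FF)
            out.append(value + offset(len(highs)))
            highs = []
        else:
            if highs:
                raise ValueError("HS run not terminated by an LS")
            out.append(u)
    if highs:
        raise ValueError("HS run not terminated by an LS")
    return out
```

With these definitions, encode(0x110000) gives [0xD800, 0xD800, 0xDC00],
the first sequence that current UTF-16 treats as illegal.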

This system can be extended indefinitely, and conflicts with current
UTF-16 only in that it gives meaning to currently illegal sequences.
Observe, however, that it is still always possible to distinguish an
"end" surrogate from a "start-or-middle" surrogate, and that if you
start parsing a sequence in the middle, it will always be possible to
step either backwards or forwards to determine the start or end of a
codepoint sequence.
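
The resynchronisation property can be sketched as follows. Again this is
only an illustration under the proposed (non-standard) scheme;
sequence_bounds is a hypothetical helper name. Starting from any index,
it steps forwards to the terminating LS and backwards over the run of HS
units.

```python
def is_high(u):
    """True for a "start-or-middle" unit (high surrogate)."""
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    """True for an "end" unit (low surrogate)."""
    return 0xDC00 <= u <= 0xDFFF


def sequence_bounds(units, i):
    """Return (start, end) so that units[start:end] is the sequence holding units[i]."""
    if not is_high(units[i]) and not is_low(units[i]):
        return i, i + 1                 # ordinary BMP unit stands alone
    # Step forwards over any high surrogates to reach the closing LS.
    end = i
    while end < len(units) and is_high(units[end]):
        end += 1
    if end == len(units) or not is_low(units[end]):
        raise ValueError("HS run not terminated by an LS")
    # Step backwards from the LS over the whole run of high surrogates.
    start = end
    while start > 0 and is_high(units[start - 1]):
        start -= 1
    if start == end:
        raise ValueError("LS not preceded by an HS")
    return start, end + 1
```

For example, in [0x41, 0xD800, 0xD801, 0xDC00, 0x42], starting at index 2
(a middle surrogate) or index 3 (the end surrogate) both yield (1, 4).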


> -----Original Message-----
> From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
> Sent: Thursday, October 16, 2003 2:33 PM
> To: unicode@unicode.org
> Subject: Re: Java char and Unicode 3.0+ (was:Canonical equivalence in
> rendering: mandatory or recommended?)
> From: "John Cowan" <cowan@mercury.ccil.org>
> > Philippe Verdy scripsit:
> >
> > > I am also doubting, but I would not bet on it. After all, when Unicode
> > > started, a single plane was considered waaaaaay more than sufficient
> > > too.
> >
> > I not only would bet on it, I actually have a bet on it. Henry Thompson
> > of the W3C's Schema WG bet me that we'd outrun the existing planes within
> > five years; four left to go and no sign of it, even if Michael Everson
> > were to achieve pluripresence and actually get everything accepted into
> > the standard that he knows needs to be done.
>
> Just for the case it would be needed, are you keeping an unassigned range
> in the BMP so that extension will remain possible to preserve an ascending
> compatibility or support for UTF-16 which currently is the main reason why
> there are for now 17 planes defined?
> (for example in the range between Hangul syllables and existing surrogates)
> That's OK not to document it officially for now, but it seems that a prudent
> and conservative policy to keep such a range available in the BMP
> for the future is needed. Of course, if there's an evolution, this would
> require a later update to the current UTF-8 and UTF-16 conforming rules.
> I'm not asking to document it now, but to keep it in mind and not fully
> filling the BMP so that UTF-16 would become impossible to upgrade to
> the possible future scheme (such provisions already exist natively in UTF-8
> and UTF-32, since its origin by X/Open and their initial documentation in
> a RFC).
