Re: Beyond 17 planes, was: Java char and Unicode 3.0+

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Oct 16 2003 - 13:59:00 CST


From: "Asmus Freytag" <asmusf@ix.netcom.com>

> At 08:03 AM 10/16/03 -0700, Peter Kirk wrote:
> >Or perhaps a way can be found to graciously retire UTF-16 in some distant
> >future version of Unicode. That is likely to become viable long before the
> >extra planes are needed.
>
> This discussion is a pure numbers game. Since no-one can define a hard
> number for a cut-off that's guaranteed to be good 'forever', all we have is
> probability. (That's all we have anyway, whether in life or science). So
> the question becomes an estimate of probability.
>
> 128 characters (ASCII) cover 80% of the characters needed by 5% of the
> world's population
> 256 characters (Latin-1) covers 80% of the characters needed by 15% of the
> world's population
> 40,000 characters (Unicode 1.0) covers 95% of the characters needed by 85%
> of the world's population
> 90,000 characters (Unicode 4.0) covers 98% of the characters needed by 95%
> of the world's population
>
> Exercise for the reader:
>
> Warmup:
> Where do the other 910,000 characters come from, and who's using them?

We're not discussing the addition of characters standardized jointly by
Unicode's UTC and ISO's WG2, and I don't expect many changes in that
area. The question is rather about a more general scheme in which
Unicode/ISO10646 would become one part of a larger set of standards for
encoding something other than pure text. There are already attempts to
encode attributed text, and to mix or interleave text and object data
within a unified encoding scheme.

For now, the inclusion of codepoints like the Object Replacement Character
(U+FFFC) shows that mixing text and other data in a single, unified,
serialized stream is already an issue. Of course there's now XML to add
structure to such content, but unstructured data also has its applications,
wherever a predefined schema cannot be designed.

Also, designers of glyph libraries need a way to encode and exchange them
using privately allocated codepoints, without risking collisions between
their respective PUA assignments. As PUA characters are not designed to be
interchanged, the alternative could be private reservation in a global
registry, similar to address reservation in the IPv4 space. Codepoint
usage could then be privately agreed upon between collaborating companies
that wish to unify their own codesets and reduce the number of their
assignments (a process similar to IP space aggregation and renumbering,
which has some technical issues but is solvable in the medium term).

In fact the interchangeability of PUA codepoints is still an unsolved
issue, one that could be solved in a way similar to IPv4 assignments under
IANA authority. Nothing needs to change for the current 17 planes managed
by and assigned to Unicode/ISO10646, as long as UTC and WG2 accept that
they do not need to centrally manage all character assignments for every
limited group.

Without such a scheme, there's a big risk that PUAs become permanently
assigned as part of an OS core charset, and that data created on distinct
systems becomes mutually incompatible because it uses colliding subsets of
the PUAs (this is already the case with the core fonts and script
processors used in MS Windows, and with a few private
characters/logographs used by Apple in MacOS).
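
To make the collision risk concrete, here is a minimal Python sketch. The
three PUA ranges are the ones defined by Unicode; the two vendor tables and
the function names are invented purely for illustration and do not describe
any real product's assignments.

```python
# Private Use Area ranges defined by Unicode (inclusive bounds).
PUA_RANGES = [
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
]


def is_pua(cp):
    """Return True if the codepoint lies in one of the PUA ranges."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def pua_collisions(table_a, table_b):
    """Return PUA codepoints that both tables assign, with different meanings."""
    shared = table_a.keys() & table_b.keys()
    return {
        cp: (table_a[cp], table_b[cp])
        for cp in shared
        if is_pua(cp) and table_a[cp] != table_b[cp]
    }


# Hypothetical private assignments made independently by two vendors.
vendor_a = {0xF8FF: "vendor A logo", 0xE000: "glyph variant 1"}
vendor_b = {0xF8FF: "vendor B symbol", 0xE001: "ligature"}

collisions = pua_collisions(vendor_a, vendor_b)
for cp, (meaning_a, meaning_b) in sorted(collisions.items()):
    print(f"U+{cp:04X}: {meaning_a!r} vs {meaning_b!r}")
```

Data exchanged between the two systems would be ambiguous at exactly the
codepoints this reports, which is the incompatibility described above.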

There's a huge number of candidate corporate logographs that could be
reserved simply for use within a unified scheme that includes Unicode, and
that could be negotiated within an IANA registry, with a reservation system
similar to the one for domain names. In addition, such a system could
generate some revenue to help finance Unicode and ISO10646 activities:
these private assignments remain interchangeable as long as their
registration is active in the registry.
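
As a rough sketch of what such a reservation system might look like, the
Python below hands out blocks of values above the 17 planes and treats an
assignment as valid only while its registration is active. Everything here
(the class names, the sequential allocation policy, the owners and expiry
dates) is an assumption made for illustration, not part of any existing
registry.

```python
from dataclasses import dataclass
from datetime import date

# First value past the 17 planes (U+0000..U+10FFFF).
FIRST_VALUE_BEYOND_UNICODE = 0x110000


@dataclass
class Reservation:
    owner: str
    start: int
    size: int
    expires: date

    def covers(self, cp):
        return self.start <= cp < self.start + self.size


class Registry:
    """Hypothetical registry allocating blocks sequentially, IPv4-style."""

    def __init__(self, first_free=FIRST_VALUE_BEYOND_UNICODE):
        self._next_free = first_free
        self._reservations = []

    def reserve(self, owner, size, expires):
        block = Reservation(owner, self._next_free, size, expires)
        self._next_free += size
        self._reservations.append(block)
        return block

    def lookup(self, cp, today):
        """Return the owner of cp, or None if unassigned or expired."""
        for block in self._reservations:
            if block.covers(cp) and today <= block.expires:
                return block.owner
        return None


registry = Registry()
acme = registry.reserve("acme", 256, date(2004, 12, 31))
globex = registry.reserve("globex", 16, date(2003, 1, 1))

today = date(2003, 10, 16)
print(registry.lookup(0x110005, today))  # active registration
print(registry.lookup(0x110100, today))  # registration has lapsed
```

Aggregation and renumbering, as with IP space, would need more machinery
than this; the sketch only shows disjoint blocks and expiry.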

We could even imagine implementing this system within a special domain
and using rDNS requests to get a resolved domain name corresponding to
an assigned codepoint: this domain could then contain info on how to
get glyphs, fonts, or other information supporting this private codepoint.
These glyphs could be protected by digital rights or privacy controls and
could even include registered logos, graphics, designs, ... and even
colorful photographs and artworks.
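
One possible way to form the name for such an rDNS request, by analogy
with the reversed-nibble names used for IPv6 reverse lookups, is sketched
below in Python. The zone "codepoint.example" and the fixed width of eight
hex digits are assumptions for illustration only; no such zone exists, and
the sketch builds the query name without performing any network request.

```python
# Hypothetical zone; "codepoint.example" is a placeholder, not a real registry.
REGISTRY_ZONE = "codepoint.example"


def codepoint_to_rdns_name(cp, zone=REGISTRY_ZONE):
    """Build a reverse-lookup style name from a codepoint value.

    The value is written as 8 hex digits, and the digits are reversed and
    dot-separated so that delegation can follow nibble boundaries.
    """
    if not 0 <= cp <= 0xFFFFFFFF:
        raise ValueError("codepoint out of range for 8 hex digits")
    nibbles = format(cp, "08x")
    return ".".join(reversed(nibbles)) + "." + zone


name = codepoint_to_rdns_name(0x110005)
print(name)
```

A PTR query on that name could then return the domain of whoever holds the
reservation, where glyph or font information would be published.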

I could imagine a lot of other similar applications... This does not
contradict the Unicode/ISO10646 goal, which is to keep the 17 planes
open to everybody's use and publicly accessible for global interchange
of information, through a strict policy describing the correct usage of
codepoints assigned and unified by ISO's WG2 and Unicode.org's UTC.



This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST