Re: fictional scripts revisited

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Feb 22 2001 - 19:01:18 EST


Thomas Chan noted:

>
> > > At the inception of various other fictional scripts, no one could foresee
> > > the growth of scholarly and/or amateur interest in them;
> >
> > True. That's why we wait until there is, before we consider encoding
> > a script.
>
> Yes, I agree. It is harder to find historical scripts and characters than
> to create new ones, and it is the latter, especially for large sets like
> the two I raised, whose rate of inclusion must be tempered.

I have a couple of points to make on this general topic.

1. The rate at which WG2 and the UTC can encode characters is finite.

The recent massive inclusion of Vertical Extension B (CJK Unified
Ideographs Extension B) is a one-time special case. It reflects the
accumulation of all the known major Han sources over a decade, dumped
rather suddenly into the standard when
the non-BMP encoding began. That rate of accumulation of Han characters
cannot be sustained, since the dam break suddenly drained the reservoir.
We can expect, of course, that the IRG will come up with another 1000
here and another 1000 there over the upcoming years, but it is likely
to take a decade to even fill Plane 2 with such additions now.

When considering all *other* characters in Unicode, some interesting
statistics turn up. Here are approximate figures for non-Han characters
encoded per year, using the Unicode Standard publication milestones
as the rough benchmarks:

1989 - 1991 (Unicode 1.0)  3549 chars/year
1991 - 1993 (Unicode 1.1)  2933 chars/year
1993 - 1996 (Unicode 2.0)  1572 chars/year
1996 - 2000 (Unicode 3.0)   931 chars/year
2000 - 2001 (Unicode 3.1)  1693 chars/year

Note the downward trend. The anomaly of Unicode 3.1 derives entirely
from the encoding of approximately 1000 mathematical alphanumeric
characters, which could be handled as one big chunk (and which drained
much of the accumulated mathematics source pool).
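
As a back-of-the-envelope illustration, here is a minimal Python
sketch that turns the per-year rates in the table above back into
rough per-interval totals. The rates and year ranges are taken
straight from the table; the interval lengths (and hence the implied
totals) are a naive approximation of mine, not official counts from
the standard.

    # Rough per-interval totals implied by the chars/year figures above.
    # Interval lengths are read naively off the year ranges, so the
    # totals are approximations, not official Unicode character counts.
    intervals = [
        ("Unicode 1.0", 1989, 1991, 3549),
        ("Unicode 1.1", 1991, 1993, 2933),
        ("Unicode 2.0", 1993, 1996, 1572),
        ("Unicode 3.0", 1996, 2000, 931),
        ("Unicode 3.1", 2000, 2001, 1693),
    ]

    for version, start, end, rate in intervals:
        years = end - start
        print(f"{version}: ~{rate * years} non-Han characters "
              f"over {years} year(s)")

The shrinking totals tell the same story as the table itself.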

The downward trend results from the same draining-of-the-source-pool
phenomenon. As all the obvious candidates are encoded, it gets
harder and harder to pull together the documentation and the will to
encode ever more obscure minority and historic scripts or
special-use characters.

The trend will continue. Unicode 3.2 is likely to take about a
year to publish, and will add about 1000 non-Han characters. After
that, Unicode 4.0 has even less committed in the way of new
characters to encode.

Han aside, given the remaining pool of candidate characters and
the known working methods of the two standardizing committees, I
consider it unlikely that the standard will ever again exceed 1000
characters/year encoded after the publication of Unicode 3.2.
There might be individual bumps here and there, as when Tangut or
Egyptian hieroglyphs finally get done, for example, but the
long-term trend is clear.

So, what does this mean for "precious codespace"? For the sake
of argument, cede Planes 2 and 3 to the Han lexicographers. That leaves
788,414 available code points for everything else (7,793 on the BMP,
63,843 on Plane 1, 655,340 on Planes 4 - 13, and 61,438 on the
part of Plane 14 not currently reserved for format characters).
At a maximum average rate of 1000 characters/year that the committees
can push through standardization, that is 788 years' worth
of space to encode.
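
To make that arithmetic explicit, here is a minimal Python sketch
using only the figures quoted above; the 1000 characters/year cap is
the assumed maximum sustained rate from the preceding paragraphs.

    # Code points left over once Planes 2 and 3 are ceded to Han,
    # using the per-plane figures quoted above.
    available = {
        "BMP": 7_793,
        "Plane 1": 63_843,
        "Planes 4-13": 655_340,
        "Plane 14 (non-format portion)": 61_438,
    }

    total = sum(available.values())
    print(total)          # 788414 remaining code points
    print(total // 1000)  # ~788 years at 1000 characters/year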

In other words, unless someone manages to wrest the standard away
from the two committees and puts up a public website with an
"Encode Your Character Here For Free and Enter Our Sweepstakes!"
interface, I'm not going to worry about "precious codespace" and
neither should anybody else.

In fact, some of us on the inside of this process have the opposite
worry. As the Unicode Standard is ever more widely implemented
and put into use, the aggregated incremental cost of additions to the
standard rises, since more and more implementations have to be updated
to deal with each new character and feature added.
The result is an ever-accumulating resistance to change, coming from
more and more directions, which will eventually make it nearly
impossible to add anything new to the standard.
The race is actually on to see that minority scripts and historic
scripts are covered before that resistance reaches the level
that makes such additions infeasible.

2. While the New English Calligraphy (NEC) phenomenon is interesting,
it is unlikely to ever be a serious candidate for standardization.
Why? Because it is promulgated by artists as *anti*-writing:

"When people try to recognize and write these words, some of the
thinking patterns that have been ingrained in them since they
learned to read are challenged. It is the artists' belief that
people must have their routine thinking attacked in this way. While
undergoing this process of estrangement and re-familiarizing with
one's written language, one can be reminded that the sensation
of distance between other systems is self-induced."

This is the very antithesis of what standardization of writing
systems is about. And as a form of calligraphy, NEC (like other
calligraphies) is not about encoded characters, but about artistic
expression (or other artistically related agendas) based on
writing. Attempting to *standardize* NEC in terms of encoded
characters would be an attempt to pigeonhole it back into the units
needed for standard information interchange of textual data, and
would make no sense from the artistic point of view. It would be
perverse indeed.

And I think this is rather distinct from the kind of situation that
Tengwar and Cirth are in. Those were constructed for literary purposes
as writing systems. And they are used as such in published material.
Encoding them would have a benefit for those who wish to
interchange such material. And encoding them poses no particular
issues for the Unicode Standard. The argument is only over the
marginal costs of the encoding and the ideological positions
regarding what is *deserving* of attention for encoding.

So to return to the topic of this thread, there are "fictional
scripts" and then there are other things. In my opinion Tengwar and
Cirth clearly do belong in the Roadmap for 10646/Unicode, while
phenomena such as the New English Calligraphy do not.

--Ken


