Re: Addition of remaining two Maltese Characters to Unicode

From: Peter_Constable@sil.org
Date: Tue Aug 01 2000 - 15:04:30 EDT


>Naturally, solution 1 would be simpler - given the availability of space in
>the Latin Extended area, I don't really see what the fuss is all about.
>Historically, 'gh' and 'ie' have always been regarded as single characters -
>if 'ie' was represented by some unique strange symbol, no one would ever
>think about not including it together with the rest of the Maltese
>characters already in Unicode.

If it was some strange symbol, then it would be *new* - there would not be
any existing way to encode it. But it's not a strange symbol, it's "ie",
and there is an existing potential way to encode it, <0069, 0065>.

The fuss is about this: it may seem like there's a bunch of available space
and that this is just a couple of characters. But the reality is that it's
not just a couple of characters at stake here, it's hundreds. There is a
steady flow of requests for digraphs or precomposed base&diacritic
characters for very much the same reasons: "in my language, it's a unit and
not a sequence, and it has its own behaviours". But in almost every case,
there is another way to deal with the behaviours. In the meantime, UTC
already has a huge workload, and the prospect of evaluating proposals for
lots of digraphs and precomposed graphemes probably isn't too appealing,
but it's not even that that stops them. It's that there are deeper issues
that people aren't always aware of, things that pertain to implications for
existing implementations and existing data. Add a digraph <ie>; this means
a new decomposition, <ie> -> <i><e>, which means people need to revise
their software; then there's the fact that people won't encode data
consistently (count on it), and existing data won't magically update
itself; and because of issues along these lines, there will be situations
in which it will be *necessary* to encode using the decomposed form anyway
(e.g. domain names)... This one small addition has a big set of
consequences, but in the end the payoff is minimal or none. Multiply that
hundreds of times over for all of the other digraphs and precomposed
characters out there in other languages (and with perhaps a couple of
thousand languages in the world currently written with Latin script, there
are *lots*), because your two digraphs are just part way down a long queue
of very similar requests. That's what the fuss is about.
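
To make the inconsistent-data problem concrete, here is a minimal sketch
using a digraph character that is already encoded, U+0133 LATIN SMALL
LIGATURE IJ, which has a compatibility decomposition to <0069, 006A>. It
is only an illustration of the general problem, not a statement about how
a Maltese <ie> would be specified; the helper name same_text and the
sample strings are made up for the example.

```python
import unicodedata

# U+0133 is an existing precomposed digraph with a compatibility
# decomposition to <0069, 006A>. The two strings below look alike
# to a reader but are different code point sequences.
precomposed = "\u0133s"   # typed with the single digraph character
sequence = "ijs"          # typed as two ordinary letters

def same_text(a: str, b: str) -> bool:
    """Compare two strings after compatibility normalization (NFKC).

    Hypothetical helper: every program that searches, sorts, or
    compares such data needs an equivalent step once a precomposed
    digraph exists alongside the plain sequence.
    """
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

print(precomposed == sequence)           # raw comparison fails
print(same_text(precomposed, sequence))  # matches only after normalizing
```

Any software that skips the normalization step treats the two spellings
as different words, which is the situation every existing implementation
would be put in by a newly added digraph.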

At this point, UTC has a default position: no new characters of this sort
will be accepted. If a case is convincing enough, there can always be
exceptions to the default, and the line of reasoning you've presented is
the right one: viz. "there are issues in the behaviour of this writing
system that cannot be adequately dealt with unless we add this new
character." But the arguments have to be convincing, and other approaches
to dealing with the problem have to be explored.
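
As one example of such an approach, a digraph can be treated as a unit at
a higher level while the data stays encoded as a plain sequence. The
sketch below is only a rough illustration: the function name
maltese_units is made up, the digraph list is my assumption (written here
with h-bar, U+0127, for 'gh'), and greedy matching is a simplification,
since it cannot tell a true digraph from an accidental i+e sequence.

```python
# Assumed digraph inventory for illustration: "g" + U+0127 (h with stroke)
# and "ie". The text remains ordinary letter sequences.
DIGRAPHS = ("g\u0127", "ie")

def maltese_units(word: str, digraphs=DIGRAPHS) -> list[str]:
    """Split a word into letter units, keeping listed digraphs together.

    Hypothetical helper: matching is greedy and case-insensitive, and the
    original casing of the input is preserved in the returned units.
    """
    units = []
    i = 0
    while i < len(word):
        for d in digraphs:
            chunk = word[i:i + len(d)]
            if chunk.lower() == d:
                units.append(chunk)   # digraph found: keep it as one unit
                i += len(d)
                break
        else:
            units.append(word[i])     # ordinary single letter
            i += 1
    return units

print(maltese_units("g\u0127ien"))
```

A sort, a letter count, or a cursor-movement routine could then work on
these units without any new code point being added.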

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


