Re: A certain committee?

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Oct 21 2003 - 00:29:27 CST


Jill Ramonsky wrote:

> Who were this "certain committee"? And why did they have so much
> control over the Unicode Consortium that they could force the
> introduction of a new character block that nobody had ever previously
> used? What was this "abuse of UTF-8" of which you speak? Indeed, what
> is an "abuse" of UTF-8? What does the phrase even mean?

The so-called "Multi-Lingual String Format" was described in an
Internet-Draft, draft-ietf-acap-mlsf-01.txt, written by Chris Newman of
Innosoft in June 1997. It was an attempt to define a lightweight,
inline language tagging protocol for ACAP (Application Configuration
Access Protocol) using invalid UTF-8 sequences, such as <E0 E5 EE> for
"en".

The protocol was described as "another layer of encoding on top of
UTF-8," but since there was no signature mechanism or other way for
UTF-8 processors to distinguish MLSF from ordinary corrupted UTF-8
text, it was effectively a non-standard extension of UTF-8.
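
To make the problem concrete, here is a rough Python sketch of how a
conforming UTF-8 decoder sees the <E0 E5 EE> sequence mentioned above.
The helper name is_valid_utf8 is my own invention for illustration, and
nothing here reproduces the actual MLSF encoding rules from the draft;
it only shows that a decoder rejects these bytes just as it would
reject any other corrupted input.

```python
# The three bytes quoted above as the MLSF form of "en".
MLSF_EN = bytes([0xE0, 0xE5, 0xEE])


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xE0 announces a three-byte sequence, but 0xE5 and 0xEE are not
# continuation bytes (those must fall in the range 0x80-0xBF).
print(is_valid_utf8(MLSF_EN))               # the MLSF bytes
print(is_valid_utf8("en".encode("utf-8")))  # plain ASCII "en"

# A lenient decoder substitutes U+FFFD, exactly as it would for
# line noise; nothing marks the bytes as a deliberate language tag.
print(repr(MLSF_EN.decode("utf-8", errors="replace")))
```

The strict decode fails and the lenient decode yields only replacement
characters, so a receiving process has no way to tell an MLSF tag from
damage in transit.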

At the time this was proposed, UTF-8 was still new and not very widely
adopted, and there was apparently great concern within the UTC that this
non-standard extension would undermine the stability of the UTF-8 format
(just as the tacit approval of non-shortest UTF-8 sequences was
criticized as a security hole years later). Plane 14 tags were
introduced as an equally lightweight countermeasure to persuade the ACAP
people to abandon MLSF in favor of an official tagging mechanism that
used real (but out-of-the-way) Unicode characters and did not break the
rules of UTF-8.
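
For comparison, here is a small Python sketch of the Plane 14
approach, assuming the usual layout of the tag block: U+E0001 LANGUAGE
TAG followed by tag characters that mirror printable ASCII at an
offset of 0xE0000. The function name plane14_language_tag is mine, not
anything defined by Unicode, and the sketch does no validation of the
language tag itself.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII 0x20-0x7E


def plane14_language_tag(lang: str) -> str:
    """Hypothetical helper: build a Plane 14 tag string for lang."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language tag must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


tagged = plane14_language_tag("en") + "Hello"
encoded = tagged.encode("utf-8")

# Each tag character is an ordinary four-byte UTF-8 sequence, so a
# strict decoder accepts the text and round-trips it unchanged.
print(encoded.hex(" "))
print(encoded.decode("utf-8") == tagged)
```

The cost is four bytes per tag character instead of one, but every
byte sequence is well-formed UTF-8, and a process that does not care
about language tags can simply ignore the Plane 14 characters.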

> How can you possibly add a block of characters to Unicode and then say
> "the UTC sincerely hopes that they never get used at all"?
> (Particularly when there are still people around whose actual real
> characters are still not being added).

First, the comparison between adding this special-purpose tagging
mechanism and adding "actual real characters" that are part of some
writing system is disingenuous. Nobody ever made a choice between
encoding Tai Lue, Rejang, or Plane 14 tags.

Second, there are those of us (outside the UTC) who do feel that Plane
14 language tags have a valid use, since not all text that may benefit
from language tagging is necessarily in a marked-up format. But the
writing is on the wall, and "those of us" have given up our battle.

> If this "certain committee" had intended to (falsely) declare
> something as UTF-8 and then embed something like:
>
> <XXX>lang=en-uk<YYY>
>
> where <XXX> and <YYY> are invalid UTF-8 byte-sequences, then so what?
> That would simply mean that "a certain committee"'s code wouldn't then
> interoperate with the rest of the world. Why is that any business of
> the UC's?

Because they were publishing their mechanism as an Internet-Draft, which
would soon have graduated to being an RFC, and then other groups might
have picked it up. Again, if you think back to 1997, the most commonly
referenced definition of UTF-8 itself was an RFC.

> Hell, if only the KLI had thought to implement the Klingon alphabet in
> invalid UTF-8 sequences - then maybe the UC would have added Klingon
> characters just to shut them up, saying things like "it's not really a
> script", and "the UTC sincerely hopes that they never get used at
> all". Could have saved an awful lot of time!

With all respect, this completely misrepresents the intent and working
process of the UTC.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/


