Re: RE: RE: Last Call: Language Tagging in Unicode Plain Text to Proposed

From: Rick McGowan (
Date: Fri Jul 10 1998 - 16:55:34 EDT

Qifan wrote:

> [Chen, Qifan] It does but to a limited extent. Let me explain why.
> First, it has the requirement that not all text are to be used for the
> tagging purpose.

Yes. Exactly! It was SPECIFICALLY DESIGNED this way. On purpose. In
other words, YES, we KNOW that it only allows a limited set of characters for
tags. It was DESIGNED that way! The mechanism is stateless, as John Cowan
has pointed out, and allows tags to be easily identified and stripped, etc.
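
To make the "easily identified and stripped" point concrete, here is a
minimal Python sketch. It assumes the tag characters occupy U+E0000 through
U+E007F in Plane 14; the function names are only illustrative and are not
part of the proposal. Note that each character is classified by its code
point alone, with no reference to anything before or after it.

```python
# Assumption: the tag characters live in the range U+E0000..U+E007F (Plane 14).
TAG_BLOCK_START = 0xE0000
TAG_BLOCK_END = 0xE007F


def is_tag_char(ch):
    """Return True if ch is a tag character, judged by code point alone."""
    return TAG_BLOCK_START <= ord(ch) <= TAG_BLOCK_END


def strip_tags(text):
    """Remove every tag character; no parser state is carried between characters."""
    return "".join(ch for ch in text if not is_tag_char(ch))
```

Because the test is per character, a receiver can strip tags from any
fragment of text, even one that begins in the middle of a tag.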

> But we are talking about Unicode. Why one subset of characters in it
> should be more important than others?

It is not intended as a general means of encoding metasyntax; it is for
encoding things like language tags, where the TAGS are defined using
restricted source sets. It's better to have the set of tag characters act
like something tangible, such as an alphabet like ASCII, rather than being
random bit sequences from Plane 14... (or cloning the whole of Unicode).
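
As a rough sketch of what "act like an alphabet like ASCII" means in
practice: assuming each tag character sits at a fixed offset of 0xE0000 from
its printable ASCII counterpart, and assuming U+E0001 is the character that
introduces a language tag, an ASCII tag such as "fr" maps one-to-one onto
tag characters. The helper names below are hypothetical.

```python
# Assumptions: tag characters mirror printable ASCII at a fixed offset, and
# U+E0001 introduces a language tag.
TAG_OFFSET = 0xE0000
LANGUAGE_TAG = "\U000E0001"


def encode_language_tag(lang):
    """Map an ASCII language tag such as 'fr' onto Plane 14 tag characters."""
    for ch in lang:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError("only printable ASCII may appear in a tag: %r" % ch)
    return LANGUAGE_TAG + "".join(chr(ord(ch) + TAG_OFFSET) for ch in lang)


def decode_tag_chars(tagged):
    """Recover the ASCII text carried by the tag characters in a string."""
    return "".join(
        chr(ord(ch) - TAG_OFFSET)
        for ch in tagged
        if 0xE0020 <= ord(ch) <= 0xE007E
    )
```

The encoder rejects anything outside printable ASCII, which reflects the
restricted source set rather than working around it.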

> A single state is enough and it can be done in a few lines of code.

That is still a state. It requires using some characters with a meaning
different from their normal meaning, in a state-ful way. That is not
acceptable for this purpose and was rejected in the design.
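
To illustrate the objection, here is a sketch of a purely hypothetical
delimiter scheme (not Qifan's actual proposal, and not anything from the
draft): ordinary characters between an "open" and a "close" marker are
treated as tag content. The delimiter values are invented for the example.

```python
# Hypothetical delimiters, invented only for this illustration.
TAG_OPEN = "\x01"
TAG_CLOSE = "\x02"


def strip_stateful(text):
    """Strip tag content, which requires remembering whether we are inside a tag."""
    out = []
    in_tag = False  # the single bit of state
    for ch in text:
        if ch == TAG_OPEN:
            in_tag = True
        elif ch == TAG_CLOSE:
            in_tag = False
        elif not in_tag:
            out.append(ch)
    return "".join(out)
```

Given the whole string "\x01en\x02hello" this yields "hello", but given only
the fragment "en\x02hello" it yields "enhello": the meaning of "e" and "n"
depends on text the receiver may never have seen. That dependence is the
state in question.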

> Third, what happens in the future if non-ASCII characters are needed in
> tags? How do we meet that requirement?

Nothing will happen. The set is fixed and is intended to remain that way --
a small set. This mechanism doesn't require CLONING the entire set of
Unicode characters, and doesn't require ADDING things to the tag set every
time Unicode/10646 is updated. It's a small fixed set of characters with NO
OTHER MEANING than as tag characters.

The point is NOT to provide a generalized means of encoding tags that can
contain anything and be used for any metasyntactic purpose; it is much more
restricted than that. It is not a FLAW that the set is restricted; it is a
specific design decision. This came out of a long technical discussion
between the Unicode Technical Committee and representatives from the IETF to
solve particular problems in an out-of-band way.

I don't have time to go into great detail about WHY it is the way it is. To
understand that, you must look at this in the context of Internet protocols.
I'd suggest you read the current and previous ACAP proposals, as well as
the MLSF document of Mark Crispin to understand the historical context from
which this proposal derives. Then you will probably understand why it is
this way.

