Re: FW: RE: Last Call: Language Tagging in Unicode Plain Text to Prop

From: Kenneth Whistler
Date: Fri Jul 10 1998 - 16:17:23 EDT

Chen Qifan responded to Rick McGowan's reply:

> [Chen, Qifan] It does, but to a limited extent. Let me explain why.
> First, it has the requirement that not all text can be used for the
> tagging purpose. But we are talking about Unicode. Why should one
> subset of characters in it be more important than others?

The reason is that the tag characters are specified to be used for
language tagging, and the standard they refer to for the content
of language tags is RFC 1766, which only uses a subset of ASCII
characters to express language tags.

In the future, Plane 14 tag characters might be used to express other
types of tags than language tags, but even in that case, they would
likely be restricted to some subset of ASCII characters. This is because
tag syntaxes are almost always deliberately limited to the "portable"
character set (cf. POSIX), so that they can be expressed using nearly
any encoding of any encoded character set. Cf. the syntax of HTML
tags, for example.

The Plane 14 tag characters are deliberately *not* intended for
general annotation of text. Any general annotation scheme *should*
make use of any and all Unicode characters, and should obviously not
be restricted to ASCII values.

> Second, allowing all Unicode characters in tags (excluding BEGIN/CANCEL
> tag char) does not impose heavy overhead to distinguish tags from
> non-tags. A single state is enough and it can be done in a few lines
> of code.

There is a vast difference between stateful and stateless processing.
It is true that maintaining a single state takes only a few lines of
code, but the mere existence of that state changes how string handling
has to be done. In particular, it complicates processing a string in
pieces through buffers, since each piece can only be interpreted if
the state at the end of the previous piece is known.
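To make the buffer problem concrete, here is a small Python sketch of
the single-state scheme. The BEGIN and END delimiters are hypothetical
stand-ins (private use code points chosen only for this example), and
strip_tags_stateful is an illustrative name, not anything from the
proposal.

```python
BEGIN = "\uE000"  # hypothetical stand-in for a BEGIN tag delimiter
END = "\uE001"    # hypothetical stand-in for a CANCEL/end tag delimiter


def strip_tags_stateful(chunk: str, in_tag: bool = False):
    """Remove delimited tags from one buffer; return (text, state at end)."""
    out = []
    for ch in chunk:
        if ch == BEGIN:
            in_tag = True
        elif ch == END:
            in_tag = False
        elif not in_tag:
            # Ordinary text outside any tag is kept.
            out.append(ch)
    return "".join(out), in_tag


text = "ab" + BEGIN + "fr" + END + "cd"
first, second = text[:4], text[4:]   # the buffer boundary falls inside the tag

kept, state = strip_tags_stateful(first)
with_state, _ = strip_tags_stateful(second, state)
without_state, _ = strip_tags_stateful(second)

print(kept + with_state)     # tag content removed
print(kept + without_state)  # tag content leaks into the text
```

The loop itself is indeed only a few lines. The cost is that every
caller handing over a buffer must also carry the state along, or the
second buffer is misread.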

The Plane 14 tag character proposal is written so that all Plane 14
tag characters can be ignored or skipped without maintaining a
processing state. That was one of the requirements of the design: the
tag characters must not disrupt existing processing by Unicode
implementations.
Adding a required state for determining whether a particular character
was or was not a tag character would break existing implementations.
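A sketch of what stateless skipping looks like, in Python. It assumes
the tag characters occupy the range U+E0000..U+E007F, as in the
proposal as later published; the function names are illustrative only.

```python
TAG_FIRST = 0xE0000  # assumption: start of the Plane 14 tag character range
TAG_LAST = 0xE007F   # assumption: end of the Plane 14 tag character range


def is_tag_character(ch: str) -> bool:
    """Decide from the code point alone; no surrounding context is needed."""
    return TAG_FIRST <= ord(ch) <= TAG_LAST


def strip_tag_characters(chunk: str) -> str:
    """Drop every tag character from a buffer, wherever the buffer starts."""
    return "".join(ch for ch in chunk if not is_tag_character(ch))


# "ab", then LANGUAGE TAG followed by tag characters for "fr", then "cd"
sample = "ab" + "\U000E0001\U000E0066\U000E0072" + "cd"

# Splitting the text at any point gives the same result as no split at all.
for cut in range(len(sample) + 1):
    pieces = strip_tag_characters(sample[:cut]) + strip_tag_characters(sample[cut:])
    assert pieces == strip_tag_characters(sample)

print(strip_tag_characters(sample))
```

Each character identifies itself as a tag character, so a buffer can be
filtered correctly no matter where it was cut.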

> Third, what happens in the future if non-ASCII characters are needed in
> tags? How do we meet that requirement?

If a tag syntax emerges that a) contains a repertoire beyond the
portable character set and b) must be used for some reason inline
in Internet protocols in the same way that some protocols require
the use of language tags, then the addition of tag characters to
cover such a tag syntax could be addressed on a case-by-case basis.
At the moment, there is no such confluence of requirements, so there
is no need to add such characters now.

--Ken Whistler

> To me, adding tagging characters to Unicode is a good idea. But the
> proposed solution is not general enough and a simpler solution exists.
> Hope this explains the idea.
> --qifan
