RE: FW: RE: Last Call: Language Tagging in Unicode Plain Text to Prop

From: Chen, Qifan (qifan.chen@austx.tandem.com)
Date: Fri Jul 10 1998 - 16:42:20 EDT


> -----Original Message-----
> From: kenw@sybase.com [SMTP:kenw@sybase.com]
> Sent: Friday, July 10, 1998 3:17 PM
> To: qifan.chen@austx.tandem.com
> Cc: unicode@unicode.org; kenw@sybase.com; paf@swip.net
> Subject: Re: FW: RE: Last Call: Language Tagging in Unicode Plain
> Text to Prop
>
>
> Chen Qifan responded to Rick McGowan's reply:
>
> > [Chen, Qifan] It does, but to a limited extent. Let me explain why.
> >
> > First, it requires that only some characters may be used for tagging.
> > But we are talking about Unicode. Why should one subset of its
> > characters be more important than the others?
>
> The reason is that the tag characters are specified to be used for
> language tagging, and the standard they refer to for the content
> of language tags is RFC 1766, which only uses a subset of ASCII
> characters to express language tags.
>
        Allowing any Unicode character in a tag does not exclude
        ASCII. Why should this new Unicode standard be limited by RFC 1766?
        Shouldn't the Unicode standard be better than others, since it has
        the better, larger character set to work with?
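
For readers following the thread, here is a minimal sketch of the mapping Ken
describes: each ASCII character of an RFC 1766 language tag is shifted into
Plane 14 and prefixed with the LANGUAGE TAG character. It assumes the code
points as they were eventually published (U+E0001 LANGUAGE TAG, U+E0020 to
U+E007E for the ASCII-mirroring tag characters). The function names are
illustrative only and are not part of the proposal.

```python
TAG_BASE = 0xE0000            # tag character = TAG_BASE + ASCII code
LANGUAGE_TAG = "\U000E0001"   # introduces a language tag


def encode_language_tag(tag: str) -> str:
    """Return the Plane 14 form of an RFC 1766 style tag such as 'fr-CA'."""
    for ch in tag:
        # RFC 1766 tags use only ASCII letters, digits and hyphen.
        if not (ch.isascii() and (ch.isalnum() or ch == "-")):
            raise ValueError(f"not an RFC 1766 tag character: {ch!r}")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(ch)) for ch in tag)


def decode_tag_chars(tagged: str) -> str:
    """Recover the ASCII text carried by the tag characters in 'tagged'."""
    return "".join(
        chr(ord(ch) - TAG_BASE)
        for ch in tagged
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


if __name__ == "__main__":
    encoded = encode_language_tag("fr-CA")
    print([f"U+{ord(ch):05X}" for ch in encoded])
    print(decode_tag_chars(encoded))
```

Under this scheme a tag containing non-ASCII text has no representation at
all, which is the limitation being debated here.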

> In the future, Plane 14 tag characters might be used to express other
> types of tags than language tags, but even in that case, they would
> likely be restricted to some subset of ASCII characters. This is because
> tag syntaxes are almost always deliberately limited to the "portable"
> character set (cf. POSIX), so that they can be expressed using nearly
> any encoding of any encoded character set. Cf. the syntax of HTML
> tags, for example.
>
        Same point. Again, I do not think ASCII should be excluded,
        nor should any other characters.

> The Plane 14 tag characters are deliberately *not* intended for
> general annotation of text. Any general annotation scheme *should*
> make use of any and all Unicode characters, and should obviously not
> be restricted to ASCII values.
>
        With this modification, Unicode tagging can handle
        RFC 1766 and also enable a more general tagging scheme.

> >
> > Second, allowing all Unicode characters in tags (excluding the
> > BEGIN/CANCEL tag characters) does not impose heavy overhead for
> > distinguishing tags from non-tags. A single state is enough, and it
> > can be done in a few lines of code.
>
> There is a vast distinction between stateful processing and stateless
> processing. It is true that maintaining a single state can be done
> with just a few lines of code, but the existence of a state makes a
> significant difference for how string handling has to be done, and
> complicates string processing in pieces through buffers, for example.
>
        There is a point here. But I do not think adding a simple state
        during protocol processing is that significant.
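
To make the disagreement concrete, here is a rough sketch of the stateful
alternative: a tag is everything between two delimiter characters and may
contain any Unicode text. The delimiters TAG_BEGIN and TAG_END and the class
name TagStripper are hypothetical choices for this illustration; the thread
does not define them.

```python
TAG_BEGIN = "\U000E0001"  # hypothetical opening delimiter (assumption)
TAG_END = "\U000E007F"    # hypothetical closing delimiter (assumption)


class TagStripper:
    """Removes delimited tags from text that arrives in separate buffers."""

    def __init__(self) -> None:
        # The one bit of state: are we currently inside a tag?
        self.in_tag = False

    def feed(self, chunk: str) -> str:
        out = []
        for ch in chunk:
            if self.in_tag:
                if ch == TAG_END:
                    self.in_tag = False
                # anything else inside a tag is dropped
            elif ch == TAG_BEGIN:
                self.in_tag = True
            else:
                out.append(ch)
        return "".join(out)


if __name__ == "__main__":
    chunk1 = "ab" + TAG_BEGIN + "\u65e5\u672c"   # tag opens, buffer ends
    chunk2 = "\u8a9e" + TAG_END + "cd"           # tag continues, then closes
    stripper = TagStripper()
    print(stripper.feed(chunk1) + stripper.feed(chunk2))
    # A fresh stripper has lost the state and treats tag content as text.
    print(TagStripper().feed(chunk2))
```

The loop itself is indeed only a few lines. The cost Ken points to is that
`in_tag` has to travel with the text: the second buffer cannot be interpreted
correctly on its own.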
         
> The Plane 14 tag character proposal is written so that all Plane 14
> tag characters can be ignored or skipped without maintaining a
> processing state. That was one of the requirements of the design,
> since it does not disrupt existing processing by Unicode implementations.
> Adding a required state for determining whether a particular character
> was or was not a tag character would break existing implementations.
>
        I suppose these implementations cannot handle
        the proposed tag characters today. If that is true, those
        implementations have to be modified anyway.
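
For comparison, a minimal sketch of the stateless property Ken describes:
because every tag character lives in its own range, each character can be
classified by its code point alone. The range used below (U+E0000 to U+E007F)
is the block in which the tag characters were eventually published; the
function name is illustrative only.

```python
def strip_plane14_tags(text: str) -> str:
    """Drop every Plane 14 tag character, looking at one character at a time."""
    return "".join(ch for ch in text if not (0xE0000 <= ord(ch) <= 0xE007F))


if __name__ == "__main__":
    # LANGUAGE TAG followed by tag characters for "en", then ordinary text.
    tagged = "\U000E0001\U000E0065\U000E006E" + "hello"
    print(strip_plane14_tags(tagged))
    # Cutting the text at any position gives the same result piece by piece.
    print(all(
        strip_plane14_tags(tagged[:i]) + strip_plane14_tags(tagged[i:])
        == strip_plane14_tags(tagged)
        for i in range(len(tagged) + 1)
    ))
```

No flag has to be carried from one buffer to the next, so text can be cut at
any character boundary and each piece filtered independently.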

> >
> > Third, what happens in the future if non-ASCII characters are needed in
> > tags? How do we meet that requirement?
>
> If a tag syntax emerges that a) contains a repertoire beyond the
> portable character set and b) must be used for some reason inline
> in Internet protocols in the same way that some protocols require
> the use of language tags, then the addition of tag characters to
> cover such a tag syntax could be addressed on a case-by-case basis.
> At the moment, there is no such confluence of requirements, so there
> is no need to add such characters now.
>
> --Ken Whistler
>
> >
> > To me, adding tagging characters to Unicode is a good idea. But the
> > proposed solution is not general enough, and a simpler solution exists.
> >
> > Hope this explains the idea.
> >
> > --qifan


