Re: Plane 14 redux (was: Same language, two locales)

From: Peter_Constable@sil.org
Date: Sun Sep 03 2000 - 13:47:32 EDT


>> But, the problems with UTR#7 making a normative reference to a
>> particular system for language identification are (a) that systems
>> get revised (RFC 1766 will become obsolete before long),
>
>This is one reason I have suggested making reference to ISO standards
>in the past, rather than RFCs. When ISO standards get revised, they
>retain the number and name of the earlier version, so documents that
>reference those standards are *automatically* updated to the new ISO
>revision.

Only if carefully worded. This is one of the problems with RFC 1766 that
has led to the current efforts to provide a replacement.

>I admit that my later comment, "UTR #7 falls into a somewhat different
>category from other Unicode mechanisms," touches on a gray area. I
>would suggest, however, that the characters we are discussing are not
>the normal ASCII alphabet from U+0020 to U+007E, but rather the special
>tag characters from U+E0020 to U+E007E. Unlike ASCII, these characters
>are for use only within tags, so it might be legitimate for UTR #7 to
>specify exactly how they are to be used.

I'm inclined to say that it makes perfect sense for UTC to provide a set of
characters for use in tags only, and to also specify the formal structure,
i.e. the syntax, for their use. But there's no reason why these couldn't be
used for any kind of tagging, if someone wanted, not just language tags. It
doesn't make sense for UTC to specify the namespace for every tagging
system that these might ever get used for. And so I'm inclined to say that
it's reasonable for Unicode to say, "Here are these tagging characters, and
here are examples of how they might be used," but then leave it up to other
standards to specify actual cases of usage. Just as Unicode provides a
character set, but doesn't define XML. Just as Unicode provides math
characters, but doesn't define MathML. If the plane 14 tag characters were
requested by another body, it should be up to that other body to specify a
usage for them, methinks.
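
To make the "characters plus syntax" idea concrete, here is a minimal
sketch of how a language tag could be spelled with the plane 14 characters.
It assumes the scheme under discussion: each tag character sits at the
ASCII code point plus 0xE0000, and U+E0001 LANGUAGE TAG introduces the tag.
The function names are hypothetical, chosen only for illustration; nothing
here decides which namespace (RFC 1766 or otherwise) the tag value comes
from.

```python
# Sketch only: the helper names are made up for this example.
TAG_OFFSET = 0xE0000          # tag characters mirror ASCII at this offset
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG, introduces a language tag


def to_tag_chars(ascii_text):
    """Map printable ASCII (U+0020..U+007E) to U+E0020..U+E007E."""
    out = []
    for ch in ascii_text:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError("not a printable ASCII character: %r" % ch)
        out.append(chr(cp + TAG_OFFSET))
    return "".join(out)


def language_tag(lang):
    """Build a plane 14 language tag for an identifier such as 'fr'."""
    return LANGUAGE_TAG + to_tag_chars(lang)


tagged = language_tag("fr") + "bonjour"
print([hex(ord(c)) for c in tagged[:3]])
```

The syntax (introducer plus tag characters) is all the sketch fixes; the
meaning of "fr" is left to whatever identification system the caller uses.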

>> Let's understand something. Language tags composed of plane 14
>> characters are a form of markup, and I'd say that a document that
>> contains them isn't strictly speaking plain text. It's just that the
>> markup is done in a way that's different from other, more familiar
>> markup mechanisms.
>
>Arghhmmgmhmmm. Let's look at the definition of "plain text" in the
>Glossary of TUS 3.0 (p. 993):
>
>"Computer-encoded text that consists *only* of a sequence of code
>values from a given standard, with no other formatting or structural
>information. Plain text interchange is commonly used between computer
>systems that do not share higher-level protocols."
>(original emphasis)
>
>Plane 14 characters are code values from the Unicode Standard (or will
>be as soon as a suitable version of Unicode refers to them). They do
>not employ any formatting or other mechanism external to the Unicode
>Standard. If Plane 14 characters can be considered markup, then so can
>directional overrides, layout controls, and even C0 controls like CR
>and LF. Of course, nobody would ever consider CR and LF *not* to be
>plain text, so where do we draw the line? I suggest simply observing
>the line the Unicode Consortium has drawn.

But by the same argument, an XML document might be considered plain text.
It consists only of a sequence of character codes from some character
encoding standard, with no additional data structure (looking at it in
terms of byte sequences, and how byte sequences are interpreted). But
obviously that's not how we think of an XML file - some of the characters
are interpreted as content, and some of the characters are interpreted as
meta-information. The same is true of a file of character data that
includes plane 14 language tags. After all, they are called tags, and tags
are generally equated with markup.
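
The content versus meta-information split can be shown in a few lines. This
is a sketch under the assumption that the tag block occupies U+E0000 to
U+E007F and that tag characters map back to ASCII by subtracting 0xE0000;
the function name is hypothetical.

```python
# Sketch only: separate() is an illustrative name, not part of any standard.
TAG_BLOCK_START = 0xE0000
TAG_BLOCK_END = 0xE007F
TAG_OFFSET = 0xE0000


def separate(text):
    """Split text into (content, tag_value).

    Characters in the tag block are treated as meta-information; those that
    mirror printable ASCII (U+E0020..U+E007E) are decoded back to ASCII.
    Everything else is treated as content.
    """
    content = []
    markup = []
    for ch in text:
        cp = ord(ch)
        if TAG_BLOCK_START <= cp <= TAG_BLOCK_END:
            if 0xE0020 <= cp <= 0xE007E:
                markup.append(chr(cp - TAG_OFFSET))
        else:
            content.append(ch)
    return "".join(content), "".join(markup)


sample = "\U000E0001\U000E0066\U000E0072bonjour"
print(separate(sample))
```

A process that wants only the text has to filter the stream this way, much
as it would have to parse out element tags from an XML file.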

>I am intensely interested, but unfortunately my work schedule probably
>won't permit any travel at present...

Too bad. The paper will be online soon.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


