Re: Plane 14 redux (was: Same language, two locales)

Date: Sun Sep 03 2000 - 01:09:48 EDT


>> I don't think UTR#7 should be making any normative reference to any
>> system of language identifiers. Unicode is providing a set of
>> characters; it should be up to some other protocol to specify how
>> those will be used.
>I have heard this claim before, and the strong impression I get (please
>correct me if I am wrong, Peter) is that the writer really doesn't like
>Plane 14 language tags and wants to discourage their use.

I'm not a big fan of them, but I'm not dead set against them. I don't plan
to use them, but I don't care if others want to use them.

But the problems with UTR#7 making a normative reference to a particular
system for language identification are (a) that such systems get revised
(RFC 1766 will become obsolete before long), and (b) that UTR#7 does so in
the absence of any given context or application (apart from saying that
the text is plain text). What if, as you suggest, someone in a given
context would
rather use ISO 639-2? The Unicode Consortium shouldn't care, and that
person's data shouldn't be deemed non-conformant to the Unicode Standard
simply because they used ISO 639-2 rather than RFC 1766. The Unicode
Consortium should only care that *characters* get used in a particular way.
Mandating one system of identifiers is kind of like a UTR specifying that
"color" must be spelled without a "u": it makes rules about how characters
can be combined in areas that have nothing to do with the properties of the
characters themselves.

>However, I don't think it's OK to water down the mechanism to the point
>where it becomes useless for those who do see value in it. If any old
>text is allowed after U+E0001 LANGUAGE TAG, then English (e.g.) could
>be represented not only by "en", "en-us", "en-gb" and the like, but by
>"e", "eng", "English", "anglais", and so many others that no parser
>could ever hope to recognize all the variants. The result would be
>that none would bother to try.

But you already said that no protocol requires them, so even if you specify
a particular system of identifiers there's no guarantee that a parser will
recognize any of them (let alone all of them).

>UTR #7 falls into a somewhat different category from other Unicode
>mechanisms. If it should not specify a system of language identifiers,
>then what "other mechanism" do you propose to take on this
>responsibility? If you say, "use higher-level protocols such as HTML
>or XML," then your needs are already met, because you can already do
>this today in HTML or XML (although I believe they too specify ISO 639
>and 3166). But if you really want to provide language identification
>in PLAIN UNICODE TEXT -- not marked-up or rich text -- then Plane 14
>tags are the only way to fly.

Let's understand something. Language tags composed of plane 14 characters
are a form of markup, and I'd say that a document that contains them isn't
strictly speaking plain text. It's just that the markup is done in a way
that's different from other, more familiar markup mechanisms.
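
To make the markup point concrete, here is a minimal Python sketch, assuming
the UTR #7 layout in which U+E0001 LANGUAGE TAG is followed by tag characters
that mirror printable ASCII at an offset of U+E0000. The function names
(encode_language_tag, split_tags) are hypothetical and not taken from any
library. Note that the encoding itself accepts any ASCII string as the tag,
which is why the choice of identifier system is a separate question.

```python
LANGUAGE_TAG = 0xE0001               # U+E0001 LANGUAGE TAG
TAG_BASE = 0xE0000                   # tag character = TAG_BASE + ASCII code
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E
CANCEL_TAG = 0xE007F                 # U+E007F CANCEL TAG


def encode_language_tag(tag: str) -> str:
    """Spell `tag` in Plane 14 tag characters, prefixed by LANGUAGE TAG."""
    if not tag or any(not (0x20 <= ord(c) <= 0x7E) for c in tag):
        raise ValueError("tag must be non-empty printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in tag)


def split_tags(text: str):
    """Separate the embedded language tags from the visible text.

    Returns (visible_text, list_of_tag_strings). Tag characters that do
    not follow a LANGUAGE TAG are dropped from the visible text.
    """
    visible, tags, current = [], [], None
    for ch in text:
        cp = ord(ch)
        if cp == LANGUAGE_TAG:
            if current is not None:
                tags.append("".join(current))
            current = []             # start collecting a new tag
        elif TAG_FIRST <= cp <= TAG_LAST and current is not None:
            current.append(chr(cp - TAG_BASE))
        else:
            if current is not None:  # any other character ends the tag
                tags.append("".join(current))
                current = None
            if not (TAG_FIRST <= cp <= TAG_LAST or cp == CANCEL_TAG):
                visible.append(ch)
    if current is not None:
        tags.append("".join(current))
    return "".join(visible), tags


marked = encode_language_tag("en-US") + "colour"
text, tags = split_tags(marked)
```

Stripping the tag characters recovers the untagged text, much as stripping
elements would for HTML or XML; the only difference is that the markup lives
in its own block of code points instead of being delimited by "<" and ">".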

>As I have said before, I think UTR #7 should be strengthened so that it
>not only specifies the format of language tags, but does so by referring
>*directly* to ISO 639 and 3166 instead of through RFC 1766,

Well, if you're going to require a particular system, changing from RFC
1766 to ISO 639 & 3166 would be a step in the wrong direction. There are
already problems to be overcome with RFC 1766, and even with an extension
that might include ISO 639-2. Moving to a more limited system would only
aggravate some of the problems. (I'll be presenting on this topic next week
- come and hear, if you're interested.)

- Peter

Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
