Plane 14 redux (was: Same language, two locales)

From: Doug Ewell
Date: Sat Sep 02 2000 - 14:58:13 EDT

Peter Constable wrote:

>> The revision will also provide ways to use 3166-2 country-
>> subdivision codes and (draft) ISO 15924 script codes in language
>> tags.
> I don't think there is a consensus on use of script codes.

Possibly not. I was reading a draft which may be changed before

>> Naturally, the revised version will not be called RFC 1766, but will
>> be assigned a new number. I don't know if UTR #7 will be updated to
>> refer to the new RFC when it is published (I think it should be).
> I don't think UTR#7 should be making any normative reference to any
> system of language identifiers. Unicode is providing a set of
> characters; it should be up to some other protocol to specify how
> those will be used.

I have heard this claim before, and the strong impression I get (please
correct me if I am wrong, Peter) is that the writer really doesn't like
Plane 14 language tags and wants to discourage their use.

It's fine not to like them. No protocol I know of requires them, and
you are certainly free not to use them and to ignore them. They are
quite easy to strip out of incoming Unicode text (start stripping when
you reach U+E0001 and finish when you reach any character not in the
range U+E0020 to U+E007F).
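
The stripping rule in the parentheses above can be sketched in Python
roughly as follows. This is only an illustration of that rule, not code
from UTR #7; the function name strip_language_tags is made up for the
example, and tag characters that appear without a preceding U+E0001 are
left untouched, since the rule as stated does not cover them.

```python
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG
TAG_FIRST = 0xE0020           # first tag character
TAG_LAST = 0xE007F            # last tag character (U+E007F CANCEL TAG)


def strip_language_tags(text: str) -> str:
    """Drop every run that starts at U+E0001 and continues through
    characters in the range U+E0020..U+E007F (hypothetical helper)."""
    kept = []
    stripping = False
    for ch in text:
        if stripping and TAG_FIRST <= ord(ch) <= TAG_LAST:
            continue              # still inside the tag: discard
        stripping = False         # any other character ends the tag
        if ch == LANGUAGE_TAG:
            stripping = True      # a new tag starts here: discard it too
            continue
        kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    # Build "en-us" out of tag characters (ASCII code point + 0xE0000).
    tag = LANGUAGE_TAG + "".join(chr(0xE0000 + ord(c)) for c in "en-us")
    print(strip_language_tags(tag + "Hello, world"))
```

The character that ends a tag is ordinary text and is kept, so the
surrounding content passes through unchanged.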

However, I don't think it's OK to water down the mechanism to the point
where it becomes useless for those who do see value in it. If any old
text is allowed after U+E0001 LANGUAGE TAG, then English, for example,
could be represented not only by "en", "en-us", "en-gb" and the like,
but by "e", "eng", "English", "anglais", and so many others that no
parser could ever hope to recognize all the variants. The result would
be that no one would bother to try.
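
To make the point concrete, here is a rough Python sketch of how a
constrained tag format lets a parser map several spellings onto one
language, while free-form names fall through. Everything named here is
an assumption for illustration: the helper names are invented, and
KNOWN_PRIMARY is a tiny hand-picked stand-in for the ISO 639 two-letter
code list, not a real registry. The only outside fact relied on is that
each tag character is its ASCII counterpart plus 0xE0000.

```python
TAG_OFFSET = 0xE0000
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG

# Illustrative subset only; a real parser would load the full code list.
KNOWN_PRIMARY = {"en", "fr", "de", "ja"}


def encode_language_tag(tag: str) -> str:
    """Return U+E0001 followed by the tag-character form of an ASCII tag."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("tag must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in tag)


def decode_language_tag(encoded: str) -> str:
    """Inverse of encode_language_tag: recover the ASCII tag text."""
    if not encoded.startswith(LANGUAGE_TAG):
        raise ValueError("missing U+E0001 LANGUAGE TAG")
    body = encoded[1:]
    if not all(0xE0020 <= ord(c) <= 0xE007E for c in body):
        raise ValueError("non-tag character inside tag")
    return "".join(chr(ord(c) - TAG_OFFSET) for c in body)


def primary_language(tag: str):
    """Return the primary subtag if it is a known code, else None."""
    primary = tag.lower().split("-", 1)[0]
    return primary if primary in KNOWN_PRIMARY else None


if __name__ == "__main__":
    for spelling in ["en", "en-us", "en-gb", "e", "English", "anglais"]:
        text = decode_language_tag(encode_language_tag(spelling))
        print(spelling, "->", primary_language(text))
```

With a fixed code list, "en", "en-us" and "en-gb" all resolve to the
same primary language, and the parser can simply reject the rest rather
than guess.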

UTR #7 falls into a somewhat different category from other Unicode
mechanisms. If it should not specify a system of language identifiers,
then what "other mechanism" do you propose to take on this
responsibility? If you say, "use higher-level protocols such as HTML
or XML," then your needs are already met, because you can already do
this today in HTML or XML (although I believe they too specify ISO 639
and 3166). But if you really want to provide language identification
in PLAIN UNICODE TEXT -- not marked-up or rich text -- then Plane 14
tags are the only way to fly.

As I have said before, I think UTR #7 should be strengthened so that it
not only specifies the format of language tags, but does so by referring
*directly* to ISO 639 and 3166 instead of through RFC 1766, and it
should make the format normative to the Technical Report (not, of
course, to Unicode proper) instead of "suggesting" it. Obviously, my
viewpoint runs directly opposite to Peter's. I look forward to a
lively and informed debate on this.

-Doug Ewell
 Fullerton, California
