Re: Plane 14 language tags

From: Doug Ewell (dewell@compuserve.com)
Date: Thu Jun 29 2000 - 10:38:22 EDT


Kenneth Whistler <kenw@sybase.com> wrote:

> Rather, it is just a suggestion that since case is not significant in
> the language tags, it is slightly preferable to do the early
> "normalization" (i.e. case folding to lowercase, in this instance),
> rather than emitting arbitrarily mixed case tags and distributing the
> case-folding burden to all the interpreters of the tags.

but I think the "burden" is both trivial and unavoidable, as Murray
Sargent <murrays@microsoft.com> pointed out:

> ... if you know ch is in an ASCII range (0 - 0x7F or 0xE0000 -
> 0xE007F), you can do a case insensitive compare as quickly as a case
> sensitive one. The problem with assuming lower case is that the input
> might not all be in lower case.

Since the TR doesn't require but merely "recommends" that the entire
tag be lowercased, and since it explicitly states that this convention
"would not be required for Unicode conformance," any application that
reads language tags will have to interpret them in a case-insensitive
manner anyway. (However, I give in; I will use "en-us" instead of
"en-US" when generating language tags.)

Perhaps the TR should be modified to remove this ambivalence, either by
*requiring* the entire tag to be lowercased or by avoiding the issue
altogether.
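
Until then, a reader has to fold case itself. As a minimal sketch of how
small that burden is (Python chosen only for illustration; the names
TAG_BASE, fold_tag_char, tags_equal and to_tag are hypothetical, not from
the TR), the code below folds only the tag capitals U+E0041 - U+E005A,
rather than applying "| 0x20" to every tag character, so that punctuation
such as the tag hyphen is left alone:

```python
# Plane 14 tag characters mirror 7-bit ASCII at this offset (U+E0000 - U+E007F).
TAG_BASE = 0xE0000


def fold_tag_char(cp: int) -> int:
    """Return the lowercase form of a tag capital letter; pass others through."""
    if 0xE0041 <= cp <= 0xE005A:      # TAG LATIN CAPITAL LETTER A..Z
        return cp | 0x20              # same bit trick as ASCII case folding
    return cp


def tags_equal(a: str, b: str) -> bool:
    """Compare two strings of tag characters without regard to case."""
    if len(a) != len(b):
        return False
    return all(fold_tag_char(ord(x)) == fold_tag_char(ord(y))
               for x, y in zip(a, b))


def to_tag(ascii_text: str) -> str:
    """Helper for the example: shift ASCII text into the Plane 14 tag range."""
    return "".join(chr(TAG_BASE + ord(c)) for c in ascii_text)


if __name__ == "__main__":
    print(tags_equal(to_tag("en-US"), to_tag("en-us")))
    print(tags_equal(to_tag("en-US"), to_tag("en-GB")))
```

The comparison costs one range check per character, which is the sense in
which I call the burden trivial.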

Antoine Leca <Antoine.Leca@renault.fr> wrote:

> Also note that Plane 14 tags are stored in surrogate form when UTF-16
> is used (which happens quite often on some well known operating
> system). So they are stored using (WCHAR_T[2])({0xDB40, 0xDC00 -
> 0xDC7F}). So the | with 0x20 should be done *only* on the second
> surrogate codepoint, because if done on the first, result will be
> offset by 0x8000 (0xE8020 - 0xE803F and 0xE8060 - 0xE807F).

Well, of course. To do anything else would indicate a complete lack of
understanding of surrogates (language tag or otherwise).
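
For anyone who wants to check the arithmetic, here is a small sketch
(again in Python, with a hypothetical helper name decode_pair) that
applies the standard UTF-16 surrogate-pair formula to one tag character
and shows the 0x8000 offset Antoine describes when the high surrogate is
ORed as well:

```python
def decode_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+E0055 TAG LATIN CAPITAL LETTER U as a surrogate pair.
high, low = 0xDB40, 0xDC55

# Folding only the low surrogate yields the lowercase tag letter.
correct = decode_pair(high, low | 0x20)

# Folding both halves sets bit 5 of the high surrogate, which is
# shifted left by 10 on decoding: 0x20 << 10 == 0x8000.
wrong = decode_pair(high | 0x20, low | 0x20)

if __name__ == "__main__":
    print(hex(correct), hex(wrong), hex(wrong - correct))
```

The stray bit lands in the high half of the code point, so the result is
no longer a tag character at all.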

Michael Everson <everson@egt.ie> wrote:

> It seems to me that the Plane 14 tags shouldn't have both uppercase
> and lowercase if you aren't going to let people use the uppercase.

Good point. Maybe the uppercase won't be discouraged for other types
of tags, if any.

> (It seems to me that the Plane 14 tags were a very silly idea, but
> oh well.)

Section 4.12 addresses this nicely. Plane 14 tags are not required by
any process, and are certainly not expected in HTML; but they do deter
certain objectors in Asia from writing RFCs and proposals claiming that
Unicode/10646 is "unusable" and recommending the use of ISO 2022
instead.

Brendan Murray <brendan_murray@lotus.com> responded to Murray Sargent's
comment about folding case with "ch | 0x20":

> Except, of course, in Turkey where the lowercase of 'I' is not 'i'
> and the uppercase of 'i' is not 'I'.

True, but irrelevant to Plane 14 tags, which duplicate only the 7-bit
ASCII range, where 'I' and 'i' are forced to get along with each other.

Thanks to all for your comments. Has anyone actually used these tags
yet?

-Doug Ewell
 Fullerton, California


