Re: lowercased Unicode language tags ? (was: ISO 15924)

From: Doug Ewell (dewell@adelphia.net)
Date: Mon May 03 2004 - 10:32:48 CDT


Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> Note that I'm focussing on problems that may arise from RFC 3066.
> There's no problem in fact with ISO 639, ISO 3166 or ISO 15924
> in isolation. The problem is clearly in the ambiguous syntax of RFC 3066
> once modified to include optional script codes followed by optional
> country+region code.

The syntax is not ambiguous, in either RFC 3066 or its proposed
successor (called RFC 3066bis). RFC 3066 does not allow script codes
except in registered combinations, so they can be looked up. In RFC
3066bis, script codes come BEFORE country codes, and ISO 3166-2 region
codes are not allowed anyway. If you read the documents you will see
this. (I'll try to find a URL for RFC 3066bis so you can download and
read it.)

> OK, suppose now that one requires a hyphen between a country and
> region code. Isn't there some region code with 4 letters in ISO
> 3166-2 that may collide with ISO 15924 codes? I have a partial list of
> ISO 3166-2, most codes are 1 or 2 letters or digits.

They can be up to 3 letters or digits long. Again, though, even if they
could be 4 letters long, there would be no collision because even in RFC
3066bis, you cannot have a country code followed by a script code.

> All ambiguities could be avoided if an updated RFC 3066 with script
> codes says that letter case is significant for the distinction of
> ISO 15924 script codes and ISO 3166 country/area codes.

Not needed.

> Still, ISO3166-3 contains 4 letter codes as well which have legal use.
> Are they allowed in RFC 3066 language tags?

No, they are not. I said that already. There is no need to encode
"Myanmar, as spoken in the region formerly known as Burma but now known
as Myanmar" separately from "Myanmar, as spoken in Myanmar."

> All the new combinations cause a problem when one wants to support all
> the forms:
>
> <languagecode>-<COUNTRYCODE>
> <languagecode>-<ScriptCode>
> <languagecode>-<Scriptcode>-<COUNTRYCODE>
> <languagecode>-<COUNTRYCODE>-<SUBCOUNTRYCODE>

Even your example, which includes an impossible case (the last form,
since ISO 3166-2 region codes are not allowed), shows that there is no
ambiguity. Let's break it down by number of characters, ignoring the
leading <languagecode>:

-2
-4
-4-2
-2-1, -2-2, or -2-3

Where's the ambiguity?
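
To make the breakdown concrete, here is a minimal Python sketch. It is
not LTag and is not taken from either RFC; the names SHAPES and classify
and the sample tags are made up for illustration. It assumes only the
four forms quoted above and looks at nothing but subtag lengths, so
case never enters into it. The last form is included only because it
appears in the quoted list; it is not allowed in RFC 3066bis.

```python
# Illustrative only: maps the lengths of the subtags that follow the
# language code to the name of the form, per the breakdown above.
SHAPES = {
    (2,): "language-country",
    (4,): "language-script",
    (4, 2): "language-script-country",
    # Hypothetical fourth form from the quoted list (1 to 3 characters).
    (2, 1): "language-country-subcountry",
    (2, 2): "language-country-subcountry",
    (2, 3): "language-country-subcountry",
}


def classify(tag):
    """Name the form of a tag from the lengths of its subtags alone."""
    subtags = tag.split("-")
    lengths = tuple(len(s) for s in subtags[1:])  # ignore the language code
    return SHAPES.get(lengths, "unrecognized")


if __name__ == "__main__":
    # Mixed case on purpose: the result depends only on lengths.
    for sample in ("en-US", "sr-latn", "SR-LATN-CS", "fr-ca-qc"):
        print(sample, "->", classify(sample))
```

Each length pattern maps to exactly one form, so a lookup keyed on
lengths is enough to tell the four apart.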

> It's impossible, in a parser, to distinguish them without compiling a
> list of allowed code (but the 3 ISO standards are open to
> extensions...), unless case distinction is made mandatory in RFC 3066
> language tags.

I will soon be building a parser that will distinguish them without any
problem, regardless of case. If you want, I'll send it to you when I'm
done. (It'll be part of the new version of LTag that will support RFC
3066bis, if and when that is approved.)

> In that case, the Unicode 4 TUS specification that says that language
> tags should be lowercased would be nonconforming in the context of
> RFC 3066 language tags where case distinction is important, as soon as
> an optional script code can be used now as a subtag.
>
> If Unicode does not want to change the legacy use of ISO 3166
> country/region codes converted to lowercase, an exception could
> be made so that the ISO 15924 script code will NOT be lowercased but
> specified with its normative titlecased form.

None of this is needed. Please find an example that conforms to the
syntax of either RFC 3066 or RFC 3066bis that exhibits ambiguity. If
you can find even one, I will humbly apologize.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/


