RE: lowercased Unicode language tags ? (was: ISO 15924)

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Mon May 03 2004 - 12:13:44 CDT


Doug Ewell wrote:

> this. (I'll try to find a URL for RFC 3066bis so you can download and
> read it.)

The URLs for "RFC3066bis" are...

Official version:

  http://www.ietf.org/internet-drafts/draft-phillips-langtags-02.txt

In HTML format:

  http://www.inter-locale.com/ID/draft-phillips-langtags-02.html

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Doug Ewell
> Sent: Monday, May 03, 2004 8:33 AM
> To: Unicode Mailing List
> Cc: Philippe Verdy
> Subject: Re: lowercased Unicode language tags ? (was: ISO 15924)
>
>
> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>
> > Note that I'm focussing on problems that may arise from RFC 3066.
> > There's no problem in fact with ISO 639, ISO 3166 or ISO 15924
> > in isolation. The problem is clearly in the ambiguous syntax of RFC 3066
> > once modified to include optional script codes followed by optional
> > country+region code.
>
> The syntax is not ambiguous, in either RFC 3066 or its proposed
> successor (called RFC 3066bis). RFC 3066 does not allow script codes
> except in registered combinations, so they can be looked up. In RFC
> 3066bis, script codes come BEFORE country codes, and ISO 3166-2 region
> codes are not allowed anyway. If you read the documents you will see
> this. (I'll try to find a URL for RFC 3066bis so you can download and
> read it.)
>
> > OK, suppose now that one requires a hyphen between a country and
> > region code. Isn't there some region code with 4 letters in ISO
> > 3166-2 that may collide with ISO 15924 codes? I have a partial list of
> > ISO 3166-2, most codes are 1 or 2 letters or digits.
>
> They can be up to 3 letters or digits long. Again, though, even if they
> could be 4 letters long, there would be no collision because even in RFC
> 3066bis, you cannot have a country code followed by a script code.
>
> > All ambiguities could be avoided if an updated RFC 3066 with script
> > codes says that lettercase is significant for the distinction of
> > ISO 15924 script codes and ISO 3166 country/area codes.
>
> Not needed.
>
> > Still, ISO 3166-3 contains 4-letter codes as well which have legal use.
> > Are they allowed in RFC 3066 language tags?
>
> No, they are not. I said that already. There is no need to encode
> "Myanmar, as spoken in the region formerly known as Burma but now known
> as Myanmar" separately from "Myanmar, as spoken in Myanmar."
>
> > All the new combinations cause a problem when one wants to support all
> > the forms:
> >
> > <languagecode>-<COUNTRYCODE>
> > <languagecode>-<ScriptCode>
> > <languagecode>-<Scriptcode>-<COUNTRYCODE>
> > <languagecode>-<COUNTRYCODE>-<SUBCOUNTRYCODE>
>
> Even your example, which includes an impossible case, shows that there
> is no ambiguity. Let's break it down by number of characters, ignoring
> the leading <languagecode>:
>
> -2
> -4
> -4-2
> -2-1, -2-2, or -2-3
>
> Where's the ambiguity?
>
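
The length breakdown above can be turned into a rough sketch. This is only an illustration of the forms discussed in this thread, not the full RFC 3066bis grammar and not the LTag parser: the function name classify_subtags and the allow_subdivision flag are hypothetical, and the subdivision branch exists only to mirror the fourth (impossible) form in Philippe's list. Subtags are labeled purely by position and length, with case ignored on input.

```python
def _is_ascii_alpha(s):
    # True only for non-empty strings of ASCII letters.
    return s.isascii() and s.isalpha()


def classify_subtags(tag, allow_subdivision=False):
    """Label the subtags of a language tag by position and length alone.

    Hypothetical helper covering only these forms:
      language, language-Script, language-REGION, language-Script-REGION
    and, if allow_subdivision is True, a 1-3 character subtag after the
    region (not permitted by either RFC 3066 or RFC 3066bis).
    """
    subtags = tag.split("-")
    language, rest = subtags[0], subtags[1:]
    if not (_is_ascii_alpha(language) and 2 <= len(language) <= 3):
        raise ValueError("expected a 2- or 3-letter language code")

    labels = [("language", language.lower())]
    i = 0

    # A 4-letter subtag directly after the language is a script code.
    if i < len(rest) and len(rest[i]) == 4 and _is_ascii_alpha(rest[i]):
        labels.append(("script", rest[i].title()))
        i += 1

    # A 2-letter subtag after that is a country code.
    if i < len(rest) and len(rest[i]) == 2 and _is_ascii_alpha(rest[i]):
        labels.append(("region", rest[i].upper()))
        i += 1
        # Illustration only: a 1-3 character subtag following a region.
        if (allow_subdivision and i < len(rest)
                and 1 <= len(rest[i]) <= 3
                and rest[i].isascii() and rest[i].isalnum()):
            labels.append(("subdivision", rest[i].upper()))
            i += 1

    if i < len(rest):
        raise ValueError("unexpected subtag: " + rest[i])
    return labels
```

Because the script slot comes before the region slot, a tag such as "en-US-Latn" is rejected rather than guessed at, and no table of registered codes is consulted at any point.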
> > It's impossible, in a parser, to distinguish them without compiling a
> > list of allowed code (but the 3 ISO standards are open to
> > extensions...), unless case distinction is made mandatory in RFC 3066
> > language tags.
>
> I will soon be building a parser that will distinguish them without any
> problem, regardless of case. If you want, I'll send it to you when I'm
> done. (It'll be part of the new version of LTag that will support RFC
> 3066bis, if and when that is approved.)
>
> > In that case, the Unicode 4 TUS specification that says that language
> > tags should be lowercased would be nonconforming in the context of
> > RFC 3066 language tags where case distinction is important, as soon as
> > an optional script code can be used now as a subtag.
> >
> > If Unicode does not want to change the legacy use of lowercased ISO
> > 3166 country/region codes, an exception could
> > be made so that the ISO 15924 script code will NOT be lowercased but
> > specified with its normative titlecased form.
>
> None of this is needed. Please find an example that conforms to the
> syntax of either RFC 3066 or RFC 3066bis that exhibits ambiguity. If
> you can find even one, I will humbly apologize.
>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
>


