OFF-TOPIC: Proposed Successor to RFC 3066 (language tags)

From: Addison Phillips [wM]
Date: Wed Nov 19 2003 - 19:44:07 EST

    Hi Philippe,

    Thanks for the note.

    The announcement here was purely informational. This is off-topic for this list, so further comments really should be carried over to the discussion list for the proposal. Cross-posting with this list is a Bad Idea, IMHO. I have not cross-posted this note, to prevent any thread "over there" from escaping back to the Unicode list. I HAVE sent a response to your message to you privately, with a copy to that list.

    Thanks again for the comments.


    Addison P. Phillips
    Director, Globalization Architecture
    webMethods | Delivering Global Business Visibility
    Chair, W3C Internationalization (I18N) Working Group
    Chair, W3C-I18N-WG, Web Services Task Force

    Internationalization is an architecture.
    It is not a feature.

    > -----Original Message-----
    > From: Philippe Verdy
    > Sent: Wednesday, November 19, 2003 3:51 PM
    > Subject: Re: Proposed Successor to RFC 3066 (language tags)
    > From: Addison Phillips [wM]
    > > Please note that there is a discussion list for this topic at:
    > >
    > > While Mark and I welcome your comments here or privately, off-list, you
    > > can best be a part of the discussion by joining that list. Join the list
    > > by sending a request email to:
    > I note that the language tags proposal includes the following EBNF
    > productions for extensions that may be appended after the language code,
    > script code, region code, and variant code:
    > extensions = "-x" 1* ("-" key "=" value)
    > key = ALPHA *alphanum
    > value = 1* utf8uri
    > alphanum = (ALPHA / DIGIT)
    > utf8uri = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))
    > Under this new scheme, the following language tag may be valid:
    > "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
    > which here would mean: {
    > language="sr"; // Serbian
    > script="Latn"; // Latin
    > region="SP"; // Serbia-Montenegro
    > variant="2003";
    > extensions="-x"; {
    > href=""
    > version="1.0"
    > }
    > }
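As a rough illustration of how the quoted productions decompose the example tag, here is a minimal Python sketch. The function name `parse_extensions` and the regular expression are assumptions made for this illustration and are not part of the draft. The sketch only splits off the `-x` extension part and percent-decodes each value; it does not validate the language, script, region or variant subtags, nor check that the percent-escapes form well-formed UTF-8.

```python
import re
from urllib.parse import unquote

# Hypothetical helper, not taken from the draft.
# key   = ALPHA *alphanum
# value = 1*(ALPHA / DIGIT / "%" 2HEXDIG)
_PAIR = re.compile(
    r"-([A-Za-z][A-Za-z0-9]*)=((?:[A-Za-z0-9]|%[0-9A-Fa-f]{2})+)"
)


def parse_extensions(tag):
    """Return (prefix, {key: decoded value}) for a tag with "-x" extensions."""
    prefix, marker, rest = tag.partition("-x-")
    if not marker:
        return tag, {}          # no extension part at all
    rest = "-" + rest           # each pair is introduced by "-"
    pairs = {}
    pos = 0
    while pos < len(rest):
        match = _PAIR.match(rest, pos)
        if match is None:
            raise ValueError("malformed extension at offset %d" % pos)
        pairs[match.group(1)] = unquote(match.group(2))
        pos = match.end()
    return prefix, pairs


TAG = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
prefix, extensions = parse_extensions(TAG)
print(prefix)       # sr-Latn-SP-2003
print(extensions)   # {'href': 'http://www.iana.org/', 'version': '1.0'}
```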
    > However, the problem with that scheme is its new use of the characters "%"
    > and "=". A lot of applications were not expecting anything in this field
    > other than alphanumerics and "-", "_" or ".", so that the language tag
    > could safely be used without specific escaping within URIs (for example
    > in HTTP GET URLs) or as an option of a MIME type (I take a sample here,
    > which may not correspond to an existing option of the "text/plain" MIME
    > type):
    > Content-Type: text/plain; charset=UTF-8;
    > lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
    > This may break the compatibility of some parsers if such "extended language
    > tags" are found there, as there are two "=" signs within the value of the
    > "lang=" option.
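The parser concern can be sketched as follows. Both functions are hypothetical and written only for this illustration; they do not reproduce the behaviour of any particular MIME library. The first assumes exactly one "=" per parameter, which is the kind of assumption the extended tags would violate; the second splits on the first "=" only.

```python
TAG = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
HEADER = "text/plain; charset=UTF-8; lang=" + TAG


def naive_params(header_value):
    """Hypothetical parser that assumes one "=" per parameter."""
    params = {}
    for part in header_value.split(";")[1:]:
        # Unpacking fails with ValueError when the value itself contains "=".
        name, value = part.strip().split("=")
        params[name] = value
    return params


def tolerant_params(header_value):
    """Hypothetical parser that splits each parameter on the first "=" only."""
    params = {}
    for part in header_value.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        params[name] = value
    return params


try:
    naive_params(HEADER)
except ValueError as exc:
    print("naive parser failed:", exc)

print(tolerant_params(HEADER)["lang"] == TAG)
```

A parser written like `tolerant_params` would survive the extra "=" signs, but existing software written like `naive_params` would not, which is the compatibility risk described above.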
    > For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
    > through correctly, as the following would become possible and prone to
    > generate form data parsing errors:
    > http://www.anysite.domain/process-form.cgi?lang=sr-Latn-SP-2003-x-
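The GET case can be illustrated with Python's standard `urllib.parse` module. This is only a sketch of the escaping issue, using the example tag from earlier in the message; the parameter name `lang` follows the URL above, and nothing here is prescribed by the draft. Placing the tag in a query string as-is means a form parser decodes its percent-escapes once, so the receiver no longer sees the original tag. Encoding it first escapes "=" as `%3D` and "%" as `%25`, and the tag then survives the round trip.

```python
from urllib.parse import parse_qs, urlencode

TAG = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"

# Tag pasted into the query string without further escaping.
raw_query = "lang=" + TAG
lossy = parse_qs(raw_query)["lang"][0]
print(lossy == TAG)         # False: the %XX escapes were decoded in transit

# Tag escaped for use as a query value.
encoded_query = urlencode({"lang": TAG})
print(encoded_query)
restored = parse_qs(encoded_query)["lang"][0]
print(restored == TAG)      # True: the original tag comes back intact
```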

    I think it's quite strange that these extensions do not use the existing
    restricted character set to encode their content, instead of relying on "%"
    and "=". Why not use "_" instead of "=" and "." instead of "%", like this:
    (same meaning as the first example above).

    But at least this draft offers a good starting point to indicate locales
    more precisely.

    I fully support the new reference to the ISO-15924 standard for the script
    code, and for documenting the legal values of variant codes (either a year
    with possible era, or a registered tag), as well as clearly indicating that
    language codes should be the shortest ISO-639 codes (is it true for a few
    legacy languages which previously were coded with 3 letters and upgraded to
    2-letter codes, until there was a policy not to do it anymore in the

    Where this affects Unicode I don't know, except in its possible normative
    data tables which may contain future language code conditions, or in
    Language tags inserted in the Unicode encoded texts. Does Unicode need these
