Re: Proposed Successor to RFC 3066 (language tags)

From: Philippe Verdy
Date: Wed Nov 19 2003 - 18:51:03 EST


    From: Addison Phillips [wM]
    > Please note that there is a discussion list for this topic at:
    > While Mark and I welcome your comments here or privately, off-list, you
    > can best be a part of the discussion by joining that list. Join the list
    > by sending a request email to:

    I note that the language tags proposal includes the following EBNF
    productions for extensions that may be appended after the language code,
    script code, region code and variant code:

    extensions = "-x" 1* ("-" key "=" value)
    key = ALPHA *alphanum
    value = 1* utf8uri
    alphanum = (ALPHA / DIGIT)
    utf8uri = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))
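
    As a rough illustration, the productions above can be transcribed into a
    Python regular expression. This is only a sketch: the function name
    is_valid_extensions is hypothetical, ALPHA and DIGIT are assumed to be
    ASCII-only, hex digits are accepted in either case, and the sample strings
    in the usage note are made up for the demonstration (they are not taken
    from the draft).

```python
import re

# Direct transcription of the quoted productions.
# Assumptions: ASCII-only ALPHA/DIGIT, hex digits in upper or lower case.
ALNUM = r"[A-Za-z0-9]"                                   # alphanum = (ALPHA / DIGIT)
KEY = rf"[A-Za-z]{ALNUM}*"                               # key = ALPHA *alphanum
UTF8URI = rf"(?:{ALNUM}|(?:%[0-9A-Fa-f]{{2}}){{1,4}})"   # utf8uri
VALUE = rf"{UTF8URI}+"                                   # value = 1* utf8uri
EXTENSIONS = re.compile(rf"-x(?:-{KEY}={VALUE})+")       # "-x" 1* ("-" key "=" value)


def is_valid_extensions(suffix: str) -> bool:
    """Return True if the whole suffix matches the 'extensions' production."""
    return EXTENSIONS.fullmatch(suffix) is not None
```

    With these definitions, a made-up suffix such as "-x-note=caf%C3%A9" is
    accepted, while "-x" alone or "-x-1a=b" (key starting with a digit) is
    rejected.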

    Under this new scheme, the following language tag may be valid:
    which here would mean: {
        language="sr"; // Serbian
        script="Latn"; // Latin
        region="SP"; // Serbia-Montenegro
        extensions="-x"; {

    However, the problem with that scheme is its new use of the characters "%"
    and "=". Many applications were not expecting anything in this field other
    than alphanumerics and "-", "_" or ".", so that the language tag could
    safely be used without specific escaping within URIs (for example in HTTP
    GET URLs) or as an option of a MIME type (I take a sample here, which may
    not correspond to an existing option of the "text/plain" MIME type):

    Content-Type: text/plain; charset=UTF-8;

    This may break the compatibility of some parsers if such "extended language
    tags" are found there, as there are two "=" signs within the value of the
    "lang=" option.
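
    To make the concern concrete, here is a minimal sketch of how a naive
    parameter parser might fail. Everything in it is an assumption made for
    illustration: the two parser functions are hypothetical, the "lang" option
    is the same invented option as above, and the tag in HEADER_VALUE is a
    made-up example, not one from the draft.

```python
# Hypothetical media-type value carrying an extended language tag.
HEADER_VALUE = "text/plain; charset=UTF-8; lang=sr-Latn-x-note=caf%C3%A9"


def parse_params_naive(header_value: str) -> dict:
    """Naive parser: assumes every parameter contains exactly one '='."""
    _media_type, *params = [p.strip() for p in header_value.split(";") if p.strip()]
    result = {}
    for param in params:
        name, value = param.split("=")  # ValueError if the value contains '='
        result[name] = value
    return result


def parse_params_tolerant(header_value: str) -> dict:
    """Splits on the first '=' only, so the value may itself contain '='."""
    _media_type, *params = [p.strip() for p in header_value.split(";") if p.strip()]
    return dict(param.split("=", 1) for param in params)
```

    The naive version works for "text/plain; charset=UTF-8" but raises
    ValueError on HEADER_VALUE; only a parser written to split on the first
    "=" survives, and existing deployed parsers cannot be assumed to do so.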

    For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
    through correctly, as the following would become possible and would be
    prone to cause form data parsing errors:
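
    As a hedged illustration of this point (not a reconstruction of any
    example from the draft), the sketch below uses Python's standard
    urllib.parse module with a made-up tag and a hypothetical "lang" query
    parameter. It shows one possible failure mode: a standard query-string
    parser silently decodes the "%" escapes inside an unescaped tag, so the
    receiver no longer sees the tag that was sent.

```python
from urllib.parse import parse_qs, urlencode

TAG = "sr-Latn-x-note=caf%C3%A9"  # made-up extended tag, for illustration only

# Tag pasted into the query string without any escaping.
raw_query = "lang=" + TAG + "&q=test"

# Tag escaped properly: '=' becomes %3D and '%' becomes %25.
safe_query = urlencode({"lang": TAG, "q": "test"})

# What a receiving application would read back in each case.
raw_lang = parse_qs(raw_query)["lang"][0]
safe_lang = parse_qs(safe_query)["lang"][0]

print(raw_lang == TAG)   # False: the escapes inside the tag were decoded
print(safe_lang == TAG)  # True: the tag survives the round trip
```

    Stricter form parsers may instead reject the extra "=" outright; either
    way, the tag only gets through intact if the sender escapes it.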


    I think it's quite strange that these extensions do not use the existing
    restricted character set to encode them, instead of relying on "%" and "=".
    Why not use "_" instead of "=" and "." instead of "%", like this:
    (same meaning as the first example above).
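
    A minimal sketch of that substitution, assuming the rest of the tag never
    contains "_" or "." (which holds for the key and value productions quoted
    earlier). The function names and the sample tag are hypothetical and are
    not part of the draft.

```python
import re
from urllib.parse import quote

# Characters many existing applications already accept in this field.
RESTRICTED = re.compile(r"[A-Za-z0-9._-]+")


def to_restricted(tag: str) -> str:
    """Rewrite '=' as '_' and '%' as '.' so no URI escaping is needed."""
    return tag.replace("=", "_").replace("%", ".")


def from_restricted(tag: str) -> str:
    """Inverse mapping; assumes '_' and '.' appear only as substitutes."""
    return tag.replace("_", "=").replace(".", "%")


TAG = "sr-Latn-x-note=caf%C3%A9"      # made-up tag in the draft's syntax
restricted = to_restricted(TAG)        # "sr-Latn-x-note_caf.C3.A9"

print(RESTRICTED.fullmatch(restricted) is not None)  # only restricted characters
print(quote(restricted, safe="") == restricted)      # URL-encoding leaves it unchanged
print(from_restricted(restricted) == TAG)            # the mapping is reversible
```

    The point of the design is that the rewritten tag needs no escaping in a
    URL or a MIME parameter, while still carrying the same key and value.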

    But at least this draft offers a good starting point to indicate locales
    more precisely.

    I fully support the new reference to the ISO 15924 standard for the script
    code, and the documenting of the legal values of variant codes (either a
    year with possible era, or a registered tag), as well as the clear
    indication that language codes should be the shortest ISO 639 codes (is
    this true for a few legacy languages which previously were coded with 3
    letters and upgraded to 2-letter codes, until there was a policy not to do
    it anymore in the

    Where this affects Unicode, I don't know, except possibly in its normative
    data tables, which may contain future language code conditions, or in
    language tags inserted in Unicode-encoded texts. Does Unicode need these

    This archive was generated by hypermail 2.1.5 : Wed Nov 19 2003 - 19:42:04 EST