L2/05-020R

Reasons for Enhancing RFC 3066

RFC 3066 and its predecessor, RFC 1766, define language tags for use on the Internet. Language tags are necessary for many applications, ranging from cataloging content to computer processing of text. The RFC 3066 standard for language tags has been widely adopted in various protocols and text formats, including HTML, XML, and CLDR, as the best means of identifying languages and language preferences.

RFC 3066 was adopted in January 2001. In the four years since, issues with the design of language tags and the standards that underlie them have become apparent. A new Internet-Draft, draft-phillips-langtags, has been developed to address these concerns.

This draft proposes enhancements to RFC 3066. Because their names are unwieldy, Internet-Drafts that propose to obsolete an older standard are usually referred to using the old number plus the suffix "bis" (Latin for "twice", used here to mean "second"). This document therefore refers to the draft as RFC 3066bis.

Because revisions to RFC 3066 have such broad implications, it is important to understand the reasons for modifying the structure of language tags and the design implications of the proposed replacement.

Problems

RFC 3066bis addresses a number of issues that implementers of language tags have faced in recent years:

The stability, accessibility, and ambiguity issues are crucial. Currently, because of changes in underlying ISO standards, a valid RFC 3066 language tag may become invalid (or have its meaning change) at a later date. With much of the world's computing infrastructure dependent on language tags, this is simply unacceptable: it invalidates content that may have an extensive shelf-life. In this specification, once a language tag is valid, it remains valid forever.

RFC 3066 Language Tags: A brief survey

Tags defined by RFC 3066 take two forms. Most tags are formed using an ISO 639-1 (two-letter) or ISO 639-2 (three-letter) language code, optionally followed by an ISO 3166 country code. Tags formed in this manner are not individually registered, and anyone can use such a combination of codes to identify their language preferences or the language of some piece of content. Because this system allows a broad range of tags to be formed by reference to the underlying standards, these tags are referred to as generative in nature. The generative system is very powerful and allows content authors and others to form and use very expressive tags without the need to engage in a long and arduous registration process. Examples of such tags are:

While it is possible to generate tags that do not identify any likely real-world content, such as Aleut as used in Belgium, tags of this nature do not represent a serious problem. Consider the case of a database that can identify people by mode of transportation and commute distance. It is not a problem that one could compose a query for people who walk to work with a distance greater than 100 kilometers, even though it is extremely unlikely for any results to be returned.

There are more serious problems with the RFC 3066 definition of generative tags, however. The ISO 639 and ISO 3166 standards are not freely available and evolve over time. For example, ISO 3166 has withdrawn codes in the past and, worse, then reassigned them to a different country altogether. As a result, it is difficult for implementers to obtain a correct list of codes and then ensure interoperability with other implementations of language tags.

The other way to form an RFC 3066 tag is via registration with IANA. Tags registered with IANA identify a specific language, dialect or variation. Unlike the generative tags, the registered values cannot be combined with other standard subtags to form additional tags that are more descriptive. Examples of such tags are:

Registration, besides being a long and arduous process, also presents a variety of problems for implementers. Although the tags are freely available, most implementations do not support them because they do not fit neatly into the generative system. Special logic is required to handle these tags, especially when performing language negotiation or fallback. In addition, many of the registered tags end up deprecated: the IANA registration process is less opaque and time-consuming than registering a language with the ISO 639 MA/RA has historically been, so languages are often registered with IANA first. Eventually ISO 639 does catch up and assign the language a code, resulting in overlapping tag choices. Implementations must then deal with the implications of multiple valid tags identifying what is essentially the same content.

But most problematic is the lack of a relationship to the generative mechanism. Since each variation of a tag must be separately registered, language variations with a broad range of valid uses require an enormous number of registrations. For example, there are 8 registrations to deal with minor spelling reforms in the German language, and these registrations cover just three countries where German is the majority language. They cover none of the countries where German is not the majority language, including those where it is widely spoken. Variations in languages with a broader diffusion (such as Chinese) may require 20 or more registrations to gain full coverage, sometimes of important distinctions.

The registration process, in fact, was designed with the assumption that each language tag is atomic in nature; that one is not supposed to "look inside" a tag such as de-CH-1996 to find the region, for example. RFC 3066 describes a rudimentary matching scheme based on language ranges which brings the alleged opacity of the registered tags into question.
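The tension described above can be seen in a minimal sketch of the RFC 3066 range-matching rule: a language range matches a tag if it equals the tag, or equals a prefix of the tag that is immediately followed by a hyphen. The function name range_matches is chosen here for illustration and does not come from either specification.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Prefix matching of a language range against a tag, case-insensitively."""
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        # The wildcard range matches every tag.
        return True
    # Either an exact match, or a prefix that ends on a subtag boundary.
    return t == r or t.startswith(r + "-")


print(range_matches("de-CH", "de-CH-1996"))  # True
print(range_matches("de-C", "de-CH-1996"))   # False: prefix ends mid-subtag
```

A range such as de-CH finds the registered tag de-CH-1996 only because the matcher treats the tag as a sequence of subtags, which is exactly the "looking inside" that an atomic registered tag was supposed to rule out.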

Solving the Problems

This specification addresses each of these issues with a simple, elegant design that is compatible with existing language tags and implementations.

This compatibility exists on several levels. All language tags, both generative and registered, that were valid under RFC 3066 are still valid under this specification. In addition, and very importantly, language tags that are newly defined by this specification are compatible with the ABNF syntax, matching, parsing, and other mechanisms defined by RFC 3066.

Thus for an implementation of RFC 3066, all of the new tags defined by this specification are still in the form of valid registered tags, and will simply be dealt with in whatever fashion the implementation used to handle future registrations, those that were added to the registry after the implementation was created. In other words, tags formed under this specification that are unfamiliar to RFC 3066 implementations will be treated by those implementations as if they were registered tags from a future version of the 3066 registry.
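As a rough illustration of this compatibility, the RFC 3066 syntax (a primary subtag of one to eight letters, followed by any number of hyphen-separated subtags of one to eight letters or digits) can be written as a regular expression. The sketch below, with the hypothetical helper is_rfc3066_well_formed, shows that tags formed under the new draft still pass such a check.

```python
import re

# RFC 3066 syntax: 1-8 letters, then zero or more "-" + 1-8 letters/digits.
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_rfc3066_well_formed(tag: str) -> bool:
    """True if the whole string fits the RFC 3066 tag syntax."""
    return RFC3066_TAG.fullmatch(tag) is not None


for tag in ("de-CH-1996", "zh-Hans-CN", "sl-Latn-IT-x-xyzzy", "en_US"):
    print(tag, is_rfc3066_well_formed(tag))
```

Only en_US is rejected, because the underscore is not a permitted separator. The tags with script and private use subtags are accepted, so an older implementation would see them as ordinary, if unfamiliar, registered tags.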

Subtags and the Registry

The largest change in the specification is that it modifies the structure of the language tag registry. Instead of having to obtain lists of codes from five separate external standards (not all of which are easily available), the IANA registry will maintain a comprehensive list of valid subtags that can be used in the generative mechanism in a machine-parseable text format. This registry will continue to track the existing core standards and will start with the current list of valid codes. As future codes are assigned, the IANA registry will be updated to reflect the changes.
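To suggest what a machine-parseable registry buys an implementer, here is a minimal sketch of loading subtags from a text file. This document does not reproduce the registry format, so the layout used below (Field: value lines, with records separated by a line containing %%) and the three sample records are assumptions made purely for illustration.

```python
# Hypothetical sample data; the record layout is an assumption.
SAMPLE = """\
Type: language
Subtag: sl
Description: Slovenian
%%
Type: script
Subtag: Hans
Description: Han (Simplified variant)
%%
Type: region
Subtag: SG
Description: Singapore
"""


def parse_records(text):
    """Split the text into records, each a dict of field name to value."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "%%":          # record separator
            if current:
                records.append(current)
            current = {}
        elif line:
            name, _, value = line.partition(":")
            current[name.strip()] = value.strip()
    if current:                   # last record has no trailing separator
        records.append(current)
    return records


def subtags_by_type(records):
    """Index subtags by their declared type."""
    index = {}
    for record in records:
        index.setdefault(record["Type"], set()).add(record["Subtag"])
    return index


index = subtags_by_type(parse_records(SAMPLE))
print(sorted(index))  # ['language', 'region', 'script']
```

The point of the sketch is that a single file, read with a few lines of code, replaces lookups against several separately published standards.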

Having a separate registry allows IANA language tags to resolve ambiguity and stability problems with the underlying standards. Language tags formed today will be guaranteed to maintain their validity and meaning essentially forever, something that is not true today.

In addition, switching to a subtag registry changes the nature of registrations themselves. Instead of registering complete tags and therefore potentially having to register a very large number of them (complicating life for implementers and discouraging support for the registry), a single subtag can be generatively combined to form many useful tags.

For example, one registered tag today is zh-Hans, which represents Chinese written in the Simplified Chinese script. Under RFC 3066, only this exact tag is valid. Useful tags such as zh-Hans-SG (SG=Singapore) or zh-Hans-CN are not valid. By switching to a registry in which Hans is a registered subtag, any of these valid and useful tags can be formed generatively.
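The effect of registering a subtag rather than a whole tag can be sketched by composing tags from small lists of subtags. The generator generate_tags below is a hypothetical helper; it simply joins a language with an optional script and an optional region.

```python
from itertools import product


def generate_tags(languages, scripts, regions):
    """Yield language[-script][-region] for every combination.

    None stands for "subtag omitted", so the bare language is included.
    """
    for lang, script, region in product(languages, [None, *scripts], [None, *regions]):
        yield "-".join(part for part in (lang, script, region) if part)


tags = list(generate_tags(["zh"], ["Hans"], ["CN", "SG"]))
print(tags)
# ['zh', 'zh-CN', 'zh-SG', 'zh-Hans', 'zh-Hans-CN', 'zh-Hans-SG']
```

One script subtag and two region subtags already yield six tags. Under whole-tag registration, each tag containing Hans would have needed its own registration.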

In addition, the subtag registry will encourage implementers to support registered items, since the subtags will fit the generative mechanism and exception handling code will no longer be necessary.

To prevent the IANA language registry filling up with deprecated entries, rules have also been introduced to curb harmful registrations that should be handled by the various ISO maintenance and registration authorities (such as ISO 639). The point is not to usurp the ISO standard's authority and expertise but to make these standards accessible in a coherent way for implementers of language tags.

The new structure and registry allow implementations to determine much more about tags, even in the absence of registry information. This is important because at any given point in time there will be a mixture of implementations that have different snapshots of the registry. The new structure allows these implementations to interoperate effectively. In particular, the category of every subtag (language, region, script, etc.) can be determined without reference to the particular registry snapshot held by the implementation. This allows for much more robust implementations, and greater compatibility over time.
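A minimal sketch of this idea follows. The shape rules it uses (four letters for a script, two letters or three digits for a region, three letters for an extended language subtag, a single character for a singleton, anything else a variant) are assumptions adopted for illustration; the ABNF in the draft is authoritative. Grandfathered registered tags are ignored, and the input is assumed to be well-formed.

```python
def classify(tag):
    """Return (subtag, category) pairs using only position and shape."""
    parts = tag.split("-")
    private = parts[0].lower() == "x"   # a wholly private-use tag
    extension = False
    result = [(parts[0], "private use singleton" if private else "language")]
    for sub in parts[1:]:
        if private:
            cat = "private use"
        elif sub.lower() == "x":
            private = True
            cat = "private use singleton"
        elif len(sub) == 1:
            extension = True
            cat = "extension singleton"
        elif extension:
            cat = "extension"
        elif len(sub) == 4 and sub.isalpha():
            cat = "script"
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            cat = "region"
        elif len(sub) == 3 and sub.isalpha():
            cat = "extended language"
        else:
            cat = "variant"
        result.append((sub, cat))
    return result


print(classify("sl-Latn-IT-x-xyzzy"))
print(classify("de-CH-1996"))
```

No registry lookup occurs anywhere in the function, so two implementations holding different snapshots would still agree on which subtag is the script and which is the region.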

In addition, this specification makes it possible, for the first time, to effectively test whether an implementation conforms to the specification. The problem with RFC 3066 is that to determine the status of an implementation produced at a given point, one has to reconstruct the historical contents of each of the ISO standards and the historical contents of the registry. This is a time-consuming and error-prone process. The new registry provides a complete, easily parseable file that gives the precise set of valid subtags for any point in time.

Additional Subtag Sources

This specification introduces two additional international standards as sources for language tags.

ISO 15924 represents script codes. (The example above of Hans is from ISO 15924.) Writing system variations are often crucial to communicate, especially when selecting content using language negotiation. Addition of this standard will allow these distinctions to be formed generatively, rather than via individual registration.

UN M.49 represents region and country codes. The UN M.49 standard is used by ISO 3166 to determine what a country is. The UN M.49 codes are used by this specification in two ways. First, if ISO 3166 reassigns a country code formerly associated with one country to another country (as it did in 2003 with the CS code, formerly Czechoslovakia and now assigned to Serbia and Montenegro), then the UN M.49 code can be placed in the registry to preserve stability. Second, the UN M.49 standard defines regional codes for areas such as Central and South America, which can be useful in forming language tags for larger regions.

Future-Proofing: Private Use and Extensions

Because of the widespread use of language tags, it is potentially disruptive to have periodic revisions of the core specification, despite demonstrated need. This specification addresses this problem by fully specifying the valid syntax of language tags, while providing for future, unforeseen requirements. One of these mechanisms is the extlang subtag, which allows for future extensions of ISO 639, in particular the advent of ISO 639-3.

Private use subtags are another of these mechanisms. In RFC 3066, any tag that was not registered or wholly made up of generative subtags had to be a wholly private-use tag (such as x-en-US-myTag). Recipients are not allowed to infer any information from such a tag, except by private agreement, so implementations cannot extract useful non-private values from it. RFC 3066bis allows private use subtags to be added to a tag in a particular, prescribed manner.

Consider the IANA registered tag sl-nedis, which represents the Natisone dialect of Slovenian. The subtag sl is a valid ISO 639-1 code for Slovenian. Prior to its registration with IANA, if users wished to tag content as being in the Natisone dialect, they had two choices for language tags: sl and x-sl-nedis (or similar). The first tag does not meet the need of distinguishing the text from other varieties of Slovenian, while the second one does not convey the relationship to Slovenian to outside processors (a human might look at the tag and infer Slovenian, but the sl subtag doesn't necessarily represent that language).

Under this specification, if a new dialect of Slovenian were needed (let's call it the 'xyzzy' dialect), a tag such as sl-x-xyzzy can be used. In fact, a quite comprehensive amount of information can be communicated: sl-Latn-IT-x-xyzzy would represent Slovenian written using the Latin script as used in Italy with some additional private distinguishing information (which implementations of this specification can match algorithmically).
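The practical difference can be sketched with a hypothetical helper, split_private_use, that separates the publicly interpretable part of a tag from everything after the x singleton.

```python
def split_private_use(tag):
    """Return (public_part, private_subtags) for a tag.

    Everything from the first "x" subtag onward is private use and
    carries no meaning without a private agreement.
    """
    parts = tag.split("-")
    lowered = [p.lower() for p in parts]
    if "x" not in lowered:
        return tag, []
    i = lowered.index("x")
    return "-".join(parts[:i]), parts[i + 1:]


print(split_private_use("sl-Latn-IT-x-xyzzy"))  # ('sl-Latn-IT', ['xyzzy'])
print(split_private_use("x-sl-nedis"))          # ('', ['sl', 'nedis'])
```

For the new-style tag, a processor that knows nothing about xyzzy can still work with sl-Latn-IT. For the wholly private tag, the public part is empty, so nothing can be inferred.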

Note that RFC 3066 private use tags are still permitted and have the same information content and treatment as they did previously.

The extension mechanism also provides a way for independent RFCs to define extensions to language tags. These extensions have a very constrained, well-defined structure to prevent extensions from interfering with implementations of either RFC 3066 or of RFC 3066bis.

Matching and Language Negotiation

Content tagging is only one of the applications for language tags. The other major applications are querying for matches and content negotiation. RFC 3066 defines language ranges for use in content negotiation and querying and describes a very simple matching algorithm. RFC 3066bis maintains compatibility with this language negotiation scheme. A separate document (draft-phillips-langmatching) was created to deal with the topics of matching and ranges formerly contained in RFC 3066.

Well-Formed vs. Validating

Existing language tag processors already fall into two categories. There are language tag processors that check if language tags have the proper, well-formed, syntax, but which do not validate their content, and there are language tag processors that in addition validate and reject unrecognized tags. Each of these categories is appropriate to different implementations. For example, to process incoming tags that may have been formed under a future registry, an implementation may restrict itself to only checking well-formedness. Another implementation that allows users to generate tags may fully validate.
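The two categories can be contrasted in a short sketch. The syntax check uses the RFC 3066 pattern; the validating check additionally consults a registry snapshot. The snapshot here is hypothetical and deliberately tiny, and it is a flat set that ignores subtag categories, which a real validating processor would not do.

```python
import re

WELL_FORMED = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# Hypothetical registry snapshot, lowercased; an assumption for illustration.
SNAPSHOT = {"sl", "zh", "de", "latn", "hans", "it", "cn", "sg", "ch", "1996"}


def is_well_formed(tag):
    """Syntax check only: no knowledge of which subtags exist."""
    return WELL_FORMED.fullmatch(tag) is not None


def is_valid(tag):
    """Syntax check plus a lookup of every subtag in the snapshot."""
    if not is_well_formed(tag):
        return False
    for sub in tag.lower().split("-"):
        if sub == "x":
            return True           # private use subtags are not looked up
        if sub not in SNAPSHOT:
            return False
    return True


print(is_well_formed("tlh-Latn"), is_valid("tlh-Latn"))      # True False
print(is_well_formed("zh-Hans-SG"), is_valid("zh-Hans-SG"))  # True True
```

The first tag is accepted by the well-formedness check but rejected by validation because tlh is absent from this particular snapshot, which is the behavior one would want from a tag generator but not from a processor receiving tags formed under a later registry.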

RFC 3066bis clearly distinguishes these two possible classes of conformance, and provides an explicit, testable definition of each one.

Impact of the New Design on Existing Implementations

One concern that is crucial to acceptance of the new language tag design is how it works with existing implementations of RFC 3066 and how existing implementations will interact with implementations of the newer language tags.

It is important to recognize that all language tags that were valid under the existing RFC 3066 will remain valid, with their meanings intact, under the new specification. In fact, RFC 3066bis stabilizes these meanings so that existing implementations can be continued forward for as long as necessary. Content, regardless of its format, will remain valid, essentially forever.

As content and systems begin to make use of the new language tags by adopting the additional fields defined by this specification, there will be an impact on software and systems that expect only the older tags. The design of this specification was carefully created so that all of the new values that can be assigned fit the pattern for registered language tags under RFC 3066. Thus while existing implementations will not recognize the meaning in the tags, they will be able to process them as if they were unrecognized-but-well-formed registered tags.

One concern that arose in discussions of the design of RFC 3066bis is the placement of the script subtag in the generative mechanism. Under the draft, the script subtag is placed between the primary language and region subtags. Some implementers have noted that existing implementations are sometimes built to rely on the region code being in the second position. Clearly the new format tags won't be compatible with these tag processors from the point of view that processors won't be able to find the region information in these tags.

The design chosen was a tradeoff between immediate compatibility with implementations that already do not recognize the tag and the logical benefits of placing the script subtag in the second position. Generally, when searching for content, the script is more tightly bound to the primary language than the region or variation of the language is. Language tags, by design, have their subtags in order of increasing specificity or granularity.

In addition, although the new draft on matching acknowledges the possibility of alternate or advanced matching and negotiation strategies, it maintains the existing matching algorithm (by removing subtags from the right side of a language tag until a match is obtained), simply providing more detail on usage.
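The truncation behavior mentioned above can be sketched as follows. The function lookup is a hypothetical name, and the sketch omits any refinements the matching draft may define; it only removes subtags from the right until one of the available tags matches.

```python
def lookup(requested, available):
    """Return the available tag that best matches the requested tag.

    Subtags are removed from the right of the request until a
    case-insensitive match is found; None means nothing matched.
    """
    by_lower = {tag.lower(): tag for tag in available}
    parts = requested.lower().split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in by_lower:
            return by_lower[candidate]
        parts.pop()               # drop the rightmost subtag and retry
    return None


print(lookup("zh-Hans-SG", ["en", "zh", "zh-Hans"]))  # zh-Hans
print(lookup("de-CH-1996", ["de"]))                   # de
print(lookup("fr-CA", ["en"]))                        # None
```

Because subtags run from least to most specific, a request for zh-Hans-SG falls back to zh-Hans before zh, keeping the script distinction for as long as possible. This is the ordering argument made above for placing script ahead of region.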

Summary

The authors of this document (who are the authors of RFC 3066bis) have worked for the past eighteen months with a wide range of experts in the language tagging community to build consensus on a design for language tags that meets the needs and requirements of the user community. Language tags form a basic building block for natural language support in computer systems and content. The revision proposed in this specification addresses the needs of this community of users with a minimal impact on existing content and implementations, while providing a stable basis for future development, expansion, and improvement.