Re: lowercased Unicode language tags ? (was:ISO 15924)

From: John Cowan (cowan@ccil.org)
Date: Mon May 03 2004 - 07:30:25 CDT


Philippe Verdy scripsit:

> The problem with language tags is that using ISO 3166 codes (which are
> defined by administrative divisions rather than by linguistic regions)
> is a workaround. For language tags, these unstable administrative
> codes do not match well with language usage.

True enough. The main benefit (and this is IMHO, not official) is that
orthography often differs by nationality. en-gb primarily means not RP
and the other dialects of English used in the U.K., but English written
with the orthography that is customary in the U.K.

I have available a first draft of xx-yy codes that I think are actually
useful, those representing languages that are official (or unofficial
but pervasive, like English in the U.S.) in more than one country
crossed with the countries. Thus en-us is useful in this sense, but
nv-dk is not (since Navajo is confined to the U.S.) nor is en-dk (there
being no distinctively Danish orthography of English). This list does
not represent languages that cross national boundaries but are not
official/pervasive in at least two of them, unfortunately.

(If anyone wants to review this list, please let me know!)
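
A minimal sketch of how such a list could be generated, assuming the
rule is "keep a language only if it is official or pervasive in at
least two countries, then pair it with each of those countries". The
function name and the tiny input table are hypothetical and only
restate the examples above; they are not the draft list itself.

```python
def useful_tags(official_in):
    """Build xx-yy tags for languages official/pervasive in 2+ countries.

    official_in maps a lowercase language code to the set of lowercase
    country codes where that language is official or pervasive.
    """
    tags = []
    for lang in sorted(official_in):
        countries = official_in[lang]
        # A language confined to one country needs no country subtag.
        if len(countries) >= 2:
            tags.extend(f"{lang}-{country}" for country in sorted(countries))
    return tags


# Illustrative data only: "gb" is the ISO 3166 code for the United Kingdom.
OFFICIAL_IN = {
    "en": {"us", "gb"},
    "nv": {"us"},  # Navajo: confined to the U.S., so no nv-xx tag results
}

print(useful_tags(OFFICIAL_IN))  # ['en-gb', 'en-us']
```

Note that en-dk never appears, because dk is not among the countries
listed for English.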

> For correct linguistic classification, it seems then that the Ethnologue
> classification would offer a better model, if it proposed an appropriate
> encoding and not only a classification by groups and names.

The Ethnologue language tags (which are also three-letter in form) are
being aligned with ISO 639-2 to remove all incompatibilities, and then
will be proposed as ISO 639-3 (leaving ISO 639-2 as a subset of them).
Presumably they will then be incorporated into some successor to
RFC 3066. In any case, there is no intention of classifying languages,
merely of encoding them, as it is far from clear that classifying
languages actually helps you much. Suppose you request a document in
the Irish language from some language-classification sensitive server:
the most nearly related language actually available might be Welsh, but
in practice falling back to English would be far better.

> So RFC 3066 language tags (not ISO 639 language _codes_) are for
> now a nightmare to handle, with the problem made even more serious by
> the inclusion of ISO 3166 which was clearly not done for language
> classification but for administrative and legal usages...

In practice, two methods are found for handling RFC 3066 codes:
either treat them as opaque strings, or when you find a code you don't
understand, remove subtags from the right until you do understand it.
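
The second method can be sketched roughly as follows. This is only an
illustration: the function name and the set of "understood" tags are
hypothetical, and the tag is lowercased before comparison because
RFC 3066 tags are case-insensitive.

```python
def resolve_tag(tag, known):
    """Return the longest prefix of `tag` (by whole subtags) found in `known`.

    Subtags are removed from the right, one at a time, until the
    remaining tag is understood. Returns None if nothing matches.
    """
    subtags = tag.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in known:
            return candidate
        subtags.pop()  # drop the rightmost subtag and try again
    return None


# Hypothetical set of tags this application understands.
KNOWN = {"en", "en-us", "fr"}

print(resolve_tag("en-US", KNOWN))      # en-us
print(resolve_tag("en-GB-oed", KNOWN))  # en
print(resolve_tag("ga-IE", KNOWN))      # None
```

When nothing matches at all, the caller still has to choose a default;
the sketch leaves that decision to the application.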

-- 
Si hoc legere scis, nimium eruditionis habes.

