Re: lowercased Unicode language tags ? (was:ISO 15924)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 03 2004 - 06:38:29 CDT


Did I mix Canaries and Baleares? I'll have to look again for the articles
related to Catalan, which spoke about its 4 main dialects. I was probably
remembering one being in the Canaries, but you may be right if this is really
the Baleares.
May I reformulate the examples?

The problem with language tags is that using ISO 3166 codes (which are defined
by administrative divisions rather than by linguistic regions) is a workaround.
These unstable administrative codes do not match language usage well, and there
is evidence that:
- administrative regions that have a code in ISO 3166 cover several linguistic
regions, which still need to be distinguished in language tags.
- some linguistic regions cover several countries. The country distinction is
not very helpful when the real boundaries are those of linguistic areas. I
mentioned Catalan, which is a good example: one of its 4 main dialects is spoken
in 3 countries (Spain, Andorra, France).

Having to use a country code after the main language code but before a region
code is a hack for separating those languages appropriately. Codes like "ca-ES",
"ca-AD", "ca-FR" do not help to make the appropriate distinctions between the
4 main dialects of Catalan.
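
To make the mismatch concrete, here is a minimal sketch in Python. It assumes
tags of the simple "language-COUNTRY" form only, and the dialect labels and
their country sets are placeholders, not real codes: "dialect-1" stands for the
dialect said above to span Spain, Andorra and France, and "dialect-2" is a
purely illustrative second dialect assumed to be spoken in Spain only.

```python
def split_tag(tag):
    """Split a simple 'language-COUNTRY' tag into (language, country or None)."""
    parts = tag.split("-")
    language = parts[0].lower()
    # Treat a 2-letter second subtag as an ISO 3166 country code.
    country = parts[1].upper() if len(parts) > 1 and len(parts[1]) == 2 else None
    return language, country

# Placeholder data (assumption for illustration only).
DIALECT_COUNTRIES = {
    "dialect-1": {"ES", "AD", "FR"},  # one dialect, three countries
    "dialect-2": {"ES"},              # hypothetical second dialect
}

def tags_for_dialect(language, dialect):
    """All language-COUNTRY tags needed to cover one dialect."""
    return sorted(f"{language}-{c}" for c in DIALECT_COUNTRIES[dialect])

def dialects_for_tag(tag):
    """All dialects that a single language-COUNTRY tag could refer to."""
    _, country = split_tag(tag)
    return sorted(d for d, cs in DIALECT_COUNTRIES.items() if country in cs)

print(tags_for_dialect("ca", "dialect-1"))
print(dialects_for_tag("ca-ES"))
```

Under these assumptions a single dialect is scattered over three tags, while
the single tag "ca-ES" cannot say which dialect is meant: the country subtag
carries administrative information, not linguistic information.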

I could say the same thing about the 4 main dialects of Breton, within the same
administrative region of France (Brittany), where the next level of encoding in
ISO 3166 is the numeric department code: the four variants of Breton are not
correctly identified by the purely administrative definition of French
departments (which make absolutely no sense as linguistic regions).

If you think about classifying the various dialects of languages in Borneo,
Africa, Mexico, Brazil or China, you'll find the same caveats: ISO 3166 does not
offer a correct way to encode linguistic regions for use in RFC 3066
language tags...
So we are left with NOT using any RFC 3066 code, and using specific language
codes for these variants instead. Unfortunately, those variants are then not
easy to group together in software that does not consider the specific language
variants and will propose a default "standard" form.
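
The grouping problem can be sketched with the prefix matching that RFC 3066
describes for language ranges: a range matches a tag if it is equal to it, or
if it is a prefix of it followed by "-". The function name below is an
assumption, and "x-catvar1" is a made-up private-use tag standing in for a
variant-specific code; the "*" wildcard range is not handled here.

```python
def matches_range(language_range, tag):
    """RFC 3066-style prefix match of a language range against a tag."""
    language_range = language_range.lower()
    tag = tag.lower()
    return tag == language_range or tag.startswith(language_range + "-")

# A country-qualified tag still groups under the main language code...
print(matches_range("ca", "ca-ES"))
# ...but a separate, variant-specific code (hypothetical here) does not.
print(matches_range("ca", "x-catvar1"))
```

Software that only asks for "ca" would therefore never find content labelled
with the separate code, unless it carries its own table relating the two.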

For correct linguistic classification, it seems then that the Ethnologue
classification would offer a better model, if it proposed an appropriate
encoding and not only a classification by groups and names.

So RFC 3066 language tags (not ISO 639 language _codes_) are for now a nightmare
to handle, and the problem is made even more serious by the inclusion of
ISO 3166, which was clearly not designed for language classification but for
administrative and legal usages...


