Re: CLDR and locale designations (was: [OT] Even viruses are now i18n!)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Apr 23 2004 - 10:49:04 EDT

    From: "Antoine Leca" <Antoine10646@leca-marti.org>
    > > Never forget that language codes and country/territory codes are
    > different...
    >
    > We were speaking about ccTLD. A different beast. Try to resolve ANYTHING.GB.
    > on a root server, or alternatively to seek UK in ISO 3166, to understand
    > what I mean.

    I'm not speaking about ccTLDs either... but a domain name ending in .gb or .fx
    could be valid if some DNS record exists for it. ccTLDs inherit from legacy
    assignments by IANA, but even today, the IANA and RIR databases contain
    references to both the GB and UK country/territory codes.

    Look closely at ISO 3166, and you'll see that both [UK] and [GB] are
    reserved even if only GB is assigned. You'll see other entries used by ITU (such
    as [EA] for Ceuta and Melilla, two small Spanish enclaves on the Moroccan coast,
    with a status similar to Gibraltar, a British dependency on the Spanish coast
    which has its own assignment in ISO 3166; look also at [DG], which is used by
    ITU for Diego Garcia, even though it is part of the British Indian Ocean
    Territory with ISO 3166 code [IO]).

    ISO 3166 has its imperfections, but at least it contains enough references to
    reserve all the codes used by IANA and in ccTLDs, as well as some non-territory
    codes used for groups of countries by WIPO...

    Now consider that software actually rarely needs country/territory codes for
    its internationalization, but rather needs some code to differentiate scripts
    and script variants (such as Latin versus Cyrillic Serbian, or Traditional
    versus Simplified Chinese). You'll see the pitfalls this creates in
    internationalized software when one has to set the locale code to zh_TW to
    refer to Traditional Chinese, even when the intent is to address language
    variants used in areas other than Taiwan. Which code must be used to create
    resources in Serbian Cyrillic? [sh_YU], [sh_CS], [sr_CS]? How can we avoid the
    confusion with Latin-script versions?
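
    A minimal sketch of the workaround described above, in Python: since the
    identifier has no slot for a script, software ends up keeping a private table
    from (language, territory) to the script it assumes. The table contents and the
    names IMPLIED_SCRIPT, split_locale and implied_script are illustrative
    assumptions, not taken from any standard or library.

```python
# Hypothetical table: which script a program might *assume* for a
# language_TERRITORY locale. The entries are illustrative only.
IMPLIED_SCRIPT = {
    ("zh", "TW"): "Traditional",
    ("zh", "CN"): "Simplified",
}


def split_locale(locale_id):
    """Split 'll_TT' into (language, territory); territory may be None."""
    language, _, territory = locale_id.partition("_")
    return language.lower(), (territory.upper() or None)


def implied_script(locale_id):
    """Return the assumed script, or None when the identifier cannot tell."""
    return IMPLIED_SCRIPT.get(split_locale(locale_id))
```

    With such a table, implied_script("zh_TW") gives "Traditional", but
    implied_script("sr_CS") gives None: nothing in the identifier separates
    Cyrillic resources from Latin ones.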

    In fact the problem is not in ISO 3166, but in RFC 3066 for the designation of
    locales. This comes from imperfections in the ISO 639 standard, which has great
    difficulty encoding languages... and even more when it needs to make
    distinctions between languages written in several scripts (thankfully we now
    have codes for scripts, maintained by Unicode, but there is currently no support
    for them in locale identifiers...)
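
    To make the missing piece concrete, here is a sketch of what a locale
    identifier with a script field could look like. The language_Script_TERRITORY
    syntax and the name parse_locale are assumptions for illustration only; RFC
    3066 defines no such field. The four-letter values (Cyrl, Latn, Hant) follow
    the form of the existing script codes.

```python
import re

# Hypothetical syntax: language, optional 4-letter script, optional territory.
_LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:_(?P<script>[A-Z][a-z]{3}))?"
    r"(?:_(?P<territory>[A-Z]{2}))?$"
)


def parse_locale(locale_id):
    """Return (language, script, territory); missing fields are None."""
    match = _LOCALE_RE.match(locale_id)
    if match is None:
        raise ValueError(f"unrecognized locale identifier: {locale_id!r}")
    return (
        match.group("language"),
        match.group("script"),
        match.group("territory"),
    )
```

    Under this assumed syntax, "sr_Cyrl_CS" and "sr_Latn_CS" would name two
    distinct resource sets, while plain "sr_CS" still parses but leaves the
    script unspecified.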

    Country/territory codes are too unstable to correctly tag the language used
    in documents and applications, but the combination of ISO 639 and ISO 3166 is
    for now the only widely supported alternative. So within locales, the ISO 3166
    country/territory code has lost its original function of designating a
    territory. Instead it designates a language variant.

    I also think of the case of Norwegian [no], which has two major variants:
    Bokmål, the traditional "book" orthography, and Nynorsk, the reformed "new"
    language; in ISO 639 we find new codes [nn] for Nynorsk and [nb] for Bokmål.
    Imagine the complication for software that should run with a Norwegian UI.
    Which code should be used?
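
    One way software might cope is an explicit fallback chain when looking up
    resources. This is only a sketch: the chain ordering below is an assumption
    made for illustration, not a rule from ISO 639, and FALLBACKS and
    pick_resource are hypothetical names.

```python
# Hypothetical fallback chains; the ordering is an assumption for illustration.
FALLBACKS = {
    "nb": ["nb", "no"],
    "nn": ["nn", "no"],
    "no": ["no", "nb", "nn"],
}


def pick_resource(requested, available):
    """Return the first available code in the fallback chain, or None."""
    for code in FALLBACKS.get(requested, [requested]):
        if code in available:
            return code
    return None
```

    For example, a request for [nb] against resources tagged only [no] would
    resolve to "no", while a request for [nn] against resources tagged only
    [nb] would find nothing, since this sketch never substitutes one written
    variant for the other.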

    We also find [ax] for the Åland variant of Swedish spoken in the Åland Islands
    [AX], a dependency of Finland [FI]. Some software incorrectly assumes that this
    language is Finnish when it is in fact a variant of Swedish [sv]. Should
    software use [sv] or [ax]? Some software has chosen [sv_FI] to refer to the
    Åland language, because it really is the Swedish language spoken in a part of
    Finland... How can those rules be inferred by locale-aware software or
    systems?
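
    As a sketch of why such rules have to be written down rather than inferred: a
    program that guesses the language from the territory alone needs an explicit
    exception for Åland. The tables and the name guess_language are hypothetical;
    the entries only restate the example above.

```python
# Naive default: one language per country/territory code.
DEFAULT_LANGUAGE = {"FI": "fi", "SE": "sv"}

# Explicit exception: a system that files Åland under FI would
# otherwise answer "fi" for it.
OVERRIDES = {"AX": "sv"}


def guess_language(territory):
    """Guess a language code from a territory code, or None if unknown."""
    territory = territory.upper()
    if territory in OVERRIDES:
        return OVERRIDES[territory]
    return DEFAULT_LANGUAGE.get(territory)
```

    Every such exception has to be maintained by hand, which is exactly the
    kind of knowledge the two-part identifier cannot carry by itself.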



    This archive was generated by hypermail 2.1.5 : Fri Apr 23 2004 - 11:38:35 EDT