Re: Questions: locales; CLDR process; ISO-639 (again)

From: Donald Z. Osborn
Date: Wed Mar 01 2006 - 01:06:17 CST

    John, Thanks for this reply. I respond in text below...

    Quoting John Cowan:

    > Disclaimer: I speak only for myself, not for ISO, IETF, Unicode, or
    > any of their components.
    > Donald Z. Osborn scripsit:
    >> 2a. [...] my thought is that in cases where the ISO-639-1 or 2
    >> coded language has variants in ISO/DIS-639-3 defined more or less by
    >> country, it makes sense to use the 1 or 2 code plus the country code
    >> rather than the 3 code.
    > If Ethnologue divides them into separate languages, that means they are
    > more than just national variants, even if they happen to be separated
    > by a national border. In any case, it's a good bet that at least
    > some populations of speakers are on the "wrong" side of the border,
    > and will want to use their own variety of the language, but with the
    > cultural conventions (time zone, currency, whatever) of the country
    > where they reside. So I would recommend *against* using country codes
    > to discriminate between languages.

    We run into the issue of "what is a language?" (which we don't need to debate
    here other than noting that there are differences of opinion among experts),
    and more importantly what are the practical levels of distinction among
    different tongues (call them closely related languages [in a macrolanguage
    or a cluster], dialects or whatever) necessary for localization.

    One of the reasons *for* using country codes at some level is that for a
    number of the many crossborder languages (or macrolanguages) the
    orthographies are set by national authorities, and some vocabulary may
    differ based on colonial heritage (in Africa, borrowings from English or
    French, for instance). The latter may be accounted for by the language
    categories of Ethnologue (and ISO/DIS-639-3) but the former, in an
    environment where text is the main content, seems unavoidable.

    Also, I note that the locale form needs both a language code and a country
    code. I am not trying to argue here, only to understand how best to use the
    system and all the various codes.

    (BTW, your turn of phrase "speakers are on the 'wrong' side of the border,"
    which I realize is just a turn of phrase, reminds me of one aspect of
    Ethnologue's presentation that I am not fond of - in every case they seem
    obliged to say "x [language], a language of y [country]" when in so many
    cases, especially in Africa, it's unnecessary and misleading to try to put
    a language into such a box. But this is tangential to the issue here.)

    > Work on RFC 3066ter, which will incorporate ISO 639-3 tags, has not yet
    > formally begun. The intention of most of the various players, however, is
    > to use a design in which a language encompassed by a 639-3 macrolanguage
    > will have a two-part language subtag, of the form zh-yue (Cantonese).
    > So 639-3 code elements for languages that are *not* macrolanguages will
    > be added directly, but code elements like yue will not: yue will only
    > exist in Internet language tags as part of the compound subtag zh-yue.

    Thanks for this clarification. Actually the "nesting" of the '3 codes under
    a '1 or a '2 code makes a lot of sense. Two questions:
    1) Can one file a locale before 3/15 using this format "ff-ffm-ML" even
    though the design is not yet official?
    2) If not, would this imply that it is better to make a locale for a
    variant of a "macrolanguage" using a '1 code or, if not available, a '2?
    So: ff-ML and not ffm-ML (leaving the refinements with the '3 codes until
    later)?

    Beyond that I see that there may be a lot of discussion on the roles and
    use of the different codes in the case of different (macro)languages. In
    the case of Arabic, for example, would a simple ar-EG be enough or would
    you need (or alternatively want to rule out) ar-arz-EG (arz=Egyptian spoken
    Arabic), while at the same time allowing perhaps that less widely spoken
    dialects in the country be noted?
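
    To make the pieces of a tag like ar-arz-EG or ff-ML concrete, here is a
    rough sketch that splits a tag into language, nested '3 code, and country
    parts purely by subtag length. The length rule and the name split_tag are
    my own assumptions for illustration, not the official grammar, and the
    compound form is still only a proposal.

```python
def split_tag(tag: str) -> dict:
    """Split a tag such as 'ar-arz-EG' into its parts by subtag length.

    Hypothetical helper: assumes the first subtag is the language, an
    optional 3-letter subtag is the nested 639-3 code, and an optional
    2-letter subtag is the country. Anything else is rejected.
    """
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "extlang": None, "region": None}
    for part in parts[1:]:
        if not part.isalpha():
            raise ValueError(f"unexpected subtag: {part!r}")
        if len(part) == 3 and result["extlang"] is None and result["region"] is None:
            result["extlang"] = part.lower()   # e.g. arz, ffm
        elif len(part) == 2 and result["region"] is None:
            result["region"] = part.upper()    # e.g. EG, ML
        else:
            raise ValueError(f"unexpected subtag: {part!r}")
    return result


print(split_tag("ar-arz-EG"))
print(split_tag("ff-ML"))
```

    Under this reading, ff-ML and ff-ffm-ML differ only in whether the nested
    '3 code is filled in, which is why leaving it for later seems workable.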

    >> 2b. An example is Kpelle spoken in the Liberia-Guinea border area (it is
    >> also known as Guerze in Guinea). There is an ISO-639-2 code, "kpe," and
    >> separate ISO/DIS-639-3 codes for Kpelle of Liberia, "xpe," and Kpelle or
    >> Guerze of Guinea, "gkp." My thought is that "kpe-LR" & "kpe-GN" are
    >> preferable to "xpe" and "gkp" for locales.
    > The RFC 3066ter language tags will be (unless something changes radically)
    > kpe-xpe and kpe-gkp. The effect of this is that documents tagged with
    > either code will match an attempt to find "kpe" documents.

    Yes, this makes sense, and by extension so would at least many others
    (ff for Fulfulde/Pulaar, man for Manding - at least the western tongues, ...).
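
    As I understand the matching John describes, it amounts to comparing
    leading subtags. A minimal sketch, assuming simple prefix matching on
    hyphen-separated subtags (the name matches_range is hypothetical, and
    kpe-xpe / kpe-gkp are the proposed forms, not official ones):

```python
def matches_range(tag: str, language_range: str) -> bool:
    """Return True if `tag` equals the range or extends it by whole subtags.

    Hypothetical helper: prefix matching on hyphen-separated subtags,
    compared case-insensitively.
    """
    tag_parts = tag.lower().split("-")
    range_parts = language_range.lower().split("-")
    return tag_parts[:len(range_parts)] == range_parts


# Proposed (not yet official) compound tags from the discussion above,
# plus a bare 639-3 code for comparison.
documents = ["kpe-xpe", "kpe-gkp", "kpe-LR", "xpe", "ff-ML"]
kpelle = [t for t in documents if matches_range(t, "kpe")]
print(kpelle)
```

    Note that a document tagged with the bare code xpe would not be found by
    a search for "kpe" under this rule, which is the practical argument for
    the compound form.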

    But today, if we were filing two locales for Kpelle, what would be the best
    coding? I'm assuming that kpe-LR and kpe-GN would be the best (or least bad)
    choices even if later the xpe and gkp have to be added?

    >> 2c. Part of this gets back to the definition of what is a language, but
    >> for purposes of software localization it may be simpler to go for the
    >> higher level of aggregation and distinguish by country (which it seems
    >> one has to do anyway). Even this may not be satisfactory in all cases as
    >> there are often significant dialect (or language) differences in a
    >> language (or "macrolanguage" in SIL's system) within a country.
    > For that case, RFC 3066bis (which is partly in effect now, though not
    > entirely) provides machinery for adding subnational or non-national
    > variety subtags: en-gb-scouse, for example, is the Scouse (Merseyside)
    > dialect of U.K. English.

    So could we use kpe-xpe and kpe-gkp, or are kpe-xpe-LR and kpe-gkp-GN,
    however redundant, better?

    I need to backtrack here before moving on. When I think of an OpenOffice suite
    localized in Kpelle for example - even though I don't speak a word of it and
    know of no current effort to write a locale - I would think that kpe by itself
    would suffice. Granted there are differences but in general I think that there
    will always be an effort to write the software for the highest level of
    aggregation, crossing borders and dialect (or language-within-macrolanguage)
    differences. What's true for FOSS is also true for MS (noting that for example
    an Inuktitut localization of Windows was conceived of for all variants).

    So another question (sorry these are accumulating) is what kpe-xpe-LR and
    kpe-gkp-GN locales would offer to a group localizing for Kpelle "kpe" as a
    transborder, multidialect (macro)language?

    >> 4. Going back to ISO-639 in general [...]
    >> What is the latest on all this?
    > I think, but I am not sure, that no new 639-1 codes can be added after
    > 639-3 goes into effect. (In principle, a language missed by 639-3 could
    > be added simultaneously to -1, -2, and -3, but the chance that such a
    > language both has been missed and meets the criteria for -1 is small.)
    > Any 639-3 language could be added to 639-2, using the same code element
    > for it in both parts of the standard.

    I'm thinking that language change, planning, and engineering would call for
    some flexibility on this. Dialect levelling, adoption of standard versions
    for literacy and instruction in schools, grouping of closely related
    tongues (as in the case of Runyakitara, which is designed for teaching but
    is not [yet?] a macrolanguage listing), and indeed localization efforts,
    all mean a shifting

    Add to that the facts that there are "clusters" of languages that are closely
    related but not identified as part of a larger grouping (macrolanguage) and
    that at least one agency, CASAS, is researching the bases for
    standardization / harmonization of some of these, and it would seem that
    the overall language situation is dynamic.

    I apologize for being so wordy, but there seem to be a lot of issues involved.
    Of the many questions, the urgent ones are those that would help in the
    filing of locales for a number of African languages in the next couple of
    weeks (!)

    Thanks again.

