Re: Support of ISO 639 (was: Survey Tool pre-alpha)

From: verdy_p (
Date: Wed Nov 26 2008 - 13:09:39 CST

    "Roozbeh Pournader" wrote:
    > Two short comments:
    > * I don't think we should have a fixed set of languages for all locales.
    > One need to translate neighboring languages for each locale. For
    > example, I believe we need to provide translations for Mazandarani, a
    > local language of Iran, in the Persian locale, but not for the Japanese
    > locale. This means that the Survey tool should have some way of adding a
    > new code-name pair to a locale, for example.
    > * Generally, we should be careful about ISO 639-3 "leaks" everywhere in
    > CLDR. I really believe that every time we add some ISO 639-3 related
    > data to CLDR, we should ask ourselves "Am I sure? Isn't there a way to
    > do this the old way?"

    If only we could have some access to ISO 639-5 data (for managing the language families) instead of using the
    historic and badly designed language collections of ISO 639-1 (code [bh] only) and ISO 639-2...

    Also I'm still waiting to see how ISO 639-5 can be integrated with the RFC 4645bis and RFC 4646bis rules.

    I have the same problem when managing and trying to organize collections of languages in Wiktionary, and trying
    to disambiguate those languages.

    I've been able to do this work for about 800 languages: all of ISO 639-1 and all of ISO 639-2, but there are still
    problems with about 20-30 ISO 639-3 languages that are already used and referenced, due to the lack of information
    about ISO 639-5.

    Of course the linked references to The Ethnologue from the ISO 639-3/MA can help, but it's still not
    easily accessible, and The Ethnologue still uses its own local (and unstable) numeric identifiers for its language
    The issues found in are very similar to those that the CLDR project has to manage with multiple
    locales: it has to deal with language collections, macrolanguages, individual languages, and their varieties
    (multiple scripts or orthographies, regional dialects...) and find mappings between its own language codes (based
    on ISO 639) and Wikimedia's own "interwiki" codes, which are not always equivalent or which group several
    languages in a single project database.

    I would be very pleased to get some data showing how ISO 639-3 codes (or their documented equivalents in ISO 639-1
    or ISO 639-2) are grouped in families, and whether the ISO 639-2 collections are kept in ISO 639-5.

    Anyway, the publication of ISO 639-3 (and its well-defined mapping to ISO 639-1/2 for individual languages and
    macrolanguages only) is a very important step that has helped solve many ambiguities (removing the need for many
    locally invented codes or extensions).
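
    As a rough illustration of what that mapping gives, here is a minimal sketch in Python. The codes are real
    ISO 639 identifiers, but the tables are a tiny hand-picked subset, and the names ISO639_3_TO_1,
    MACROLANGUAGE_OF, shortest_code and macrolanguage are assumptions made up for this example, not part of any
    published data set or tool.

```python
# Tiny illustrative subset; the real ISO 639-3 tables are far larger.
# Table layout and function names are assumptions for this sketch.
ISO639_3_TO_1 = {
    "eus": "eu",  # Basque (individual language)
    "fas": "fa",  # Persian (macrolanguage)
    "zho": "zh",  # Chinese (macrolanguage)
    "ara": "ar",  # Arabic (macrolanguage)
}

# Individual language -> macrolanguage that contains it.
MACROLANGUAGE_OF = {
    "pes": "fas",  # Iranian Persian
    "prs": "fas",  # Dari
    "cmn": "zho",  # Mandarin Chinese
    "yue": "zho",  # Yue Chinese
    "arb": "ara",  # Standard Arabic
}


def shortest_code(code3):
    """Return the ISO 639-1 code if one is known, else the ISO 639-3 code unchanged."""
    return ISO639_3_TO_1.get(code3, code3)


def macrolanguage(code3):
    """Return the containing macrolanguage code, or None if the table lists none."""
    return MACROLANGUAGE_OF.get(code3)
```

    With these tables, Mazandarani ("mzn") has no two-letter equivalent and keeps its ISO 639-3 code, which is
    the kind of case where the part 3 code is simply needed.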

    I hope that ISO 639-5 will do the same for mapping ISO 639-2 collections (plus the ISO 639-1 [bh] code for Bihari
    which is the only existing collection in that part of the standard).

    Finally I would be pleased to see what ISO 639-6 intends to do and how it will work with orthographic varieties
    and dialects (below the level of individual language).

    There are some pages (with sortable tables) that I created based on ISO 639-3 on French
    * (complete code set, except special codes)
    * (complete code set, except special codes and those that are
    defined in part 1)
    * (it will never be complete under this form as a single
    table, due to the huge number of names that it would have to support, but it is progressing, it won't contain any
    language with codes assigned in parts 1 or 2)

    The tables above for ISO 639-1 and ISO 639-2, by contrast, are complete and give the names exactly as they
    are published by ISO 639-1/MA and ISO 639-2/MA, and lots of discussions on Wiktionary or elsewhere have led to
    the choice of the preferred name (sometimes different from all French names suggested by the ISO 639 standard).

    The French names for the ISO 639-3 table above are not authoritative (they don't exist or are not published by ISO
    639-3/MA). The table merely collects some names that were contributed by various people or found through various
    Internet and book searches, and the "default name" is also not definitive (corrections are expected there). There
    are certainly some typos remaining (and some synonyms to remove, as they were imported too directly from English)
    but that's something that you won't find, currently, on the ISO 639-3/MA website.

    When I have some time, I'll add more information to these tables (notably the code mapping to other parts of the
    ISO 639 standard).

    But from what I have seen, it is NOT reasonable to think that any ISO 639-3 code used in CLDR is a "leakage". These
    are codes effectively needed for modern use, and not mergeable into codes of parts 1 or 2 of ISO 639 (and certainly
    not into subcodes of codes defined as undefined "collections" in ISO 639-2, given that most of these codes should
    be completely deprecated).

    What CLDR (along with RFC 4645bis / RFC 4646bis) should rather work on is integrating ISO
    639-5 (and dropping ISO 639-2 collections). There's nothing to drop for now from ISO 639-3 (except those codes that
    have been retired from the early initial publications of ISO 639-3, like the code for "Souletin" merged into the
    code for the "Basque" individual language, or codes that were split). Support for ISO 639-6 (varieties and
    dialects) will come after this, but an early policy should be adopted about the form of the codes that will be
    assigned and that other systems need to reserve for this use, and how these codes will be allocated and organized
    (using common prefixes directly associated with their parent individual language, or using a mapping table?)
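
    To make the two allocation options concrete, here is a minimal sketch in Python. Everything in it is
    hypothetical: the variety codes "eus-sou" and "xsou" are invented for illustration and are not real ISO 639-6
    assignments, and the function and table names are assumptions. Only "eus" (Basque) is a real ISO 639-3 code.

```python
# Option 1: the variety code carries its parent as a common prefix.
# Assumes (hypothetically) that the first three letters are the
# ISO 639-3 code of the parent individual language.
def parent_from_prefix(variety_code):
    return variety_code[:3]


# Option 2: variety codes are opaque and a separate mapping table
# records the parent. "xsou" is an invented code, not a real one.
PARENT_TABLE = {
    "xsou": "eus",  # hypothetical code for the Souletin variety of Basque
}


def parent_from_table(variety_code):
    # Returns None when the table has no entry for the code.
    return PARENT_TABLE.get(variety_code)
```

    The prefix form needs no extra data but ties a code to one parent forever; the table form lets a variety be
    reassigned later, at the cost of shipping and maintaining the table.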

