Re: Questions re ISO-639-1,2,3

From: JFC (Jefsey) Morfin (jefsey@jefsey.com)
Date: Tue Aug 23 2005 - 11:32:45 CDT

    Philippe,
    The problem in using alpha-3 codes is that they are 3 alpha long. An
    IETF Draft, supported by Doug and Peter, proposes a strict variation
    of the RFC 3066 ABNF (structured format) where subtags are partly
    identified by their size, partly by their relative position. I say
    "variation" because, although it includes some additions (which
    result in changes to the RFC 3066 ABNF), it is not meant to be an
    evolution that would permit other much-needed changes (IMHO) and
    support innovation, for reasons I will not discuss here. The use of
    alpha-3 codes in that ABNF could at some stage be confused with
    other information, all the more so since Internet protocols must
    not treat case as significant.
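
    To make "identified partly by size, partly by position" concrete,
    here is a minimal sketch. It is only an approximation written for
    illustration, not the Draft's actual ABNF: the function name
    classify_subtags, the labels and the simplified length rules are
    all assumptions.

```python
import re

def classify_subtags(tag):
    """Label each subtag of a language tag using only length and position.

    Simplified, hypothetical rules (not the Draft's real grammar):
      first subtag, 2-3 letters      -> language
      4 letters, before any region   -> script
      2 letters or 3 digits          -> region
      5-8 alphanumerics, or 4 chars
      starting with a digit          -> variant
      anything else                  -> ambiguous
    """
    subtags = tag.lower().split("-")  # case carries no meaning
    labels = []
    seen_script = False
    seen_region = False
    for index, subtag in enumerate(subtags):
        if index == 0:
            kind = "language" if re.fullmatch(r"[a-z]{2,3}", subtag) else "ambiguous"
        elif (re.fullmatch(r"[a-z]{4}", subtag)
              and not seen_script and not seen_region):
            kind = "script"
            seen_script = True
        elif re.fullmatch(r"[a-z]{2}|[0-9]{3}", subtag) and not seen_region:
            kind = "region"
            seen_region = True
        elif re.fullmatch(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}", subtag):
            kind = "variant"
        else:
            kind = "ambiguous"
        labels.append((subtag, kind))
    return labels

print(classify_subtags("sr-Latn-CS"))
print(classify_subtags("zh-yue"))
```

    Under these simplified rules a three-letter subtag that follows
    the primary language has no length of its own to claim, so the
    sketch can only label it "ambiguous". That is the kind of
    confusion alpha-3 codes could introduce.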

    This calls for several considerations:

    - this Draft wants to make this format the sole format to be used in
    the IANA registry. Worryingly, this leaves only two possibilities if
    you are not satisfied with that particular format: to defeat the
    Draft, or to build an open alternative to the IANA registry (I was
    committed to supporting the Draft ABNF as well, as one of the
    deprecating propositions, and to working on the necessary
    distribution and extension of the IANA system).

    - the format lacks several important pieces of information, such as
    the referent of the language (is it English, Basic English, by which
    publisher, using which dictionary, etc.), the context of the exchange
    (style, special words, etc.) and the date of standard reference
    (which may not be the date of the document, which is often ignored
    anyway).

    - the format is supposed to be multimodal, but only limited script
    information is supported (founts are not documented) and no space is
    reserved for voice, sign or icon attributes.

    - but most of all this proposition does not consider the designated
    content from the perspective of networked relational exchanges. This
    is a very important point when designating a language. Languages
    were never made to be identified but to be used. They were made to
    permit face-to-face relations. They were extended (in distance,
    audience and time) through scripts. Today they are being extended by
    an evolution far more complex than the one from voice to script.
    Scripts introduced memory and communication. Communication has
    totally changed today, as has memory. Scripts are much more complex
    and have changed. The introduction of relational services changes
    the nature of the exchanges. The languages themselves change in
    nature as multilingualism extends the capability of language
    negotiation and adaptation, from language to language and therefore
    within what was once understood as a single language. The number of
    terms to be used/known is drastically extended too, and as a result
    leads to various views (and not versions) of a language.

    Languages are brain to brain interintelligibility protocols. To want
    to describe the language and cultural evolution, which tries to
    support the increase of exchanges (number, density, complexity), with
    designations of the preceding language era (script), is awkward. It
    would be like trying to describe the Internet using a postal
    paradigm (I use this because this is, to date, unfortunately the main
    problem of the end-to-end interoperability layer). Like every
    protocol, languages have parameters. These parameters can include the
    country codes - the interest of a numeric code of some size is its
    stability, its multilingualism and its script independence.

    Another problem we face in trying to build information databases
    rather than object databases (I suggest you consider the ISO 11179
    effort - not the result but the area of concern in JTC 1/SC 32) is
    the versatility of the content. We still live with the idea that we
    use "texts". We actually use "architexts" (what is going to produce
    the vision/version of the text we use, and more and more the
    interaction of our rendering tools). If you say you do not want to
    consider computer languages, as the IETF Draft does, you deprive
    yourself of the very HTML, XML, etc. you want to document: it is an
    architext and uses a computer [ASCII] language (bravo bisharat!).
    The same architext may include successive information related to
    several countries, regions, ethnolinguistic zones, etc., and
    languages. They will have to be decoded by an OPES (open pluggable
    edge service) reader. The IETF Charter adequately quotes the
    relation with the locale, but the locale itself is subject to a
    possibly complex, versatile and adaptive negotiation and to
    interrelation with the other systems the computer is related to.

    Trying to manage this information with script/text related concepts,
    even when overloading them with a lot of information, would be like
    wanting to ride a bicycle on a highway.

    ISO 639-1, -2 and -3 are not appropriate to support this. They are
    however all that we have, as long as ISO 639-6 is not available.
    ISO 3166 is not appropriate either; it is however a localisation
    tool of interest, being the most used ISO standard. But others like
    ISO 3166-2, E.164, X.121, geographical coordinates, etc. are of use.
    What the IETF Draft should have provided was an ISO 3166 equivalent
    adapted to the Multilingual Internet. This work is still to be done:
    it has unfortunately been delayed (I started working on a Draft
    addressing the need 13 months ago), but at the same time the
    (sometimes hot) debate over the IETF Draft was not a complete waste,
    as it provided some good experience.

    But we now have to leave the bicycle in peace and to look for some
    good Ferrari/Renault.

    jfc

    At 10:32 23/08/2005, Philippe Verdy wrote:
    >From: "Doug Ewell" <dewell@adelphia.net>
    >>ISO 3166-1 alpha-2 and alpha-3 code elements are almost identical in
    >>their stability (or lack thereof). I can find no instances in the
    >>31-year history of ISO 3166 where an alpha-3 code element was changed
    >>while the corresponding alpha-2 code was left unchanged. (If you can
    >>find one, please accept my apologies.)
    >
    >Yes, alpha-3 codes can change for a country, but in fact alpha-3
    >codes have still not been reassigned to different countries, unlike
    >alpha-2 codes. So a change of alpha-3 code just turns the old
    >official code into an alias.
    >
    >For example ROM changed to ROU, but ROM was not reassigned to another country.
    >
    >The reassignment of alpha-2 codes to different countries is the
    >main problem for use in locale codes, which require longer stability
    >than dated statistics.
    >
    >What this means is that the alpha-2 codes need to be dated to be
    >disambiguated.
    >
    >>The numeric code elements (henceforth "codes"), which are really UN
    >>codes rather than ISO codes
    >
    >That's what I said (UNSD means the United Nations Statistics
    >Division, if this was not clear).
    >
    >>are usually considered more stable, but it
    >>depends on what kind of stability you are looking for. ISO alpha codes
    >>change when the name of a country changes (or whenever the country feels
    >>like changing it; see Romania). UN numeric codes change when the
    >>territory covered by the code changes. Normally the latter event is
    >>less frequent than the former, but the reverse can also happen; in 1993,
    >>the numeric code for Ethiopia changed from 230 to 231 (because of the
    >>loss of territory to Eritrea) while the alpha codes remained ET and ETH.
    >
    >OK, but 230 has *still* not been reassigned (it easily could be,
    >given the much smaller encoding space for numeric codes, which are
    >geographically structured), so it has become an alias for Ethiopia.
    >Such an alias remains valid for references to documents speaking
    >about the country before the split, or composed with localization
    >meta-data. Of course, documents speaking about the country after the
    >split should use the new code, to avoid the ambiguity with Eritrea,
    >but this would not invalidate the past references. This would be
    >true for any country code, including the IOC 3-letter country codes
    >or other standards.
    >
    >My opinion is that the UNSD wants to keep the possibility of making
    >historical searches in its data, without mixing in the same result
    >list the statistics of unrelated countries or territories. This is
    >however less of a problem for the UN, given that statistics are
    >necessarily dated (this is not the case for many documents needing
    >locale code markup or meta-data).
    >
    >
    >
    >
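
    The two ideas in the quoted exchange, dating reassigned alpha-2
    codes and keeping withdrawn codes as aliases, can be sketched as
    follows. This is a hypothetical illustration, not an official
    registry: the names ASSIGNMENTS, SUPERSEDED_BY, resolve_alpha2 and
    current_code are invented. The "CS" rows are an assumed example of
    a reassigned alpha-2 code that is not taken from this thread, and
    their dates are approximate to the year. The ROM/ROU and 230/231
    pairs come from the messages above.

```python
from datetime import date

# (alpha-2 code, valid from, valid until, designated country).
# Illustrative rows only; year-level dates, None means still assigned.
ASSIGNMENTS = [
    ("CS", date(1974, 1, 1), date(1993, 1, 1), "Czechoslovakia"),
    ("CS", date(2003, 1, 1), None, "Serbia and Montenegro"),
]

# Withdrawn codes that were never reassigned, kept as aliases
# pointing at the code that replaced them.
SUPERSEDED_BY = {
    "ROM": "ROU",  # Romania, alpha-3
    "230": "231",  # Ethiopia, numeric (changed in 1993)
}

def resolve_alpha2(code, on):
    """Return the country an alpha-2 code designated on a given date."""
    code = code.upper()
    for assigned, start, end, country in ASSIGNMENTS:
        if assigned == code and start <= on and (end is None or on < end):
            return country
    return None  # the code was not assigned on that date

def current_code(code):
    """Map a withdrawn code to its replacement; others pass through."""
    code = code.upper()
    return SUPERSEDED_BY.get(code, code)

print(resolve_alpha2("CS", date(1990, 6, 1)))
print(resolve_alpha2("CS", date(2005, 8, 23)))
print(current_code("ROM"))
```

    The alias table needs no date because those codes were never
    reassigned; the alpha-2 lookup cannot work without one.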


