RE: Questions re ISO-639-1,2,3

From: Donald Z. Osborn (dzo@bisharat.net)
Date: Tue Aug 23 2005 - 10:35:30 CDT

    Peter, Thanks for your questions and thoughts. I reply in text, below...

    Quoting Peter Constable <petercon@microsoft.com>:
    > > From: Donald Z. Osborn [mailto:dzo@bisharat.net]
    >
    DZO> > 1) It is seen as convenient to have a one-stop site for
    > > various information relevant to localization. ...
    ...
    PC> I have no problem with citing ISO 639 IDs for particular languages: that
    > is something that we expect to be stable. It's quite another thing,
    > however, if we're talking about a general listing of language
    > identifiers. For the latter, I feel that people should refer to
    > definitive sources: the official source or an approved mirror.

    I think we agree - I was thinking that perhaps an additional way for the
    official sites to make the data available (dynamically) to other sites could be
    useful.

    > > 2) ..., but there are gaps
    > > and there is no search feature for the codes.
    >
    > Could you give me an example of what you're referring to by "gaps"?

    Missing ISO-639-1 codes. Maybe gap was not the right word.
     
    > A UI for searching could certainly be considered. I suspect that the RA
    > for 639-3 will be open to considering additional UI features to meet
    > user needs.

    Sounds like a good idea. What about a collaboration with the RA for ISO-639-1&2
    (the LOC) on a definitive site for all three listings? (Regardless of whether
    anything comes of the data feed idea.) From a localization point of view it
    would make more sense than to have two separate authoritative sites.

    > > 3) ISO-639 data fed from the official sites could facilitate
    > > devising a kind of relational database linking it to alternate
    > > names for languages and perhaps groupings of languages.
    >
    > It would seem to me that such a relational database will either rely on
    > an internal data table, in which case a downloadable file is what is
    > wanted, or URIs pointing to record for particular languages that are
    > available on the Internet. If that's what you're referring to, then the
    > ISO 639-3 RA will provide that. But that does not imply a need for other
    > sites to present mirrors or duplicates of the 639 code tables.

    I was thinking - and again this is extrapolating maybe too far from a single
    thought - that there might be diverse ways to present the data, that is the
    codes of ISO-639, and that others might see a need for doing so on their own
    sites. In such a case the issue would be assuring that the codes are accurate
    and always up to date. We may not see the need, and indeed maybe no one else
    would, but there apparently are people interested in presenting duplicate
    tables/lists for their own reasons, and from among them maybe some would work
    the data into more dynamic presentations of their own.
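As a rough illustration of keeping a duplicate list accurate and up to date, a site could periodically compare its local copy against the official table. This is only a sketch: the function name `diff_code_tables` and the sample entries are hypothetical, and it assumes both tables have already been loaded as code-to-name mappings.

```python
def diff_code_tables(official, local):
    """Compare a local copy of a code table against the official one.

    Both arguments map a language code to its name.
    """
    official_codes = set(official)
    local_codes = set(local)
    shared = official_codes & local_codes
    return {
        # Codes the official table has but the local copy lacks.
        "missing": sorted(official_codes - local_codes),
        # Codes the local copy has but the official table does not.
        "extra": sorted(local_codes - official_codes),
        # Codes present in both whose names differ.
        "changed": sorted(c for c in shared if official[c] != local[c]),
    }


# Hypothetical sample data, using names as spelled in this thread.
official = {"ff": "Fulah", "bm": "Bamanan", "dyu": "Jula"}
local = {"ff": "Fula", "bm": "Bamanan", "xx": "Placeholder"}
report = diff_code_tables(official, local)
```

A non-empty entry under any of the three keys would signal that the local copy has drifted from the source it duplicates.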

    > > 3a) Say you were looking for the code for Pulaar...
    >
    > I think the idea of a search facility that can use alternate names such
    > as those listed in Ethnologue would be a great idea. Of course, if the
    > list of alternate names is incomplete, the user won't find what they're
    > looking for, and it's unlikely a list could ever be complete.

    Agreed. The issue is not being exhaustive as much as being more complete. A
    "sounds like" search facility would be ideal for African languages, given that
    there are so often variant spellings.
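One possible shape for such a "sounds like" lookup is sketched below using the classic Soundex code. This is an assumption for illustration only: Soundex was designed for English surnames and may well be a poor fit for African language names, and the small name table here is hypothetical sample data, not an official list.

```python
# Soundex digit assigned to each consonant group.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex key for a name ("" if no letters)."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return (key + "000")[:4]


# Hypothetical sample table: name as spelled in this thread -> code.
NAMES = {"Pulaar": "fuc", "Fulah": "ff", "Bamanan": "bam", "Jula": "dyu"}


def sounds_like(query):
    """Return the codes of all names whose Soundex key matches the query's."""
    target = soundex(query)
    return sorted(code for name, code in NAMES.items()
                  if soundex(name) == target)
```

With this table, `sounds_like("Pular")` and `sounds_like("Poular")` both find the entry for Pulaar, since all three spellings reduce to the same key.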

    > > they would come up with an ISO-639-3 code for Pulaar but still be
    > > ignorant of the ISO-639-1&2 codes for "Fulah/Peul" that might
    > > actually serve the purpose intended by the user.
    >
    > If you're referring to the online version of Ethnologue, I suspect that
    > once the 639-3 is launched the language descriptions on the Ethnologue
    > site will provide hotlinks to information for that language on the ISO
    > 639-3 site (and that would include association with macrolanguage
    > categories).

    I meant the SIL site on the ISO/DIS-639 (perhaps we're writing of the same thing
    but the latter is of course at sil.org, not ethnologue.com). And actually I'm
    imagining a simpler case of a localizer or webmaster looking for the right
    language tag (in my hypothetical example, looking first at the authoritative
    site of the RA for ISO-639-2 (and 1, so it seems), loc.gov, and then at the
    sil.org site). If the latter has these features, then great, but the former
    ought to have them too, at least from the localization / local content point of
    view. (And if they don't then there's a need that someone will try to
    address.)

    > > or better yet provide accurate raw data feed
    >
    > So, what do you consider a "raw data feed"? A URL of the form
    >
    > http://www.sil.org/iso639%2D3/documentation.asp?id=aaa
    >
    > will return data pertaining to the identifier "aaa". That particular URL
    > will return data in HTML format; I'm guessing by "raw data" you want
    > something other than HTML. You want plain text? Some kind of XML record?
    > Such things certainly can be considered, though I can't speak for the
    > 639-3 RA regarding their openness to doing that.

    Frankly I'm hazy on this, but thought that at least a tab- or
    semicolon-delimited plain text file with the current/updated codes and names
    (all) could be released regularly (even when there are no changes). Like you
    describe, I think.
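To make the idea concrete, here is a minimal sketch of how a site might read such a delimited plain text release. Everything about the file layout is an assumption: the `id` and `name` column headers, the `#` comment line carrying a release date, and the sample rows are all hypothetical, since no actual format has been specified.

```python
import csv

# Hypothetical release file: a comment line, a header row, then one
# tab-delimited record per code. Names follow the spellings in this thread.
SAMPLE = (
    "# hypothetical release 2005-08-23\n"
    "id\tname\n"
    "fuc\tPulaar\n"
    "bam\tBamanan\n"
    "dyu\tJula\n"
)


def load_code_table(text, delimiter="\t"):
    """Parse a delimited code listing into a {code: name} mapping."""
    # Drop blank lines and comment lines before handing rows to csv.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter=delimiter)
    return {row["id"]: row["name"] for row in reader}


table = load_code_table(SAMPLE)
```

The same function would handle a semicolon-delimited release by passing `delimiter=";"`.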
      
    > > maybe the ISO-639 lists such as they are will need some sort of
    > > revisions at some point with respect to what languages (dialects)
    > > are represented at the "language" and macrolanguage levels, and
    > > what the relationship among them is.
    > > The example of Fula/Peul and its variant forms that I mentioned
    > > above is an interesting case in point - the fundamental unity
    > > and evident diversity of the language(s) are such that one could
    > > imagine the utility of tagging Pulaar as ff-fuc - that is
    > > Fula-Pulaar, using ISO-639-1 (always the preference over
    > > ISO-639-2 where there are both, as I understand it from the W3C
    > > site) and ISO/DIS-639-3, though such nesting of ISO-639-3 I
    > > understand not to be intended. Further specification by country
    > > code would be helpful since the orthography in Senegal varies
    > > slightly from that in neighboring Mali and perhaps Mauritania.
    >
    > It's not clear to me what you feel is lacking here. The 639-3 site will
    > tell you that the category "ff"/"ful" is a macrolanguage, and what is
    > the list of its encompassed individual languages, which list will
    > include the category "fuc". The record for "fuc" will document its
    > properties as defined in ISO 639-3 and will also include links to
    > external resources such as Ethnologue that will document its denotation
    > more fully. If there is more you think is required, please clarify.

    The issue goes deeper, which is potentially problematic, but which also may
    counsel us to be flexible in defining some terms. For some, what are here
    called macrolanguage and language are simply language and dialect. This is not
    the place to discuss the theory (where as a non-linguist I'd be at a
    disadvantage anyway) but to recognize that there are also ideological &
    identity issues involved, and that on a practical level there are "languages"
    close enough that for localization purposes (and developing locales) one might
    treat them together under a "macrolanguage" label. Or, even if separate
    localizations are necessary, some way of identifying diverse pages as
    varieties of the same language might be needed.
    This is a common issue in the case of African languages - not just Fula - from
    my limited personal experience and what I've learned otherwise. (In the case of
    Fula I've seen for instance speakers of ff-fuc-MR speak with speakers of
    ff-fuq-NE; I've personally made the transition from ff-ffm-ML to ff-fuf-GN, and
    was glad I learned in that order since the latter is a bit of an outlier - a
    native ff-fuf-GW speaker I knew had real trouble with ff-fuh-NE, which I could
    handle with no particular problem. So on one level it's more or less all ff [or
    ful, the ISO-639-2 version], and on another level [texts perhaps] the divisions
    are significant, but one cannot assume that the language[s] should always in
    all situations be treated on the macro or individual levels.) So I guess I'm
    saying that the presentation and definition of the codes should accommodate
    these coexisting realities. More below.
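The two levels can be sketched in code using the informal nested tags from the examples above. This is purely illustrative: the macrolanguage-individual-region layout is the ad hoc notation used in this message, which I understand is not the intended use of ISO-639-3, and the names `NestedTag`, `parse_nested_tag`, `same_macrolanguage`, and `same_variety` are hypothetical.

```python
from typing import NamedTuple, Optional


class NestedTag(NamedTuple):
    macro: str                 # e.g. "ff" (or "ful")
    individual: Optional[str]  # e.g. "fuc"
    region: Optional[str]      # e.g. "MR"


def parse_nested_tag(tag: str) -> NestedTag:
    """Split an informal tag such as "ff-fuc-MR" into its parts."""
    parts = tag.split("-")
    if len(parts) > 3 or not all(p.isalpha() for p in parts):
        raise ValueError(f"unrecognized tag: {tag!r}")
    individual = region = None
    for part in parts[1:]:
        # A three-letter part is the individual language and must come
        # before any region; a two-letter part is the country code.
        if len(part) == 3 and individual is None and region is None:
            individual = part.lower()
        elif len(part) == 2 and region is None:
            region = part.upper()
        else:
            raise ValueError(f"unrecognized tag: {tag!r}")
    return NestedTag(parts[0].lower(), individual, region)


def same_macrolanguage(a: str, b: str) -> bool:
    """True when both tags fall under the same macro-level code."""
    return parse_nested_tag(a).macro == parse_nested_tag(b).macro


def same_variety(a: str, b: str) -> bool:
    """True when both tags name the same individual language."""
    ta, tb = parse_nested_tag(a), parse_nested_tag(b)
    return (ta.macro, ta.individual) == (tb.macro, tb.individual)
```

Here `same_macrolanguage("ff-fuc-MR", "ff-fuq-NE")` is true while `same_variety` on the same pair is false, so a site could group pages at the macro level and still distinguish them at the individual level when the situation calls for it.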

    > > Anyway, these are clearly not easy decisions and I know that in the
    > > interests of "stability" one can't go about undoing and renaming
    > > existing codes. But these are matters that will likely prompt
    > > (provoke?) more discussion as various users, webmasters, and
    > > localizers come into contact with and attempt to use the standard
    > > (lang tagging web content; localization) for languages currently
    > > less-represented in computing and cyberspace.
    >
    > !!! How did we suddenly go from providing a raw data feed to questions
    > of choices of IDs for particular languages?

    The questions are more of choices of languages for IDs, as it were. And they
    arise (at least in my mind) when I spend a little time going through the list.
    It's hard to

    Another West African case comes to mind: Manding or "Mande core" languages of
    Mandinka, Maninka, Bamanan, and Jula. Currently there is a macrolanguage
    Mandingo defined in ISO-639-2 (man) as including several varieties of Mandinka
    and Maninka. It might just as well also include other closely related Manding
    tongues, notably Bamanan (bm/bam/bam) and Jula (dyu/dyu). In any event, the
    latter two are close enough that the subject of a common localization for them
    is being discussed - but no macrolanguage tag is available (though bm might be
    used). Also the whole N'ko concept is predicated in large part on the unity of
    Manding peoples.

    It seems that the ISO-639-1 codes were assembled first with a limited purpose
    and then added to without a clear plan (please correct me if this impression
    is off). Then ISO-639-2 was added, and now there is the new proposed standard,
    each with its own purpose. Of the three, ISO/DIS-639-3 seems to reflect the
    clearest and most systematic methodology, but most would see it as a
    "splitter" approach. That isn't "wrong", but it is one perspective, and it may
    leave people who seek a "joined" solution looking to the earlier standards
    (1&2) for an acceptable tag.

    > > Thanks for any feedback. (One logical suggestion is that this go to
    > > the ISO-639 list - perhaps someone could forward it there and I guess
    > > I'll have to subscribe.)
    >
    > We need to watch that this doesn't go too far off topic for the lists to
    > which this is addressed.

    I hear you ...
     
    Don

    Don Osborn, Ph.D. dzo@bisharat.net
    *Bisharat! A language, technology & development initiative
    *Bisharat! Initiative langues - technologie - développement
    http://www.bisharat.net


