RE: Questions re ISO-639-1,2,3

From: Donald Z. Osborn (dzo@bisharat.net)
Date: Tue Aug 23 2005 - 10:35:30 CDT

Next message: Mark Davis: "Re: ldml dtd"

Previous message: Alexej Kryukov: "Re: Historical Cyrillic in Unicode"
Maybe in reply to: Donald Z. Osborn: "RE: Questions re ISO-639-1,2,3"
Next in thread: Doug Ewell: "Re: Questions re ISO-639-1,2,3"
Reply: Doug Ewell: "Re: Questions re ISO-639-1,2,3"

Peter, Thanks for your questions and thoughts. I reply in text, below...

Quoting Peter Constable <petercon@microsoft.com>:
> > From: Donald Z. Osborn [mailto:dzo@bisharat.net]
>
DZO> > 1) It is seen as convenient to have a one-stop site for
> > various information relevant to localization. ...
...
PC> I have no problem with citing ISO 639 IDs for particular languages: that
> is something that we expect to be stable. It's quite another thing,
> however, if we're talking about a general listing of language
> identifiers. For the latter, I feel that people should refer to
> definitive sources: the official source or an approved mirror.

I think we agree - I was thinking that perhaps an additional way for the
official sites to make the data available (dynamically) to other sites could be
useful.

> > 2) ..., but there are gaps
> > and there is no search feature for the codes.
>
> Could you give me an example of what you're referring to by "gaps"?

Missing ISO-639-1 codes. Maybe gap was not the right word.

> A UI for searching could certainly be considered. I suspect that the RA
> for 639-3 will be open to considering additional UI features to meet
> user needs.

Sounds like a good idea. What about a collaboration with the RA for ISO-639-1&2
(the LOC) on a definitive site for all three listings? (Regardless of whether
anything comes of the data feed idea.) From a localization point of view it
would make more sense than to have two separate authoritative sites.

> > 3) ISO-639 data fed from the official sites could facilitate
> > devising a kind of relational database linking it to alternate
> > names for languages and perhaps groupings of languages.
>
> It would seem to me that such a relational database will either rely on
> an internal data table, in which case a downloadable file is what is
> wanted, or URIs pointing to record for particular languages that are
> available on the Internet. If that's what you're referring to, then the
> ISO 639-3 RA will provide that. But that does not imply a need for other
> sites to present mirrors or duplicates of the 639 code tables.

I was thinking - and again this is extrapolating maybe too far from a single
thought - that there might be diverse ways to present the data, that is the
codes of ISO-639, and that others might see a need for doing so on their own
sites. In such a case the issue would be assuring that the codes are accurate
and always up to date. We may not see the need, and indeed maybe no one else
would, but there apparently are people interested in presenting duplicate
tables/lists for their own reasons, and from among them maybe some would work
the data into more dynamic presentations of their own.

> > 3a) Say you were looking for the code for Pulaar...
>
> I think the idea of a search facility that can use alternate names such
> as those listed in Ethnologue would be a great idea. Of course, if the
> list of alternate names is incomplete, the user won't find what they're
> looking for, and it's unlikely a list could ever be complete.

Agreed. The issue is not being exhaustive as much as being more complete. A
"sounds like" search facility would be ideal for African languages, given that
there are so often variant spellings.
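
As a rough illustration of searching on alternate names with some tolerance
for variant spellings, here is a minimal sketch. It uses spelling similarity
from Python's standard difflib module as a stand-in for a true "sounds like"
(phonetic) match, and the small name index is an illustrative assumption, not
data taken from any official listing.

```python
import difflib

# Illustrative index of names -> codes; a real index would come from an
# authoritative source and would be far larger (assumption for this sketch).
NAME_INDEX = {
    "pulaar": "fuc",
    "pular": "fuf",
    "fulah": "ful",
    "fula": "ful",
    "peul": "ful",
    "bambara": "bam",
    "bamanankan": "bam",
    "jula": "dyu",
    "dyula": "dyu",
}

def lookup(query, cutoff=0.7):
    """Return (name, code) pairs whose name is spelled similarly to query."""
    names = difflib.get_close_matches(
        query.lower(), list(NAME_INDEX), n=5, cutoff=cutoff
    )
    return [(name, NAME_INDEX[name]) for name in names]

# A variant spelling still reaches the closely spelled entries.
for name, code in lookup("Poular"):
    print(name, code)
```

Spelling similarity will miss variants that sound alike but are written very
differently, so a phonetic scheme tuned to the languages in question would
likely do better.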

> > they would come up with an ISO-639-3 code for Pulaar but still be
> > ignorant of the ISO-639-1&2 codes for "Fulah/Peul" that might actually
> > serve the purpose intended by the user.
>
> If you're referring to the online version of Ethnologue, I suspect that
> once the 639-3 is launched the language descriptions on the Ethnologue
> site will provide hotlinks to information for that language on the ISO
> 639-3 site (and that would include association with macrolanguage
> categories).

I meant the SIL site on ISO/DIS-639-3 (perhaps we're writing of the same thing
but the latter is of course at sil.org, not ethnologue.com). And actually I'm
imagining a simpler case of a localizer or webmaster looking for the right
language tag (in my hypothetical example, looking first at the authoritative
site of the RA for ISO-639-2 (and 1, so it seems), loc.gov, and then at the
sil.org site). If the latter has these features, then great, but the former
ought to have them too, at least from the localization / local content point of
view. (And if they don't then there's a need that someone will try to
address.)

> > or better yet provide accurate raw data feed
>
> So, what do you consider a "raw data feed"? A URL of the form
>
> http://www.sil.org/iso639%2D3/documentation.asp?id=aaa
>
> will return data pertaining to the identifier "aaa". That particular URL
> will return data in HTML format; I'm guessing by "raw data" you want
> something other than HTML. You want plain text? Some kind of XML record?
> Such things certainly can be considered, though I can't speak for the
> 639-3 RA regarding their openness to doing that.

Frankly I'm hazy on this, but thought that at least a tab- or
semicolon-delimited plain text file with the current/updated codes and names
(all) could be released regularly (even when there are no changes). Like you
describe, I think.
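
To make the idea concrete, here is a minimal sketch of how another site might
read such a file. The column layout (639-3, 639-2, 639-1, name) and the sample
rows are assumptions for illustration only; no official release format is
implied.

```python
import csv
import io

# Hypothetical sample of a tab-delimited release. The header names and
# column order are assumptions, not an official format.
SAMPLE = (
    "id3\tid2\tid1\tname\n"
    "ful\tful\tff\tFulah\n"
    "fuc\t\t\tPulaar\n"
    "bam\tbam\tbm\tBambara\n"
    "dyu\tdyu\t\tDyula\n"
)

def load_codes(text):
    """Map each 639-3 code to its row; empty cells become None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    table = {}
    for row in reader:
        table[row["id3"]] = {key: (value or None) for key, value in row.items()}
    return table

codes = load_codes(SAMPLE)
print(codes["fuc"]["name"])
print(codes["bam"]["id1"])
```

A site that reloaded such a file on each release would stay in step with the
official list without maintaining its own copy by hand.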

> > maybe the ISO-639 lists such as they are will need some sort of
> > revisions at some point with respect to what languages (dialects)
> > are represented at the "language" and macrolanguage levels, and
> > what the relationship among them is.
> > The example of Fula/Peul and its variant forms that I mentioned
> > above is an interesting case in point - the fundamental unity
> > and evident diversity of the language(s) are such that one could
> > imagine the utility of tagging Pulaar as ff-fuc - that is
> > Fula-Pulaar, using ISO-639-1 (always the preference over
> > ISO-639-2 where there are both, as I understand it from the W3C
> > site) and ISO/DIS-639-3, though such nesting of ISO-639-3 I
> > understand not to be intended. Further specification by country
> > code would be helpful since the orthography in Senegal varies
> > slightly from that in neighboring Mali and perhaps Mauritania.
>
> It's not clear to me what you feel is lacking here. The 639-3 site will
> tell you that the category "ff"/"ful" is a macrolanguage, and what is
> the list of its encompassed individual languages, which list will
> include the category "fuc". The record for "fuc" will document its
> properties as defined in ISO 639-3 and will also include links to
> external resources such as Ethnologue that will document its denotation
> more fully. If there is more you think is required, please clarify.

The issue goes deeper, which is potentially problematic, but which also may
counsel us to be flexible in defining some terms. For some, what the standard
calls a macrolanguage and a language are simply a language and its dialects.
This is not the place to discuss the theory (where as a non-linguist I'd be at
a disadvantage anyway) but to recognize that there are also ideological &
identity issues involved, and that on a practical level there are "languages"
close enough that for localization purposes (and developing locales) one might
treat them together under a "macrolanguage" label. Or, even if separate
localizations are necessary, some way of identifying diverse pages as
varieties of the same language might be needed.
This is a common issue in the case of African languages - not just Fula - from
my limited personal experience and what I've learned otherwise. (In the case of
Fula I've seen for instance speakers of ff-fuc-MR speak with speakers of
ff-fuq-NE; I've personally made the transition from ff-ffm-ML to ff-fuf-GN, and
was glad I learned in that order since the latter is a bit of an outlier - a
native ff-fuf-GW speaker I knew had real trouble with ff-fuh-NE, which I could
handle with no particular problem. So on one level it's more or less all ff [or
ful, the ISO-639-2 version], and on another level [texts perhaps] the divisions
are significant, but one cannot assume that the language[s] should always in
all situations be treated on the macro or individual levels.) So I guess I'm
saying that the presentation and definition of the codes should accommodate
these coexisting realities. More below.
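
A minimal sketch of handling both levels at once, using the hypothetical
nested notation from the examples above (such as ff-fuc-MR). That notation is
only the illustrative form used in this message, not something the standards
define, and the helper names here are invented for the sketch.

```python
from typing import NamedTuple, Optional

class Tag(NamedTuple):
    macro: str                  # broad code, e.g. "ff"
    individual: Optional[str]   # three-letter individual-language code, if any
    region: Optional[str]       # two-letter country code, if any

def parse_tag(tag):
    """Split a hypothetical macro-individual-REGION tag into its parts."""
    parts = tag.split("-")
    individual = None
    region = None
    for part in parts[1:]:
        if len(part) == 3 and part.isalpha() and individual is None:
            individual = part.lower()
        elif len(part) == 2 and part.isalpha() and region is None:
            region = part.upper()
    return Tag(parts[0].lower(), individual, region)

def same_macrolanguage(a, b):
    """True when two tags share the same leading (macro-level) code."""
    return parse_tag(a).macro == parse_tag(b).macro

print(parse_tag("ff-fuc-MR"))
print(same_macrolanguage("ff-fuc-MR", "ff-fuq-NE"))
```

Software could then group pages at the macro level when that suffices and fall
back to the individual code or country where the differences matter. The
sketch does not treat "ff" and "ful" as equivalent; that would need a lookup
table.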

> > Anyway, these are clearly not easy decisions and I know that in the
> > interests of "stability" one can't go about undoing and renaming
> > existing codes. But these are matters that will likely prompt
> > (provoke?) more discussion as various users, webmasters, and
> > localizers come into contact with and attempt to use the standard
> > (lang tagging web content; localization) for languages currently
> > less-represented in computing and cyberspace.
>
> !!! How did we suddenly go from providing a raw data feed to questions
> of choices of IDs for particular languages?

The questions are more of choices of languages for IDs, as it were. And they
arise (at least in my mind) when I spend a little time going through the list.
It's hard to

Another West African case comes to mind: Manding or "Mande core" languages of
Mandinka, Maninka, Bamanan, and Jula. Currently there is a macrolanguage
Mandingo defined in ISO-639-2 (man) as including several varieties of Mandinka
and Maninka. It might just as well also include other closely related Manding
tongues, notably Bamanan (bm/bam/bam) and Jula (dyu/dyu). In any event, the
latter two are close enough that the subject of a common localization for them
is being discussed - but no macrolanguage tag is available (though bm might be
used). Also the whole N'ko concept is predicated in large part on the unity of
Manding peoples.

It seems that the ISO-639-1 codes were assembled first with a limited purpose
and then added to without a clear plan (please correct if this impression is
off). Then ISO-639-2 was added, and now the new proposed standard, each with
its own purpose. Of the three, ISO/DIS-639-3 seems to reflect the clearest and
most systematic methodology, but most would see that as a "splitter" approach.
That isn't "wrong," but it is one perspective and may leave people who seek a
"joined" solution looking to the earlier standards (1&2) for an acceptable
tag.

> > Thanks for any feedback. (One logical suggestion is that this go to
> > the ISO-639 list - perhaps someone could forward it there and I guess
> > I'll have to subscribe.)
>
> We need to watch that this doesn't go too far off topic for the lists to
> which this is addressed.

I hear you ...

Don

Don Osborn, Ph.D. dzo@bisharat.net
*Bisharat! A language, technology & development initiative
*Bisharat! Initiative langues - technologie - développement
http://www.bisharat.net


This archive was generated by hypermail 2.1.5 : Tue Aug 23 2005 - 10:36:04 CDT