L2/01-215
From: Peter_Constable@sil.org
Sent: Monday, May 21, 2001 11:55 AM

Comments on L2/01-207 and on current work on language tagging in general
(Follow up of 85-M1/M2 and UTC 86 Agenda Item B.5.5)

First, I have definitely gotten the impression that requests from the US Gov't have been a significant factor in getting TC 37 to pursue new work, and work that goes far beyond what they have done before. I think it is good that there is that interest and willingness. Of course, how productive and useful it turns out to be will be determined in the execution.

Secondly, it has been apparent that some key people connected with TC 37 want to work with SIL and get some kind of connection established between ISO 639 and the Ethnologue, the latter being viewed as the best thing available in the way of a comprehensive list. Haavard sent me the Access DB he referred to in N835, and invited our comments. Gerhard Budin (chair of SC2) talked of the value of the Ethnologue at a meeting on metadata back in Jan or Feb, not knowing that my colleague Gary Simons was there; they went on to have some discussions on interactions between the Ethnologue and ISO 639, with Gerhard indicating that he was willing to look at different options, including wholesale adoption of the balance of the Ethnologue, or making SIL an approved registration authority over a portion of a new namespace (a part 3 of ISO 639). This indicates two things to me: that people in TC 37 may be open to a truly comprehensive standard, and that they are open to making use of the Ethnologue. The latter seems to be a significant shift from earlier indicators I had received, and gives some common ground with at least some folk on UTC.

Third, in spite of these positive indicators, it is not to be assumed that left to its own devices TC 37 would necessarily come up with exactly what all or even most of the stakeholders in this issue might need and want to see happen (though I'm not sure the stakeholders themselves have entirely figured out what they need). On the other hand, I have seen an openness from Gerhard and the relevant ANSI TAG to get all of the stakeholders at the same table.

Fourth, it's my impression that Haavard would like to see the denotation of ISO LANGIDS be pinned down and documented better. I see this reflected in what he put into the DB, but also in his invitation for us to offer input. Of course, getting the meaning of something like "ar" nailed down will likely require the consensus of the WG rather than being something he can decide unilaterally (or, at least, I'm guessing that's the case). I don't see how they can possibly move forward with any of the work items proposed in N835 without having taken that important step. (E.g. you can't really add a bunch of new things until it's clearer just what you already have.)

In this latter regard, Gary Simons has been doing a bunch of research, and has created a SQL database with explicit mappings between ISO 639-x and the Ethnologue. He has the tables prepared for everything in ISO 639-1, and is preparing to send a report to Gerhard and Haavard, probably this week. This represents comments of the sort Haavard requested (Gary used Haavard's info as a primary source, though I think there are some cases where he is proposing different decisions from what Haavard had made), but it goes somewhat farther. For one thing, it has an ASP interface and is able to respond with HTML reports to queries over the net (XML, obviously, also possible).
For another, not only does it indicate which Ethnologue code(s) a given ISO code corresponds to, but it also makes it clear when an ISO code is referring to a cluster of languages (not always evident from the specified name), and it also makes clear if there are cases of Ethnologue entries with names that are similar to the ISO names but which are *not* in fact part of the denotation of that ISO code. In other words, it has the potential to provide the explicit documentation needed by ISO 639-1 and -2 to make clear just what each of the 2- and 3-letter codes denotes, and thereby address one of the key problems Gary and I discussed in our IUC 17 paper.

This work is important for us, whatever happens. It will be necessary to bring about something connecting ISO 639 and the Ethnologue, as Gerhard has suggested, but it will also be an essential step if RFC 3066 were to be extended by a mechanism to allow independent naming authorities and we wanted to see SIL become one such naming authority (to maintain backward compatibility with the existing Internet standard, an ISO code would have to be used in preference to an Ethnologue code for the same denotatum). So, things are progressing in this key regard, at least.

Fifth, in N835, Haavard has suggested a program of work for coding of language variation. Depending on exactly what people are thinking of codifying, this is potentially dangerous territory -- dangerous in the sense of being rich in problems that could make success unlikely. For example, if someone wants to codify dialects for all the world's languages, or even all dialects of a few major languages, this is in principle not possible, since there is no operational definition of dialect that can give any kind of objective results. The output of such an effort would be a hodgepodge of identifiers that are unclear as to their meaning and are used with considerable inconsistency.
Similarly for genetic classifications (Haavard specifically mentions a "formalism to express the hierarchy of language families", by which I assume he must mean genetic classifications), there is a limited level of genetic depth at which linguists will have a fair amount of agreement for many language families, but at any significant depth in the family trees there is more often considerable disagreement, which is to be expected when linguists are proposing theories that are to some extent mere conjecture based on limited evidence -- there usually just isn't a lot of data available.

In summary, there are many aspects of language variation in which a technical body such as TC 37 (or UTC) does not have the necessary expertise to make appropriate judgements, and many which in principle may not be amenable to comprehensive formalisation.

This does not mean that certain aspects of language variation should not be explored by TC 37 or other technical bodies. There are some dimensions of variation in linguistic and paralinguistic (e.g. writing system) categories that are appropriate for formalising, and some that may not be easy to formalise but which industry does need to grapple with. A careful analysis is needed, however, to determine what the actual needs for information technology are, and what the best approach is to meeting each of those needs, given the nature of something as dynamic and variable as language. Before we try to codify "language variation", we need to be clear as to exactly what dimension of variation it is we're trying to codify, why we are doing it / what it needs to accomplish, and that there is an approach to the codification that will meet the IT needs and also can succeed.

- Peter

---------------------------------------------------------------------------
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: 3