L2/01-215

From: Peter_Constable@sil.org
Sent: Monday, May 21, 2001 11:55 AM


Comments on L2/01-207 and on current work on language tagging in general
(Follow up of 85-M1/M2 and UTC 86 Agenda Item B.5.5)

First, I have definitely gotten the impression that requests from the US
Gov't have been a significant factor in getting TC 37 to pursue new work, and
work that goes far beyond what they have done before. I think it is good
that there is that interest and willingness. Of course, how productive and
useful it turns out to be will be determined in the execution.

Secondly, it has been apparent that some key people connected with TC 37
want to work with SIL and get some kind of connection established between
ISO 639 and the Ethnologue, the latter being viewed as the best thing
available in the way of a comprehensive list. Haavard sent me the Access DB
he referred to in N835, and invited our comments. Gerhard Budin (chair of
SC2) talked of the value of the Ethnologue at a meeting on metadata back in
Jan or Feb not knowing that my colleague Gary Simons was there; they went
on to have some discussions on interactions between the Ethnologue and ISO
639, with Gerhard indicating that he was willing to look at different
options including wholesale adoption of the balance of the Ethnologue, or
making SIL an approved registration authority over a portion of a new
namespace (a part 3 of ISO 639).

This indicates two things to me: that people in TC 37 may be open to a
truly comprehensive standard, and that they are open to making use of the
Ethnologue, which seems to be a significant shift from earlier indicators I
had received, and gives some common ground with at least some folk on UTC.

Third, in spite of these positive indicators, it is not to be assumed that
left to its own devices TC 37 would necessarily come up with exactly what
all or even most of the stakeholders in this issue might need and want to
see happen (though I'm not sure the stakeholders themselves have entirely
figured out what they need). On the other hand, I have seen an
openness from Gerhard and the relevant ANSI TAG to get all of the
stakeholders at the same table.

Fourth, it's my impression that Haavard would like to see the denotation of
ISO LANGIDS be pinned down and documented better. I see this reflected in
what he put into the DB, but also in his invitation for us to offer input.
Of course, getting the meaning of something like "ar" nailed down will
likely require the consensus of the WG rather than being something he can
decide unilaterally (or, at least, I'm guessing that's the case). I don't
see how they can possibly move forward with any of the work items proposed
in N835 without having taken that important step. (E.g. you can't really
add a bunch of new things until it's clearer just what you already have.)

In this latter regard, Gary Simons has been doing a bunch of research, and
has created a SQL database with explicit mappings between ISO 639-x and the
Ethnologue. He has the tables prepared for everything in ISO 639-1, and is
preparing to send a report to Gerhard and Haavard probably this week. This
represents comments of the sort Haavard requested (Gary used Haavard's info
as a primary source, though I think there are some cases where he is
proposing different decisions from what Haavard had made), but it goes
somewhat farther. For one thing, it has an ASP interface and is able to
respond with HTML reports to queries over the net (XML, obviously, also
possible). Secondly, not only does it indicate which
Ethnologue code(s) a given ISO code corresponds to, but it also makes it
clear when an ISO code is referring to a cluster of languages (not always
evident from the specified name), and it also makes clear if there are
cases of Ethnologue entries with names that are similar to the ISO names
but which are *not* in fact part of the denotation of that ISO code. In
other words, it has the potential to provide the explicit documentation
needed by ISO 639-1 and -2 to make clear just what each of the 2- and
3-letter codes denotes, and thereby address one of the key problems Gary
and I discussed in our IUC 17 paper.
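
To make the shape of such a mapping concrete, here is a minimal sketch in
Python of the three pieces of information described above: which
Ethnologue code(s) an ISO code covers, whether it is therefore a cluster,
and which similarly named Ethnologue entries are explicitly excluded. This
is not Gary's actual schema; the class, the field names, and the codes
"xx" and ETH1..ETH3 are all placeholders invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IsoMapping:
    """One ISO 639 code and what it is documented to denote (hypothetical)."""
    iso_code: str
    # Ethnologue codes covered by the ISO code; more than one means a cluster.
    ethnologue_codes: List[str] = field(default_factory=list)
    # Ethnologue entries with similar names that are NOT part of the denotation.
    excluded_lookalikes: List[str] = field(default_factory=list)

    def is_cluster(self) -> bool:
        return len(self.ethnologue_codes) > 1

    def denotes(self, ethnologue_code: str) -> bool:
        return ethnologue_code in self.ethnologue_codes


# Placeholder data only: "xx" and ETH1..ETH3 are not real codes.
example = IsoMapping("xx", ["ETH1", "ETH2"], ["ETH3"])
```

Keeping the excluded look-alikes in the record itself, rather than leaving
them implicit, is what lets a query say "no, not that one" instead of
simply finding no match.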

This work is important for us, whatever happens. It will be necessary to
bring about something connecting ISO 639 and the Ethnologue, as Gerhard has
suggested, but it will also be an essential step if RFC 3066 were to be
extended by a mechanism to allow independent naming authorities and we
wanted to see SIL become one such naming authority (to maintain backward
compatibility with the existing Internet standard, an ISO code would have
to be used in preference to an Ethnologue code for the same denotatum). So,
things are progressing in this key regard, at least.
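
The backward-compatibility rule in the parenthesis above can be sketched as
a small Python function: if an ISO code exists for the denotatum, use it;
only otherwise fall back to the Ethnologue code. No such extension to RFC
3066 has been defined, so the function name and the "x-sil-" private-use
prefix are assumptions made purely for illustration, as is the placeholder
code ETH1.

```python
from typing import Optional


def preferred_tag(iso_code: Optional[str], ethnologue_code: str) -> str:
    """Pick the tag for one denotatum (hypothetical helper)."""
    if iso_code:
        # An ISO 639 code exists, so it takes precedence.
        return iso_code.lower()
    # Assumed fallback syntax: a private-use tag naming SIL as the authority.
    return "x-sil-" + ethnologue_code.lower()
```

For instance, preferred_tag("ar", "ETH1") keeps the ISO code, while
preferred_tag(None, "ETH1") falls back to the assumed private-use form.
The rule can only be applied once the mapping between the two code sets
has been pinned down, which is why that work has to come first.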


Fifth, in N835, Haavard has suggested a program of work for coding of
language variation. Depending on exactly what people are thinking of
codifying, this is potentially dangerous territory -- dangerous in the
sense of being rich in problems that could make success unlikely. For
example, if someone wants to codify dialects for all the world's languages,
or even all dialects of a few major languages, this is in principle not
possible since there is no operational definition of dialect that can give
any kind of objective results. The output of such an effort would be a
hodgepodge of identifiers that are unclear as to their meaning and are used
with considerable inconsistency. Similarly for genetic classifications
(Haavard specifically mentions a "formalism to express the hierarchy of
language families", by which I assume he must mean genetic
classifications), there is a limited level of genetic depth at which
linguists will have a fair amount of agreement for many language families,
but at any significant depth in the family trees there is more often
considerable disagreement, which is to be expected when linguists are
proposing theories that are to some extent mere conjecture based on limited
evidence -- there usually just isn't a lot of data available. In summary,
there are many aspects of language variation in which a technical body such
as TC 37 (or UTC) does not have the necessary expertise to make appropriate
judgements, and many which in principle may not be amenable to
comprehensive formalisation.

This does not mean that certain aspects of language variation should not be
explored by TC 37 or other technical bodies. There are some dimensions of
variation in linguistic and paralinguistic (e.g. writing system) categories
that are appropriate for formalising, and some that may not be easy to
formalise but which industry does need to grapple with. A careful analysis
is needed, however, to determine what the actual needs for information
technology are, and what the best approach is to meeting each of those
needs, given the nature of something as dynamic and variable as language.
Before we try to codify "language variation", we need to be clear as to
exactly what dimension of variation it is we're trying to codify, why we
are doing it / what it needs to accomplish, and that there is an approach
to the codification that will meet the IT needs and also can succeed.


- Peter


---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>

