Re: the Ethnologue

From: Peter_Constable@sil.org
Date: Wed Sep 13 2000 - 16:13:01 EDT


(Apologies for the cross-posting, but this thread has spanned several
lists, and no single list reaches everyone who is interested in the
discussion.)

On 09/13/2000 06:37:02 AM Michael Everson wrote:

>At 23:56 +0100 2000-09-12, Christopher J. Fynn wrote:
>
>>A lot of what are listed as "languages" in the Ethnologue are what most
>>people would call dialects. For instance almost every known dialect of
>>spoken Tibetan is listed as a separate language in the Ethnologue although
>>they all share only one written form.
>
>YES. This is one of the serious problems of the list.

This is a fallacious argument against Ethnologue's classification. As Gary
and I point out in our paper, there is not *one, correct* classification of
languages, but potentially many valid classifications serving different
purposes and depending upon one's operational definitions. We uphold
Ethnologue's operational definitions, which use mutual non-intelligibility
as a primary factor. For some major languages that are very familiar to
people, this may not give a classification that offers what they're looking
for given their particular purposes, but this isn't about coming up with
new codes for major languages. It's about codes for the thousands of
languages that currently have nothing.

>At 22:39 -0800 2000-09-12, Jörg Knappen wrote:
>
>>I once looked at the ethnologue...
>
>YES. This is one of the serious problems of the list.

I've already responded to this objection, and Michael hasn't added any new
argument.

>If SIL has 2000 real languages they need codes for in real applications,
>then those 2000 (which is a lot) should be proposed to 639 or 1766. That's
>what 639 and 1766 are for. It would be nice to know what the applications
>are.

Did you require those looking for tags for signed languages to demonstrate
need, including a statement of what applications were anticipated? No.
You're not being reasonable here. I've already enumerated several
independent agencies looking for tags for all these languages (the 6000+,
not just 2000). It seems that you keep coming back asking for more simply
because, for whatever reason, you don't want to accept that people could
really want all that. But you don't have your finger on the pulse of all
user needs, as is evident from this discussion.

>I do not think we should adopt all 6000 codes from the Ethnologue as
>"language tags". I am, frankly, shocked that linguists should consider
>doing so so uncritically.

Then perhaps, Michael, you'd like to go through the list and start telling
the linguists, anthropologists, governments, development agencies, etc. of
the world which of the 6000 they should feel free to ignore.

>Do you need a code for !Xóõ? May I ask what for?

For archiving linguistic data. For associating language-specific processes
to work with that data. For categorizing information about that speech
community that may be needed by government agencies interested in
education, or health maintenance, or economic development, or whatever; or
for categorizing similar information used by development or relief
agencies.

>There are (according to
>the Ethnologue) 3000-4000 speakers. According to Anthony Traill's _A !Xóõ
>Dictionary_, "!Xóõ is an unwritten language and its speakers have no notion
>of linguistic standardization". Well, honestly, whose IT requirements are
>you going to serve? (All the characters used in the dictionary are in the
>UCS. I checked, because that *is* a real and important requirement.)

And if you're going to have a written representation of the language, how
will you apply processes like data validation (aka spell checking),
morphological analysis, etc. if you don't have a way to tag the
language-specific resources needed?

>Do you really need 8 codes for "the German languages"?

It's not German I'm concerned about.

>How many "Tibetans" are there?

Ethnologue lists only one *language* called Tibetan (TIC). It lists 36
languages from the Tibetan family, and if you're interested in (say) Lhomi
(LHM) or Jirel (JUL) or Ladakhi (LBJ), then "Tibetan" doesn't meet your
needs.

>Is Samvedi a language? It "shares many features with Gujarati.
>Survey needed". How many times does "survey needed" appear in the
>Ethnologue? How many of them aren't really languages (in the sense that we
>need to implement for IT and libraries (which use IT by the way)) but only
>preliminary studies? HOW DO WE KNOW?

Ethnologue indicates which ones still require more research. Experience has
shown that in most cases these are not simply dialects of existing
languages but are distinct languages in their own right. At any given point
in time, the Ethnologue reflects the current understanding of the
situation. That this knowledge is incomplete is not a good argument for
saying that no identifier should be provided. By adding an
identifier now, what is at risk is that, at some later time, a relatively
*very small* amount of data becomes sub-optimally tagged. At that time, it
will be possible to refer to the Ethnologue based on the date of the data
and at least find out how knowledge of the sociolinguistic facts has
improved. That will give the user a good chance at succeeding in making
sense of the data.

Compare this with the current situation with ISO 639-x tags: What is at
risk is that, as the standard evolves, a large volume of data may become
*incorrectly* tagged. Users will compare the tag with the existing
inventory and know nothing more than that the language is "some S. American
Indian language other than one of this select list" (or something
comparable for
other parts of the world). They may in fact be misled by the fact that the
actual language by that time has its own tag, and so is no longer included
in the denotation of the collective tag. If they have access to the
revision history for ISO 639-x, they are left with the question as to
whether it could possibly be one of the few languages for which codes were
added in the intervening period since the data was created, but they have
no way to know. This is a FAR, FAR, FAR worse situation than that described
above in relation to changes in knowledge within the Ethnologue. The *only*
way to improve upon that would be for the authors to supplement that data
with additional meta-information giving more detail on the identification
of the language in question, but without having access to any common
reference as a standard set of codes provides, and subject to exactly the
same changes in understanding of sociolinguistic facts about the particular
speech variety reflected.

In other words, Michael, your argument against Ethnologue *falls flat*, and
if anything argues against ISO 639-x and in favour of Ethnologue.

>This is just another error, and for a language in Western Europe. I do not
>believe that the Ethnologue can be taken so uncritically.

There is no debate that there are errors. That does not invalidate the
value of the whole. ISO 10646 has errors and imperfections, but is still
useful. Insisting on perfection amounts to obstructing any serious progress
because perfection on something as fuzzy and dynamic as human language is
impossible. In the mean time, you are ignoring the requests of *lots* of
users for the identifiers that they feel they need.

>There may be problems with 639 and 1766 but the committees in question have
>been addressing these recently so that we can make and maintain more
>effective and responsive standards. Has that all been wasted effort? IT
>industry can circumvent the standards easily if it wants to. Is that a good
>idea?

Nobody wants to circumvent the standards, as long as the committees are
willing to respond to the needs. ISO has not yet acknowledged all of the
problems that exist in ISO 639-x. One of these is that the list of codes
falls far short of what users need, and their process is not equipped to
accommodate a large volume of requests. There are other problems, which
Gary and I have pointed out.

>The Ethnologue is an important resource. I use it, along with other
>resources, in my work. But I don't think it is mature enough to BE an
>international standard. A namespace in RFC 1766 could be created easily:
>define a tag "e-" for "Ethnologue" and allow it next to "i-" and "x-". But
>I have grave concerns about the wisdom of doing so, and nothing Peter has
>said has dispelled them.

But you also haven't seen Gary's and my paper (I'll have to send it to you
offline, without the benefit of the few revisions I'd rather make first),
and you don't seem to be aware of the many users asking for a comprehensive
list of identifiers.

>Lest anyone think that *I* am coming to shrink from supporting minority
>languages, I will say this... I'll be looking further into this and
>taking appropriate action to get the missing characters encoded in the
>Universal Character Set.

I'm glad that you've been responsive to the needs of character encoding. My
hope is that I can convince you to become as responsive to the needs for
language identifiers.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


