Re: the Ethnologue

Date: Wed Sep 20 2000 - 13:30:28 EDT

On 09/16/2000 12:56:31 PM Doug Ewell wrote:

>Here's another thing about the Ethnologue list that has been almost,
>but not quite, addressed. Just so everyone knows, the point here is
>*NOT* that the six or seven thousand additional languages in Ethnologue
>are somehow not worthy of encoding, but that the list is incompletely
>edited and not ready to be enshrined as an international standard or
>as the basis for one.
>I downloaded the tab-delimited list (langcodes.tdf) from the SIL FTP
>site and discovered that some abbreviations were duplicated...

Doug, I'm afraid that your assessment is based on a misunderstanding of the
way the information you were seeing was organised. John Cowan touched on
this; I'll explain more fully. First, though, we do acknowledge a fault on
our part for allowing that data to be available without documenting how it
is organised.

The records in the text file you looked at are language-countries: each
record describes one language as spoken in one country. It is important to
understand that the categorization is reflected not by the records in that
file but by the three-letter codes. Codes are duplicated because the
languages in question are spoken in more than one country.
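
To illustrate, here is a minimal sketch in Python of reading such records by
code rather than by row. The record layout (code, country, name) and the
country labels are assumptions made for the example, not the actual layout
of langcodes.tdf; only the SHU and MKJ names come from this thread.

```python
from collections import defaultdict

# Hypothetical language-country records: (code, country, preferred name).
# The field order and the "Country A/B/C" labels are placeholders.
RECORDS = [
    ("SHU", "Country A", "Arabic, Chadian Spoken"),
    ("SHU", "Country B", "Arabic, Shuwa"),
    ("MKJ", "Country C", "Macedonian"),
]


def group_by_code(records):
    """Collapse language-country records into one entry per three-letter code."""
    languages = defaultdict(dict)
    for code, country, name in records:
        # Each country keeps its own preferred name for the same language.
        languages[code][country] = name
    return dict(languages)


languages = group_by_code(RECORDS)
for code, names_by_country in sorted(languages.items()):
    print(code, names_by_country)
```

Seen this way, a repeated code is one language with several country
entries, not several conflicting entries for the same code.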

The Ethnologue has, in the past, been maintained in a textual, flat-file
database. It was organised by language-countries to accommodate the
organisation of the published versions, in which the data are presented by
country then by language. A flat-file database was originally used because
the database dates back to before the advent of relational databases. Work
has begun to get the data into a relational structure. Once that is done,
it will be possible to view the data in other ways, including directly by
language.

>I looked
>further and found 614 duplicate cases where the language code and
>primary name were identical, but the list of alternate names differed.

This probably reflects the fact that alternate names differ from one
country to another.

>But it gets worse. When I stripped out the alternate-names field and
>again checked for duplicated codes, I found 14 (AVL AYL CAG CTO FUV GAX
>GSC GSW JUP MHI MHM MKJ SHU SRC). Some of these duplicates differ only
>in spelling (CAG 'Chulupi' vs. 'Chulupí')

Spelling differences are indeed an unfortunate inconsistency in the data,
and they exist precisely because a non-relational database has been used.
This will be cleaned up.

> but other differences are a
>lot more troubling. For example, SHU is both 'Arabic, Chadian Spoken'
>and 'Arabic, Shuwa.' As a non-expert in Arabic, how do I know these
>two names describe the same dialect of Arabic? (These are certainly
>dialects, not discrete languages.)

The intention is that you can tell that these are considered the same
language because they have the same three-letter code. It is not the name
that indicates the categorization, but the codes. The reason for
encountering two different records with the same code but different names
is that different names are considered the default or preferred form in
each country. Again, once the data have been re-organised relationally, it
will be possible to show that there is a single language, that it is spoken
in different countries, that there are various alternate names used, and
that certain names are associated with or are preferred in certain
countries.

>MKJ is the Ethnologue code for both 'Macedonian' and 'Slavic'.
>Absolutely *everyone* knows there is no one 'Slavic' language; the name
>refers to an entire language family. This is much more imprecise than
>any of the despised 'Other' codes in ISO 639.

As, I think, Michael Everson pointed out, "Slavic" is presented as one
alternate name that is sometimes used. The Ethnologue is *not* trying to
suggest that MKJ is all Slavic languages. Again, the view into the data has
unfortunately been misleading for you.

>SRC is the code for 'Bosnian', 'Croatian', and 'Serbo-Croatian', which
>means that there is a many-to-one mapping from ISO 639-1 'bs', 'hr',
>'sr' to Ethnologue 'SRC'. This is likely to cause much more widespread
>trouble than the Hopi example mentioned earlier.

This is exactly an example of what Gary and I have argued: different
categorizations based on different operational definitions for different
purposes, each of which may be valid. The reason that the Ethnologue has
only one category where ISO 639-x has multiple categories is that the two
categorizations are based on different definitions for different purposes.
Ethnologue has only one because no evidence has been provided to indicate
that there are distinct, mutually non-intelligible speech varieties. That's
the primary basis of categorization.

This is not a problem at all. For applications that require (for whatever
reason) a distinction between Serbian, Croatian etc., the ISO codes are
available. But for applications that are concerned only with mutually
non-intelligible speech varieties, the best practice would be to have a
single code. There is no single right way to "tile the plane". Your
example points directly to that fact. I should think you would agree with
us, then, that we need to acknowledge this fact, and move on to find a
solution. The solution we have proposed is the use of distinct namespaces
based on distinct operational definitions.
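
As a rough sketch of how distinct namespaces might look to an application,
consider the Python below. The namespace labels ("iso639-1", "ethnologue"),
the table, and the function names are assumptions made for illustration;
only the bs/hr/sr to SRC correspondence is taken from the example above.

```python
# Hypothetical namespace-qualified codes: (namespace, code).
# The mapping records which Ethnologue category each ISO 639-1 code falls in.
ISO_TO_ETHNOLOGUE = {
    ("iso639-1", "bs"): ("ethnologue", "SRC"),
    ("iso639-1", "hr"): ("ethnologue", "SRC"),
    ("iso639-1", "sr"): ("ethnologue", "SRC"),
}


def to_ethnologue(tag):
    """Return the Ethnologue-namespace code for an ISO tag, or None if unmapped."""
    return ISO_TO_ETHNOLOGUE.get(tag)


def same_ethnologue_language(tag_a, tag_b):
    """True if both ISO tags map to the same Ethnologue category."""
    a, b = to_ethnologue(tag_a), to_ethnologue(tag_b)
    return a is not None and a == b


# Distinct in the ISO namespace, a single category in the Ethnologue namespace.
print(same_ethnologue_language(("iso639-1", "bs"), ("iso639-1", "hr")))
```

An application that needs the Serbian/Croatian distinction compares the ISO
tags directly; one concerned only with mutual intelligibility compares the
mapped codes. Neither namespace has to be altered to suit the other.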

>And the duplicated codes in the Ethnologue list must be
>edited down to one code each, or the list will not earn the respect for
>accuracy that it perhaps deserves.

I hope I've adequately demonstrated that there is no problem in the data
that needs to be solved. What we do need to do is to provide users with
better views of the data, and we are working toward that end.
