Here's another thing about the Ethnologue list that has been almost,
but not quite, addressed. Just so everyone knows, the point here is
*NOT* that the six or seven thousand additional languages in Ethnologue
are somehow not worthy of encoding, but that the list is incompletely
edited and not ready to be enshrined as an international standard or
as the basis for one.

I downloaded the tab-delimited list (langcodes.tdf) from the SIL FTP
site and discovered that some abbreviations were duplicated. I looked
further and found 614 duplicate cases where the language code and
primary name were identical, but the list of alternate names differed.
OK, I thought, I can see that; the list of alternate names was too long
for one line, so they made two lines and split the alternates between
them. Fair enough. (It's not quite that clean, but you get the idea.)

But it gets worse. When I stripped out the alternate-names field and
again checked for duplicated codes, I found 14 (AVL AYL CAG CTO FUV GAX
GSC GSW JUP MHI MHM MKJ SHU SRC). Some of these duplicates differ only
in spelling (CAG 'Chulupi' vs. 'Chulupí') but other differences are a
lot more troubling. For example, SHU is both 'Arabic, Chadian Spoken'
and 'Arabic, Shuwa'. As a non-expert in Arabic, how do I know these
two names describe the same dialect of Arabic? (These are certainly
dialects, not discrete languages.)
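
For anyone who wants to repeat the two checks above, here is a rough
sketch in Python. It assumes each line of the file has three tab-separated
fields in the order code, primary name, alternate names; I have not
confirmed that this matches the real layout of langcodes.tdf, so adjust
the column indexes as needed. The function names are my own, and the
XXA/XXB rows and their alternates are made-up placeholders. Only the SHU
and MKJ names come from the list itself.

```python
import csv
from collections import defaultdict
from io import StringIO


def read_rows(fileobj):
    """Yield (code, primary_name, alternates) from a tab-delimited file.

    Assumed column order: code, primary name, alternate names.
    """
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for rec in reader:
        if len(rec) >= 2:
            alternates = rec[2] if len(rec) > 2 else ""
            yield rec[0], rec[1], alternates


def find_duplicate_codes(rows):
    """Return (split_entries, conflicting_codes).

    split_entries: (code, name) pairs that occur with more than one
        distinct alternate-names field (the first check).
    conflicting_codes: codes that carry more than one primary name once
        the alternates are ignored (the second check).
    """
    names_by_code = defaultdict(set)
    alternates_by_entry = defaultdict(set)
    for code, name, alternates in rows:
        names_by_code[code].add(name)
        alternates_by_entry[(code, name)].add(alternates)

    split_entries = sorted(
        key for key, alts in alternates_by_entry.items() if len(alts) > 1
    )
    conflicting_codes = {
        code: sorted(names)
        for code, names in names_by_code.items()
        if len(names) > 1
    }
    return split_entries, conflicting_codes


# Placeholder data: XXA and XXB are hypothetical codes for illustration.
sample = (
    "XXA\tExample\talt1\n"
    "XXA\tExample\talt2\n"
    "XXB\tUnique\t\n"
    "SHU\tArabic, Chadian Spoken\t\n"
    "SHU\tArabic, Shuwa\t\n"
    "MKJ\tMacedonian\t\n"
    "MKJ\tSlavic\t\n"
)
split_entries, conflicting_codes = find_duplicate_codes(read_rows(StringIO(sample)))
```

To run it against the real file, pass an open file object to read_rows()
in place of the StringIO sample, using whatever encoding the file
actually has.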

MKJ is the Ethnologue code for both 'Macedonian' and 'Slavic'.
Absolutely *everyone* knows there is no one 'Slavic' language; the name
refers to an entire language family. This is far less precise than
any of the despised 'Other' codes in ISO 639.

SRC is the code for 'Bosnian', 'Croatian', and 'Serbo-Croatian', which
means that there is a many-to-one mapping from ISO 639-1 'bs', 'hr',
'sr' to Ethnologue 'SRC'. This is likely to cause much more widespread
trouble than the Hopi example mentioned earlier.
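
A collision like this is easy to detect mechanically. The sketch below
inverts an ISO 639-1 to Ethnologue table and reports every Ethnologue
code reached from more than one ISO code. The function name is my own,
and the table holds only the three pairings described above plus an
assumed 'mk' to MKJ pairing as a non-colliding control; it is not an
authoritative mapping.

```python
from collections import defaultdict


def many_to_one(iso_to_ethnologue):
    """Return {ethnologue_code: [iso codes]} for codes with 2+ sources."""
    sources = defaultdict(list)
    for iso_code, eth_code in iso_to_ethnologue.items():
        sources[eth_code].append(iso_code)
    return {
        eth_code: sorted(iso_codes)
        for eth_code, iso_codes in sources.items()
        if len(iso_codes) > 1
    }


# Illustrative table only; the 'mk' row is an assumed pairing.
mapping = {"bs": "SRC", "hr": "SRC", "sr": "SRC", "mk": "MKJ"}
collisions = many_to_one(mapping)
```

Any code that shows up in the result cannot be converted back to
ISO 639-1 without extra information.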

Certainly more codes need to be added to ISO 639, and the Maintenance
Agency needs to be sure not to present an image of unresponsiveness
(if in fact they have been guilty of that in the past). However, they
have their own, existing guidelines for the level at which languages
should be encoded (one written vs. 60 spoken variants) and this must
be respected. And the duplicated codes in the Ethnologue list must be
resolved so that each code identifies exactly one entry, or the list
will not earn the respect for accuracy that it perhaps deserves.