Re: the Ethnologue

Date: Tue Sep 12 2000 - 23:24:11 EDT

On 09/12/2000 08:08:14 PM "Christopher J. Fynn" wrote:

>I'm not qualified to judge the merits of one list over another
>but there certainly are other comprehensive and well researched
>lists e.g. the Linguasphere Register of the World's Languages
>and Speech Communities see:
>Unfortunately their list is not available online, you have to buy
>the book - a bit like ISO/IEC 10646 and many other standards
>I do know that the way the compilers of the Linguasphere have
>classified languages and dialects is different than the way the
>compilers of the Ethnologue have - though I'm sure both could
>give you well reasoned arguments why their scheme is better
>or more useful than the other.

I think the Linguasphere is a valuable publication, and the only
alternative I'm aware of that is a serious contender to the Ethnologue. My
concerns about it are:

- As Chris mentioned, the info isn't available online. I consider the
availability of online documentation to back up a set of codes to be
essential. Otherwise, there is no easy way for users to find out what
things mean.

- The Linguasphere uses a hierarchical system that begins with 10 divisions
in each of 10 major regions. This was done specifically to avoid questions
about higher-level genetic relationships, but the divisions end up being
rather arbitrary. The languages of the world do not in fact neatly divide
into 10 major groups in each of 10 major regions.

- There is a multi-level hierarchy that begins at levels above what the
Ethnologue considers to be a language, and goes below that level. There is
no certainty that one category in one place within the Linguasphere catalog
that is at a given level represents exactly the same kind of object as
other categories at the same level elsewhere in the catalog. Also, it is
not clear which of these levels are or are not useful for the purposes of
language-specific processing.

In contrast, it is our experience that the categories reflected in the
Ethnologue are the most generally useful for language-specific processing.
There are some exceptions to this (e.g. Murray Sargent pointed out that
there are regional-variant spelling conventions for English), but these are
the exception rather than the norm. Note also that something like spell
checking involves a *paralinguistic* notion, viz. spelling/orthographic
conventions, rather than the notion of *language* itself. There are clearly
cases of language-specific processing which will need to rely on some
paralinguistic notion such as "spelling/orthographic convention" or
"writing system". Two points on this. First, this area is not yet well
enough understood to come up with comprehensive enumerations of identifiers
for these various purposes. Second, identifiers that are appropriate for
these purposes will generally build from a set of *language* identifiers as
a starting point. (E.g. if you're going to enumerate writing systems, you'll
need to
begin with an enumeration of languages.) As Rick responded to Murray,
Ethnologue codes don't solve all problems, but they do give us a
comprehensive list of modern languages that represents a good starting
point from which to work.
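
To make the "build from language identifiers" point concrete, here is a
minimal sketch. Everything in it is an assumption for illustration: the
codes "AAA" and "BBB" are placeholders (not real Ethnologue codes), and
WritingSystemId and enumerate_writing_systems are hypothetical names, not
part of any existing scheme.

```python
from dataclasses import dataclass

# Placeholder language identifiers (hypothetical, not real Ethnologue codes).
LANGUAGES = {"AAA", "BBB"}


@dataclass(frozen=True)
class WritingSystemId:
    """A paralinguistic identifier layered on top of a language identifier."""
    language: str    # must come from the language enumeration
    convention: str  # hypothetical label for a script or spelling convention

    def __post_init__(self):
        # A writing-system identifier is only meaningful for a known language.
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language identifier: {self.language}")


def enumerate_writing_systems(conventions):
    """Build writing-system identifiers from {language: [convention, ...]}."""
    return [
        WritingSystemId(lang, conv)
        for lang, convs in sorted(conventions.items())
        for conv in convs
    ]
```

The design choice being illustrated is only that the language list comes
first: the second enumeration cannot even be constructed without it.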

So, for these three reasons, I don't think the Linguasphere is as good a
choice as the Ethnologue for language identifiers for IT purposes. It would
be useful for
documenting what identifiers within some system of identifiers denote,
except that the information is not available online.

Some are of the opinion that a hierarchical system is needed. A few people
at IUC17 commented that Ethnologue codes should be supplemented in this
way. Two comments:

1. Someone pointed out during the discussion time that there are many possible
alternate hierarchies based on orthogonal factors (e.g. inferred genetic
relationship, historical connections, geographic proximity, linguistic
similarity, related writing traditions, ...). It would be impossible to
have a single hierarchy that does all of this. (One further comment about
Linguasphere: I haven't read all of the introductory material, but there is
an indication that the choice was made to *not* base the hierarchy on
inferred genetic relationships since this was not considered relevant for
understanding the current socio-linguistic settings of language
communities. That raises the question of just what basis Linguasphere's
hierarchy *is* built on - it's not clear to me what this is.)

2. I don't think there is a clear understanding of what purposes
hierarchical categories would serve. Certainly a hierarchical,
non-leaf-node category can be useful for subject indexing (e.g. to find any
materials about Uto-Aztecan languages), but I don't think it's clear what
other useful purpose such a category would serve. I think it would be
better that identifiers for subject catalogs *not* get mixed up with
identifiers for language-specific information processing. In general,
non-leaf-node categories (such as Uto-Aztecan) are not useful for
language-specific processing. E.g. if all you know about the language of an
information object is that it is some Uto-Aztecan language, then you don't
have enough information to successfully spell-check. A comprehensive list
of leaf-node identifiers clearly would be useful, however. We should begin
by adopting such a set, and revisit the issue of hierarchical non-leaf-node
identifiers after their usefulness is better understood.
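
The distinction in point 2 can be sketched in a few lines. This is only an
illustration under made-up assumptions: the identifiers "lang-a", "lang-b"
and "family-x", the tiny word lists, and the function names are all
hypothetical and stand in for real leaf-node languages and a non-leaf
category such as Uto-Aztecan.

```python
# Toy hierarchy: leaf-node language identifier -> non-leaf category.
# All identifiers and word lists here are hypothetical placeholders.
PARENT = {
    "lang-a": "family-x",
    "lang-b": "family-x",
}

# Spelling resources can only be attached to individual languages.
WORDLISTS = {
    "lang-a": {"kali", "tomo"},
    "lang-b": {"kari", "tomu"},
}


def subject_matches(tagged_as, query):
    """Subject indexing: a query on a category matches its member languages."""
    return tagged_as == query or PARENT.get(tagged_as) == query


def spell_check(text, identifier):
    """Return the words not found in the word list for `identifier`.

    Returns None when the identifier has no word list, e.g. because it
    names a non-leaf category rather than a single language.
    """
    wordlist = WORDLISTS.get(identifier)
    if wordlist is None:
        return None
    return [word for word in text.lower().split() if word not in wordlist]
```

With this sketch, subject_matches("lang-a", "family-x") succeeds, so the
category is fine for finding materials, while spell_check(text, "family-x")
has nothing to work with and returns None.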

- Peter

Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
