Re: the Ethnologue

From: Antoine Leca (
Date: Thu Sep 14 2000 - 06:43:19 EDT

Peter Constable wrote:
> On 09/13/2000 10:25:21 AM Antoine Leca wrote:
> >While I agree with you, there are anyway problems with the way languages
> >are distinguished...
> Some comments in response:
> - This is not primarily about major languages.

I believe I was not clear enough.

Do you consider Valencian to be a major language?
If yes, why does Ethnologue deny it a separate code?
If no, then I was pointing out that even a major language such as Catalan
may lead to problems of subclassification. As far as I can tell, with my
deficient knowledge of German, the problem that Jörg pointed out with
German is that he thought Ethnologue went too far in that latter case.

I do not expect, nor in fact want, an actual answer. But as I said,
"while I agree with you", we had, have, and shall have problems
with the tagging of languages; it is vain to expect even solutions
(let alone perfection, as you pointed out) in this field.
We have to live with imperfection, misinformation, fuzziness, etc.

> The aim of adding thousands of new language identifiers to some standard
> system is focused on the thousands of languages that currently have
> nothing, not to replace what is already there for the few hundred that
> are already covered.

First, this point was not clear enough to me the first time.
Perhaps the fact that I have not seen any actual list of the languages
that might be added is a problem here.

Then, my point about Valencian was to highlight that, for political
reasons, new codes may be claimed for some languages and embedded in a
list of 2,000 new codes, leading to later problems of mis-tagging
for the IT industry. An obvious example from Ethnologue is the case
of the various dialects of the Oc language (I do not know whether they
are being considered for addition; but I know quite well what the
political positions are in this case, and I see only a can of worms here,
and certainly no solutions to real problems).

> We can improve our systems as we understand the
> needs of different processes better. When we get to that point, it is
> likely that a comprehensive enumeration of languages will be much more of
> an assistance rather than a hindrance.

This is where I cannot agree with you.

> (All points that Gary and I have made in our paper.)

I am sorry, I was not able to attend your conference because of an annoying
problem of distance... and unfortunately, the paper is not yet online,
so I am certainly misunderstanding some of your points.

I want to apologize for that.

> >And while this problem is more or less
> >possible to deal with when it comes to the major languages with abundant
> >literature and standardized spelling, at the very time it narrows to
> >lesser used languages, problems will arise.
> Actually, in some respects it is major languages that create some
> complications that don't apply to lesser-known languages.

Good point. So I stand corrected here.

> On the other hand, it is not clear that an attempt to adopt
> a comprehensive enumeration of languages will lead to many more problems.

Certainly it will.
It will certainly not solve the problems with the major languages, since it
does not attempt to improve the situation there (and fragmenting some
"languages" such as Serbo-Croatian, Occitan, German or Catalan is not likely
to improve the situation, IMHO).
And as for lesser-used languages, while it will recognise some current
practices, it will also introduce some new problems for all the other systems
that will now have to deal with all these new codes (an obvious example is
the UI to tag something: at the moment, a list with all the codes from
ISO-639-1 is often presented).

Please note that I am *not* implying that this should prevent us from making
that move. I certainly do not want to support Michael's position.
But saying that it is a cure without any harmful effect is much too strong
for my taste.

> In fact, there is much less of a problem if a comprehensive list of
> identifiers based on the Ethnologue were available for two reasons:
> 1. The Ethnologue will record change history, and any changes would be from
> one *known* quantity to another.

I am not that sure, because the rules for tagging are not that fixed.

It is obvious that a list with 2,000 codes is better than one with 450.
There is more information. And it will be better still with a list of
30,000 codes.

So if you are going to introduce "Lahu Shi" in place of "Sino-Tibetan (Other)",
you certainly increase the precision. Then if, 3 years from now, there
is another subdivision, the information will increase again. I do not see
where there is a gap in the process here.

Certainly, the point you are making is that the codes should *never* lose
a part of their meaning: either they should stay as is, or they should be
_replaced_ by a whole set of new codes that covers the whole range.

So with "Sino-Tibetan (Other)" (or more to the point with "Sami languages"),
splitting off _a_part_ of these "languages" to cover "Lahu Shi" (or
"Northern Sami") would lead to problem(s).

But if the whole code "Sino-Tibetan (Other)" is _replaced_ by a set of codes
(perhaps from Ethnologue surveys if they are recognised), you are just
increasing precision, while retaining the old information. No problem.
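
To make the distinction concrete, here is a minimal sketch in Python of the
two ways a code list could evolve. Everything in it is an assumption made
for illustration: the Registry class, its method names and the "x-..." codes
are hypothetical and come from neither ISO 639 nor the Ethnologue; only
"sit" is the actual ISO 639-2 code for "Sino-Tibetan (Other)".

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    # code -> name, for the codes currently in use
    active: dict = field(default_factory=dict)
    # retired code -> set of codes that together cover its old range
    replaced_by: dict = field(default_factory=dict)

    def replace(self, old, successors):
        """Retire `old` entirely and record the codes that cover its range."""
        del self.active[old]
        self.active.update(successors)
        self.replaced_by[old] = set(successors)

    def carve_out(self, new_code, name):
        """Add a narrower code while the broad code stays in use.

        Nothing records that the meaning of the broad code has shrunk.
        """
        self.active[new_code] = name

    def interpret(self, tag):
        """Return the active codes that a tag found on old data may stand for."""
        if tag in self.active:
            return {tag}
        result = set()
        for successor in self.replaced_by.get(tag, ()):
            result |= self.interpret(successor)
        return result


# Case 1: the whole code is replaced by a set of codes (hypothetical names).
replaced = Registry(active={"sit": "Sino-Tibetan (Other)"})
replaced.replace("sit", {
    "x-lahu-shi": "Lahu Shi",
    "x-sit-rest": "all remaining Sino-Tibetan languages",
})

# Case 2: one part is split off and the broad code silently survives.
carved = Registry(active={"sit": "Sino-Tibetan (Other)"})
carved.carve_out("x-lahu-shi", "Lahu Shi")

print(replaced.interpret("sit"))
print(carved.interpret("sit"))
```

In the first case, old data tagged "sit" can still be read as "one of the
successor codes", so nothing is lost. In the second case the same tag is
still reported as plain "sit", and nothing tells a reader that some of that
data may in fact be Lahu Shi.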

Furthermore, as I noted about Valencian or Occitan, the risk with the Ethnologue
lists is that while certain codes are good to add, where they give detail to
information that is presently too fuzzy, there are other codes whose addition
will lead to exactly the same problem as the one you are describing.

> In contrast, with ISO 639, the data is tagged as a largely unknown quantity
> - in the example, "Sino-Tibetan (other)", and when the system is updated to
> add a specific tag based on new knowledge, then the existing data is
> incorrectly tagged, and still as a largely unknown quantity. Not only do we
> not have any way to know to what extent it is incorrectly tagged, we in
> fact don't even have any way to determine that it *is* incorrectly tagged.
> (I'm discovering that the problem is worse than I realised every time I
> explain it.)

This is only because you are thinking that "Sino-Tibetan (Other)" should
survive the introduction of the 20+ codes you are referring to.
It should not.

Now, take data tagged as "fr-FR". Believe it or not, there are variations
(for example, Walloon or Norman or Gallo). Suppose we introduce new codes
for these variations. The only possible way to do that is to also create
a new code for "Standard French", or something like that. I believe
everybody sees what a nightmare this kind of idea will lead to
(particularly if the splitting process is done in two or three steps: the
first one will create "Non-Wallonic French", the second will create
"Regular French" --while still including Gallo--, to end with "Standard
French" which still includes Berrichon ;-)).
Furthermore, data which are really Walloon or Norman will still be tagged
as French for a lot of reasons (for example, availability of tools),
if only because there is a large similarity with Standard French.

Clearly, this is impossible. However, the situations for "Sino-Tibetan
(Other)" vs. "French" are two extremes of a range that exhibits every
intermediate case. I agree without reservation with the replacement of the
"... (Other)" codes when data are available and universally agreed upon
(though I believe there will always be data that defy classification).
But the level of discrimination between new language codes should
be carefully designed, because it is very, very easy to go too far in this
respect. I believe this is Michael's fundamental objection, BTW.
Now, when it comes to any new language code that attempts to describe
something that is currently embedded in a larger-scale code...
Well, that's the problem. At least, in my eyes.

