L2/01-060

From: Peter_Constable@sil.org
Sent: Wednesday, January 24, 2001 5:45 PM

Subject: [langids] relationships between namespaces (was Ethnologue
discussion)


On 01/23/2001 12:39:54 PM "Sandra O'donnell USG" wrote:
>I understand why you are thinking of the second approach, but I
>think That Way Lies Madness...


On 01/23/2001 03:24:08 PM Rick McGowan wrote:
>Anyway, I agree it is really unwise as well as untenable to try to make
>name spaces interdependent or overlapping as in alternative 2)...

I think we need to discuss some of the issues involved further before we
reach any conclusion since I think there is more at stake than has been
mentioned thus far, and I also think this is something that could generate
very strong reactions. Let me see if I can convince you that alternative 1
is not so obviously better than 2.

I mentioned that alt2 entails a requirement on an agency seeking to become
a registered namespace authority to provide a cross-reference to existing
langids, while alt1 makes no such requirement on them. On the other hand,
we have to consider whether alt1 doesn't simply transfer that burden from
the potential namespace authorities - who are the ones in the best position
to do it - to users, software developers and other clients of the overall
standard. Alt1 creates the potential that XML files or HTML pages can tag
English (for example) data in several different ways: "en", "eng", "angl",
"inkrish", etc. and users or software agents would need to know to equate
these. Keep in mind that the users and developers of that software are the
ones least in a position to evaluate the relationships between these codes.
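
To make that burden concrete, here is a minimal sketch of what each client
might have to maintain for itself under alt1. The table and the function
name are hypothetical, and the four tags are just the illustrative ones
above, not real registrations.

```python
# Hypothetical equivalence table that a client would have to build and
# maintain on its own under alt1. The tags are the illustrative ones from
# the text above, not real registrations.
SYNONYMS = {
    "en": "en",
    "eng": "en",
    "angl": "en",
    "inkrish": "en",
}


def same_language(tag_a: str, tag_b: str) -> bool:
    """Return True only if both tags are known and share a canonical tag."""
    canon_a = SYNONYMS.get(tag_a.lower())
    canon_b = SYNONYMS.get(tag_b.lower())
    return canon_a is not None and canon_a == canon_b
```

Any tag missing from the table is simply treated as unknown, which is
exactly the gap that each developer would be left to fill.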

There are two alternatives I can think of under alt1 to each developer
needing to support multiple synonymous codes:

- Each software developer chooses which namespaces they wish to support,
each one doing what is right in their own eyes. This is the very antithesis
of interchangeability, and would be clearly chaotic.

- Effectively, most or all choose to simply stick with just ISO or ISO &
IANA codes. In that case, we haven't accomplished anything, unless we think
we can find a namespace that meets a greater set of needs than ISO and that
can topple ISO to become the new de facto industry standard.

Also, keep in mind that, as it was conceived in the motions that were
passed by UTC, what we're working on is intended to result in a replacement
for / revision of RFC 1766 -- something owned by IETF. We need to consider
what IETF is likely to be willing to consider, and I very much question
whether they would be open to alt1. If we are willing to suggest that
people can choose between (say) Ethnologue's "afk" for Afrikaans vs. ISO
639's "af", then we might just as readily suggest that people are free to
choose between ISO 639's "af" and ISO 639-2's "afr" (or either "en" or
"eng", "fr" or "fra", etc.), which they very clearly and decidedly have
chosen not to do in RFC 1766bis. I really think that they would flatly
reject such a proposal. It certainly would be a hard sell.


Let me respond to some of your arguments:

SO>The problem comes when we get past the "easy" languages into those
>where ISO 639 has an ambiguous entry that could map to multiple
>SIL codes, or some other pair of registries has not-quite-matching
>definitions of languages, and therefore not-quite-matching codes.

This would indeed be a challenge, but I'm not sure it can be avoided, and
so I think the best thing is to get the people in the best position to work
that out to do so. The hardest thing about mapping Ethnologue codes to ISO
codes (based on some experience on our part) is knowing just what some of
the ISO codes are supposed to mean. I had been hoping that one of the
by-products of our work might be that there is increased pressure from
industry on the registration authorities for the ISO sets to make clear
just what each of their codes means.

Indeed, there is still the possibility that some codes will not align
neatly between the two systems. The most likely cases of non-identity are
that several codes from one system represent more specific varieties of a
single code from the other system (or, to make an analogy with sets, the
former are proper subsets of the latter); in other words, a many:one
relationship. I think it
would be rare to have cases in which you have to have many:many. I don't
know this for certain, but I think that's most likely. Once you have an
idea of the relationship, it's not hard to describe that relationship
somehow. In some discussions I've had with Gary, I've suggested that we not
only have to indicate which Ethnologue codes map into which ISO codes, but
also that we have to indicate the nature of the relationship. For example:

ethn: hop "Hopi" ==> iso639-2: nai "North American Indian (Other)",
specific-generic
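
A machine-readable form of such a statement could look roughly like the
sketch below. Only the Hopi entry and the "specific-generic" label come
from the example above; the class and field names, the "identical"
relation, and the generic "iso" namespace label are assumptions made for
illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    IDENTICAL = "identical"                # assumed label, not from the example
    SPECIFIC_GENERIC = "specific-generic"  # label used in the example above


@dataclass(frozen=True)
class Mapping:
    source_ns: str
    source_code: str
    target_ns: str
    target_code: str
    relation: Relation


# One entry, taken from the Hopi example; "iso" is a placeholder label.
MAPPINGS = [
    Mapping("ethn", "hop", "iso", "nai", Relation.SPECIFIC_GENERIC),
]


def mappings_for(source_ns: str, source_code: str) -> list:
    """Return every recorded mapping for the given source code."""
    return [
        m for m in MAPPINGS
        if m.source_ns == source_ns and m.source_code == source_code
    ]
```

Keeping the relation on each record is the point: a client can then tell
an exact equivalent from a merely broader category.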

Stating these kinds of things isn't hard. What's hard is knowing what the
relationships are, and the biggest obstacle to that, I think, is getting
ISO to clarify what their codes mean. If they would only do that, I don't
think there would be too many difficulties, mainly because their inventory
isn't too extensive. Now, if it came to mapping, for example, Ethnologue
codes to codes from some other large catalogue of languages with a
different inventory numbering in the thousands, that could take a lot of
work. I don't think it's very likely, though, that we'd see another agency
wanting to introduce a set like that. There just aren't too many agencies
that have done the amount of work it takes to catalogue thousands of
languages to even be in a position to do so.


>This just convinces me more that requiring cross-referencing among
>registries won't work. How can users know what was registered first?

A requester doesn't need to know; the IANA language tag reviewer would
know. As for users: once an agency is approved as a registered namespace
authority, it would be required to make clear which of its codes, if any,
are not to be used, and also to provide a mapping indicating which
previously-existing codes should be used instead.
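
As a rough sketch of what such a published mapping could let software do
(the table, its single entry, and the function name are hypothetical; the
assumption that Ethnologue's "afk" would defer to the existing "af" is
mine, for illustration only):

```python
# Hypothetical table published by a namespace authority under alt2:
# each of its codes that must not be used, mapped to the previously
# existing code to use instead.
SUPERSEDED = {
    "afk": "af",  # assumption: Ethnologue "afk" defers to ISO 639 "af"
}


def preferred_code(code: str) -> str:
    """Return the code to use: the replacement if one is listed, else the code itself."""
    return SUPERSEDED.get(code, code)
```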


>In your hypothetical example, you assume IANA won't accept Pumalarky
>because it's already covered, but what if the requestor argues that
>his/her Pumalarky is different from the Ethnologue's Pumalarky?

If they can explain how they think it's different, then they should be able
to get it registered with IANA, just as is the situation today.


>Then
>it might be registered, and now there are identical names for what may
>or may not be the "same" language. How do you cross-reference that?

Identical names for non-synonymous entities are not the concern. We already
have that in the clearer case of completely unrelated languages sharing the
same name. What matters is that no two *codes* are the same, and that the
meaning of codes is clearly documented. If we ended up with your code
"n-dec-pmky" and my code "n-sil-puy", and both you and I call these
languages "Pumalarky" but they're actually different, then a comparison of
the documentation that you and I will hopefully have provided should
indicate the differences.
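
The rule can be sketched as follows, assuming a hypothetical registry
class (none of these names come from any actual registration procedure):
codes must be unique, names need not be, and each entry carries its own
documentation.

```python
class CodeRegistry:
    """Hypothetical registry: unique codes, possibly repeated names."""

    def __init__(self):
        self._entries = {}  # code -> (name, documentation)

    def register(self, code: str, name: str, documentation: str) -> None:
        # Reject a duplicate code; a duplicate name is allowed.
        if code in self._entries:
            raise ValueError(f"code already registered: {code}")
        self._entries[code] = (name, documentation)

    def codes_named(self, name: str) -> list:
        """Return all codes registered under the given name, sorted."""
        return sorted(c for c, (n, _) in self._entries.items() if n == name)


registry = CodeRegistry()
registry.register("n-dec-pmky", "Pumalarky", "documentation from the first requester")
registry.register("n-sil-puy", "Pumalarky", "documentation from the second requester")
```

Both registrations succeed because the codes differ; registering either
code a second time would be refused.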

Bear in mind that the problem you raise here can just as readily arise in
alt1. The big difference is that you and I (continuing the hypothetical
example) have not been required to make any attempt to map our codes to the
others. Under alt2, we have had to do this, and the fact that "n-dec-pmky"
does not map directly to "n-sil-puy" at least tells users that there is
some assumed difference between these. Under alt1, users don't have that
helpful info.


RM>  It just
>won't work in reality and will cause nothing but trouble for everyone.  We
>should provide machine-readable cross reference tables where applicable
>and not try to interleave them.

The question, I think, is who is going to provide those cross reference
tables. It seems to me that, under alt1, there's no guarantee that anyone
will be held responsible for providing this info.


Peter