L2/01-060

From: Peter_Constable@sil.org
Sent: Wednesday, January 24, 2001 5:45 PM
Subject: [langids] relationships between namespaces (was Ethnologue discussion)

On 01/23/2001 12:39:54 PM "Sandra O'donnell USG" wrote:

>I understand why you are thinking of the second approach, but I
>think That Way Lies Madness...

On 01/23/2001 03:24:08 PM Rick McGowan wrote:

>Anyway, I agree it is really unwise as well as untenable to try to make
>name spaces interdependent or overlapping as in alternative 2)...

I think we need to discuss some of the issues involved further before we reach any conclusion, since I think there is more at stake than has been mentioned thus far, and I also think this is something that could generate very strong reactions.

Let me see if I can convince you that alternative 1 is not so obviously better than 2.

I mentioned that alt2 entails a requirement on an agency seeking to become a registered namespace authority to provide a cross-reference to existing langids, while alt1 makes no such requirement on them. On the other hand, we have to consider whether alt1 doesn't simply transfer that burden from the potential namespace authorities, who are the ones in the best position to do it, to users, software developers and other clients of the overall standard.

Alt1 creates the potential that XML files or HTML pages can tag English (for example) data in several different ways: "en", "eng", "angl", "inkrish", etc., and users or software agents would need to know to equate these. Keep in mind that the users and developers of that software are the ones least in a position to evaluate the relationships between these codes.

There are two alternatives I can think of under alt1 to each developer needing to support multiple synonymous codes:

- Each software developer chooses which namespaces they wish to support, each one doing what is right in their own eyes. This is the very antithesis of interchangeability, and would clearly be chaotic.
- Effectively, most or all choose to simply stick with just ISO or ISO & IANA codes. In that case, we haven't accomplished anything, unless we think we can find a namespace that meets a greater set of needs than ISO and that can topple ISO to become the new de facto industry standard.

Also, keep in mind that, as it was conceived in the motions that were passed by UTC, what we're working on is intended to result in a replacement for / revision of RFC 1766 -- something owned by IETF. We need to consider what IETF is likely to be willing to accept, and I very much question whether they would be open to alt1. If we are willing to suggest that people can choose between (say) Ethnologue's "afk" for Afrikaans vs. ISO 639's "af", then we might just as readily suggest that people are free to choose between ISO 639's "af" and ISO 639-2's "afr" (or either "en" or "eng", "fr" or "fra", etc.), which they very clearly and decidedly have chosen not to do in RFC 1766bis. I really think that they would flatly reject such a proposal. It certainly would be a hard sell.

Let me respond to some of your arguments:

SO>The problem comes when we get past the "easy" languages into those
>where ISO 639 has an ambiguous entry that could map to multiple
>SIL codes, or some other pair of registries has not-quite-matching
>definitions of languages, and therefore not-quite-matching codes.

This would indeed be a challenge, but I'm not sure it can be avoided, and so I think the best thing is to get the people in the best position to work that out to do so. The hardest thing about mapping Ethnologue codes to ISO codes (based on some experience on our part) is knowing just what some of the ISO codes are supposed to mean. I had been hoping that one of the by-products of our work might be increased pressure from industry on the registration authorities for the ISO sets to make clear just what each of their codes means.

Indeed, there is still the possibility that some codes will not align neatly between the two systems. The most likely cases of non-identity are those in which codes from one system represent more specific varieties than a code from the other system (or, to make an analogy with sets, the former are proper subsets of the latter); in other words, a many:one relationship. I think it would be rare to have cases in which the relationship has to be many:many. I don't know this for certain, but I think that's most likely.

Once you have an idea of the relationship, it's not hard to describe that relationship somehow. In some discussions I've had with Gary, I've suggested that we not only have to indicate which Ethnologue codes map into which ISO codes, but also that we have to indicate the nature of the relationship. For example:

    ethn: hop "Hopi" ==> iso639-2: nai "North American Indian (Other)", specific-generic

Stating these kinds of things isn't hard. What's hard is knowing what the relationships are, and the biggest obstacle to that, I think, is getting ISO to clarify what their codes mean. If they would only do that, I don't think there would be too many difficulties, mainly because their inventory isn't too extensive.

Now, if it came to mapping, for example, Ethnologue codes to codes from some other large catalogue of languages with a different inventory numbering in the thousands, that could take a lot of work. I don't think it's very likely, though, that we'd see another agency wanting to introduce a set like that. There just aren't too many agencies that have done the amount of work it takes to catalogue thousands of languages and so be in a position to do so.

>This just convinces me more that requiring cross-referencing among
>registries won't work. How can users know what was registered first?

A requester doesn't need to know; the IANA language tag reviewer would know.
As for users: once an agency is approved as a registered namespace authority, it would be required to make clear which of its codes, if any, are not to be used, and to provide a mapping indicating which previously existing codes should be used instead.

>In your hypothetical example, you assume IANA won't accept Pumalarky
>because it's already covered, but what if the requestor argues that
>his/her Pumalarky is different from the Ethnologue's Pumalarky?

If they can explain how they think it's different, then they should be able to get it registered with IANA, just as is the case today.

>Then
>it might be registered, and now there are identical names for what may
>or may not be the "same" language. How do you cross-reference that?

Identical names for non-synonymous entities are not the concern. We already have that in the clearer case of completely unrelated languages sharing the same name. What matters is that no two *codes* are the same, and that the meaning of codes is clearly documented. If we ended up with your code "n-dec-pmky" and my code "n-sil-puy", and both you and I call these languages "Pumalarky" but they're actually different, then a comparison of the documentation that you and I will hopefully provide should give an indication of the differences.

Bear in mind that the problem you raise here can just as readily arise in alt1. The big difference is that, under alt1, you and I (continuing the hypothetical example) have not been required to make any attempt to map our codes to the others. Under alt2, we have had to do this, and the fact that "n-dec-pmky" does not map directly to "n-sil-puy" at least tells users that there is some assumed difference between these. Under alt1, users don't have that helpful info.

RM> It just
>won't work in reality and will cause nothing but trouble for everyone. We
>should provide machine-readable cross reference tables where applicable and
>not try to interleave them.

The question, I think, is who is going to provide those cross reference tables. It seems to me that, under alt1, there's no guarantee that anyone will be held responsible for providing this info.

Peter
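
To make the cross-reference idea discussed above a little more concrete, here is a minimal sketch of a machine-readable table in which each entry records not only which code maps to which, but also the nature of the relationship. Everything in it is hypothetical and for illustration only: the names Relation, CrossRef, preferred_code and same_language, the "ethn:" and "iso:" namespace prefixes, and the decision to treat "afk" and "af" as exact equivalents. It is not a proposed format, only one way such a table might be consumed by software.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """Nature of the mapping between two codes (labels are assumptions)."""
    EQUIVALENT = "equivalent"              # both codes denote the same language
    SPECIFIC_GENERIC = "specific-generic"  # source is narrower than target


@dataclass(frozen=True)
class CrossRef:
    source: str         # namespace-qualified code being mapped, e.g. "ethn:hop"
    target: str         # previously existing code it maps to
    relation: Relation  # how the two codes relate


# Illustrative entries only, built from the code pairs mentioned in the message.
CROSS_REFS = [
    CrossRef("ethn:afk", "iso:af", Relation.EQUIVALENT),
    CrossRef("ethn:hop", "iso:nai", Relation.SPECIFIC_GENERIC),
]

# Index the table by source code for direct lookup.
_BY_SOURCE = {ref.source: ref for ref in CROSS_REFS}


def preferred_code(code: str) -> str:
    """Return the existing code to use in place of `code`.

    Only an exact equivalent is substituted; a specific-generic mapping
    keeps the more specific code, since replacing it would lose information.
    """
    ref = _BY_SOURCE.get(code)
    if ref is not None and ref.relation is Relation.EQUIVALENT:
        return ref.target
    return code


def same_language(a: str, b: str) -> bool:
    """True if both codes resolve to the same preferred code."""
    return preferred_code(a) == preferred_code(b)
```

With a table like this, a client could treat "ethn:afk" and "iso:af" as the same tag, while still seeing that "ethn:hop" is related to, but not interchangeable with, "iso:nai". Who maintains the table is exactly the open question between alt1 and alt2.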