L2/01-056R

From: Sandra O'donnell USG [odonnell@zk3.dec.com]
Sent: Thursday, January 25, 2001 11:31 AM
Subject: Updated version of Ethnologue summary

Attached is the updated summary of discussions regarding requirements
for namespace authorities and the Ethnologue. This incorporates Peter
Constable's valuable suggestions. We will discuss this at next week's
UTC in Mountain View.

-- Sandra

-----------------------
Sandra Martin O'Donnell
Compaq Computer Corporation
sandra.odonnell@compaq.com
odonnell@zk3.dec.com

***********************************************************************

Over the last few months, a subset of UTC members have been working on
refining the requirements for a language namespace authority, and also
trying to determine what changes (if any) to recommend to SIL's
Ethnologue to qualify as a language registry. This is in response to
Action Item 85-2.

Peter Constable, Rick McGowan, and I have been the active participants
in this discussion, with Gary Simons, Lisa Moore, and Arnold Winkler
also receiving the messages. Lisa has encouraged us to focus more on
refining generic requirements than on Ethnologue specifics.

Among the requirements for which there is general agreement are:

* Provides unique and stable identifiers for any given entry. This
  means, in part, that once an identifier has been assigned, it cannot
  be changed to refer to another language.
* Contains identifiers for a range of languages, along with predictable
  ways to add new identifiers in new versions over time. No authority
  would be required to include all languages. One might, for example,
  cover living languages while another covers extinct languages.
* Lets individual namespace authorities decide how to define languages,
  rather than attempting to impose a single definition across
  authorities.
* Clearly documents the type of category (individual language, group of
  languages) and the specific speech variety (or varieties, in the case
  of a langid for a group of languages) denoted by any given identifier.
* Provides enough information about each entry to make the identity
  clear. A language name alone is not adequate, for two reasons:
  different languages in different parts of the world may share the
  same name, and a single language may be referred to by different
  names, even by its own speakers (e.g. different names used in
  different countries or by different ethnic groups).
* Is itself assigned a unique identifier that names the namespace
  authority. For example, "n-sil-" might be the tag that identifies the
  SIL Ethnologue codes.

We have also discussed, but failed to reach consensus on, several
topics. They include:

- Cross-referencing and hierarchies. This is easily the most serious
  issue we are grappling with, and the two options we have debated
  represent significantly different approaches. The issue is two-fold:

  * If identifiers exist for the same language in multiple namespace
    authorities, is there a hierarchy for which code from which
    authority must be used first?
    -and-
  * Must authorities provide a cross-reference mapping between their
    identifiers and "the same" identifiers in other namespaces?

  Some believe trying to provide cross-references would be impossible
  given the differing ways authorities define languages, while others
  believe this is required because the problem will otherwise be pushed
  to application developers, who will solve it in inconsistent ways.
  Also, the IETF has so far rejected, in RFC1766bis, the use of
  multiple tags for the same language, even if they are defined in
  separate namespaces.

- Free availability. Some believe all registered codes must be freely
  available to all on the Internet; others believe this would be nice,
  but is not required.

- Control.
  If an organization becomes a namespace authority, does it retain all
  control of all codes, or is there an oversight authority that can
  resolve conflicts? For example, what if the IETF-languages
  list/Language-Tag Reviewer wants to review the Ethnologue? What if
  IETF creates a competing namespace authority that starts with
  Ethnologue data and then makes "fixes" to perceived problems?

Regarding the Ethnologue itself, we agree that the current three-letter
codes are not sufficient to support future growth while also
permanently retiring codes once they are removed from an existing
version. The three-letter space allows for 17,576 possible codes
(26 x 26 x 26), and a little less than one-half of those are still
available. We have been considering changing the codes to allow digits
(0-9) as well as case-insensitive letters (a-z). A four-letter code
would be more mnemonic, but may not work given RFC 1766's space
requirements.

We agree that if the Ethnologue became the basis for a language
registry, the codes (ENG, FRN, JPN, etc.) would be normative, while all
other information would be informative.

We have not reached agreement on what information might be extracted
from the Ethnologue to create the registered material SIL would submit.
We generally agree on these items:

- Language name
- Code (normative)
- Where spoken
- Approximate number of speakers

However, there is disagreement about items such as:

- dialects
- alternate names
- Bible availability
- linguistic roots
- miscellaneous information

Some believe all existing information should appear, especially since
all of it would be informative (other than the code itself). Others
believe some information is either inappropriate for an international
standard, or so inconsistent that it would confuse users of the
standard. And even though it would be informative, those who support
other international standards know that informative material can be a
maintenance issue.

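As a quick check of the code-space arithmetic discussed above, here is
a minimal Python sketch. The function name code_space is illustrative
only and is not part of any proposal; the three sizes compared are the
current three-letter codes, three-character codes that also allow
digits, and four-letter codes.

```python
import string

LETTERS = string.ascii_lowercase      # a-z, treated case-insensitively (26 symbols)
ALNUM = LETTERS + string.digits       # a-z plus 0-9 (36 symbols)


def code_space(alphabet, length):
    """Number of distinct fixed-length codes over the given alphabet."""
    return len(alphabet) ** length


three_letter = code_space(LETTERS, 3)   # current Ethnologue-style codes
three_alnum = code_space(ALNUM, 3)      # three characters, digits allowed
four_letter = code_space(LETTERS, 4)    # four letters

print(three_letter)   # 17576
print(three_alnum)    # 46656
print(four_letter)    # 456976
```

Allowing digits in a three-character code raises the space from 17,576
to 46,656, while a four-letter code would give 456,976.
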
There has also been debate about whether all existing Ethnologue
entries, or only a subset, should be registered. Most believe all
should be registered; others are concerned about the sometimes very
sketchy data available for some entries.
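
To make the "unique and stable identifiers" requirement and the idea of
permanently retired codes more concrete, here is a small Python sketch
of a toy registry. Everything in it is hypothetical: the class name
CodeRegistry, its methods, and the assumption that a full tag is formed
by appending the code to the authority prefix (e.g. "n-sil-" + "eng")
are illustrations only, not anything agreed in the discussion.

```python
class CodeRegistry:
    """Toy registry in which a code, once assigned, is never reassigned."""

    def __init__(self, authority_prefix):
        # Prefix identifying the namespace authority, e.g. "n-sil-".
        self.authority_prefix = authority_prefix
        self._active = {}    # code -> language name currently denoted
        self._retired = {}   # code -> language name formerly denoted

    def assign(self, code, name):
        """Assign a new code; refuse any code ever used, active or retired."""
        code = code.lower()  # codes are treated as case-insensitive
        if code in self._active or code in self._retired:
            raise ValueError(f"code {code!r} has already been assigned")
        self._active[code] = name

    def retire(self, code):
        """Retire an active code permanently; it stays reserved afterwards."""
        code = code.lower()
        self._retired[code] = self._active.pop(code)

    def tag(self, code):
        """Return the assumed full tag: authority prefix plus active code."""
        code = code.lower()
        if code not in self._active:
            raise KeyError(f"code {code!r} is not active")
        return self.authority_prefix + code


registry = CodeRegistry("n-sil-")
registry.assign("ENG", "English")
print(registry.tag("eng"))   # n-sil-eng
```

Retiring "eng" and then calling assign("eng", ...) again raises
ValueError in this sketch, which is the behavior the stability
requirement asks for.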