L2/01-056

From: Sandra O'donnell USG [odonnell@zk3.dec.com]
Sent: Wednesday, January 24, 2001 10:44 AM
Subject: Ethnologue summary (DRAFT!)

*******************************************************************

Attached is a draft of the progress report I've written for next week's UTC. I've tried to capture the areas where we have reached agreement, and identify where we still are in discussion mode. I need to give Arnold the "official" report by Thursday afternoon (East Coast time), Jan 25 because I will not be in the office after that.

Please read through this and let me know if you think I've summed this up accurately. Obviously, we still have many open issues on the table.

-- Sandra

-----------------------
Sandra Martin O'Donnell
Compaq Computer Corporation
sandra.odonnell@compaq.com
odonnell@zk3.dec.com

*****************************************************************

Over the last few months, a subset of UTC members have been working on refining the requirements for a language namespace authority, and also trying to determine what changes (if any) to recommend to SIL's Ethnologue to qualify as a language registry. This is in response to Action Item 85-2.

Peter Constable, Rick McGowan, and I have been the active participants in this discussion, with Gary Simons, Lisa Moore, and Arnold Winkler also receiving the messages. Lisa has encouraged us to focus more on refining generic requirements than on Ethnologue specifics.

Among the requirements for which there is general agreement are:

* Provides unique and stable identifiers for any given entry. This means, in part, that once an identifier has been assigned, it cannot be changed to refer to another language.

* Contains identifiers for a range of languages, along with predictable ways to add new identifiers in new versions over time. No authority would be required to include all languages. One might, for example, cover living languages while another covers extinct languages.

* Lets individual namespace authorities decide how to define languages, rather than attempting to impose a single definition across authorities.

* Clearly documents the type of category (individual language, group of languages) and the specific speech variety (or varieties, in the case of a langid for a group of languages) denoted by any given identifier.

* Provides enough information about each entry to make the identity clear. A language name alone is not adequate, for two reasons: different languages in different parts of the world may share the same name, and a single language may be referred to by different names, even by its own speakers (e.g. different names used in different countries or by different ethnic groups).

* Is assigned a unique identifier for a given namespace authority. For example, "n-sil-" might be the tag that identifies the SIL Ethnologue codes.

We have also discussed, but failed to reach consensus on, several topics. They include:

- Free availability. Some believe all registered codes must be freely available to all on the Internet; others believe this would be nice, but is not required.

- Control. If an organization becomes a namespace authority, does it retain full control of all codes, or is there an oversight authority that can resolve conflicts? For example, what if the IETF-languages list/Language-Tag Reviewer wants to review the Ethnologue? What if the IETF creates a competing namespace authority that starts with Ethnologue data and then makes "fixes" to perceived problems?

- Cross-referencing. If identifiers exist for the same language in multiple namespace authorities, is there a hierarchy for which code from which authority must be used first? Must authorities provide a cross-reference mapping between their identifiers and "the same" identifiers in other namespaces?
  Some believe trying to provide cross-references would be impossible given the differing ways authorities define languages, while others believe this is required because of existing practice and RFC 1766.

Regarding the Ethnologue itself, we agree that the current three-letter codes are not sufficient to support future growth while also permanently retiring codes once they are removed from an existing version. The three-letter space allows for 17,576 distinct codes, and a little less than one-half of those are still available. We have been considering changing the codes to allow digits (0-9) as well as case-insensitive letters (a-z). A four-letter code would be more mnemonic, but may not work given RFC 1766's space requirements.

We agree that if the Ethnologue became a language registry, the codes (ENG, FRN, JPN, etc.) would be normative, while all other information would be informative.

We have not reached agreement on what information should appear in Ethnologue entries. We generally agree on these items:

- Language name
- Code (normative)
- Where spoken
- Approximate number of speakers

However, there is disagreement about items such as:

- dialects
- alternate names
- Bible availability
- linguistic roots
- miscellaneous information

Some believe all existing information should appear, especially since all of it would be informative (other than the code itself). Others believe that some information either is inappropriate for an international standard or is so inconsistent that it would confuse users of the standard. And even though it would be informative, those who support other international standards know that informative material can be a maintenance issue.

There has also been debate about whether all existing Ethnologue entries or only a subset should be registered. Most believe all should be registered; others are concerned about the sometimes very sketchy data available for some entries.
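
The code-space figures mentioned above for the Ethnologue codes can be checked with a short calculation. What follows is only a minimal sketch in Python: the function name code_space and the four-letter case are illustrative assumptions, not part of any proposal under discussion.

```python
import string

# Assumed alphabets, following the discussion above:
LETTERS = string.ascii_lowercase      # a-z, case-insensitive: 26 symbols
ALNUM = LETTERS + string.digits       # a-z plus 0-9: 36 symbols


def code_space(alphabet: str, length: int) -> int:
    """Return the number of distinct fixed-length codes over an alphabet."""
    return len(alphabet) ** length


three_letter = code_space(LETTERS, 3)   # current three-letter scheme
three_alnum = code_space(ALNUM, 3)      # three characters, letters or digits
four_letter = code_space(LETTERS, 4)    # hypothetical four-letter scheme

print(three_letter, three_alnum, four_letter)
```

Under these assumptions, the three-letter space holds 26^3 = 17,576 codes, allowing digits in a three-character code raises that to 36^3 = 46,656, and a four-letter code would give 26^4 = 456,976.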