L2/01-056

From: Sandra O'donnell USG [odonnell@zk3.dec.com]
Sent: Wednesday, January 24, 2001 10:44 AM

Subject: Ethnologue summary (DRAFT!)

*******************************************************************
Attached is a draft of the progress report I've written for next week's
UTC. I've tried to capture the areas where we have reached agreement,
and identify where we still are in discussion mode.

I need to give Arnold the "official" report by Thursday afternoon,
Jan 25 (East Coast time), because I will not be in the office after
that. Please read through this and let me know if you think I've
summed this up accurately. Obviously, we still have many open issues
on the table.

		-- Sandra
-----------------------
Sandra Martin O'Donnell
Compaq Computer Corporation
sandra.odonnell@compaq.com
odonnell@zk3.dec.com

*****************************************************************
Over the last few months, a subset of UTC members has been working on
refining the requirements for a language namespace authority, and also
trying to determine what changes (if any) to recommend so that SIL's
Ethnologue could qualify as a language registry. This is in response to
Action Item 85-2.

Peter Constable, Rick McGowan, and I have been the active participants
in this discussion, with Gary Simons, Lisa Moore, and Arnold Winkler also
receiving the messages. Lisa has encouraged us to focus more on refining
generic requirements than on Ethnologue specifics.

Among the requirements for which there is general agreement are:


   * Provides unique and stable identifiers for any given entry.
This means, in part, that once an identifier has been assigned, it
cannot be changed to refer to another language. 

   * Contains identifiers for a range of languages, along with
predictable ways to add new identifiers in new versions over time.
No authority would be required to include all languages. One might,
for example, cover living languages while another covers extinct
languages.

   * Lets individual namespace authorities decide how to define
languages, rather than attempting to impose a single definition
across authorities.

   * Clearly documents the type of category (individual language, group of
languages) and the specific speech variety (or varieties, in the case of a
langid for a group of languages) denoted by any given identifier.

   * Provides enough information about each entry to make the identity
clear. A language name alone is not adequate, for two reasons:
different languages in different parts of the world may share the same
name, and a single language may be referred to by different names,
even by speakers of the language (e.g. different names used in different
countries or by different ethnic groups).

   * Is itself assigned a unique identifier as a namespace authority.
For example, "n-sil-" might be the tag that identifies the SIL
Ethnologue codes.
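
To make the stability and namespace-tag requirements above concrete,
here is a minimal sketch in Python. It is an illustration only, not a
proposed design: the class name NamespaceRegistry and its methods are
hypothetical, and the "n-sil-" prefix is the tentative example from the
list above. The one rule it tries to show is that a code, once assigned,
can never be pointed at a different language, even after it is retired.

```python
class NamespaceRegistry:
    """Hypothetical registry for one namespace authority (illustration only)."""

    def __init__(self, prefix):
        self.prefix = prefix      # e.g. "n-sil-" (tentative tag from the discussion)
        self._entries = {}        # active codes -> description
        self._retired = set()     # codes withdrawn but never reusable

    def assign(self, code, description):
        key = code.lower()        # codes are treated as case-insensitive
        if key in self._entries or key in self._retired:
            raise ValueError(f"code {key!r} was already assigned and cannot be reused")
        self._entries[key] = description

    def retire(self, code):
        key = code.lower()
        self._entries.pop(key)    # raises KeyError if the code is not active
        self._retired.add(key)    # remembered permanently so it is never reassigned

    def tag(self, code):
        key = code.lower()
        if key not in self._entries:
            raise KeyError(f"no active entry for {key!r}")
        return self.prefix + key


sil = NamespaceRegistry("n-sil-")
sil.assign("ENG", "English")
print(sil.tag("eng"))             # n-sil-eng

sil.retire("ENG")
try:
    sil.assign("eng", "some other language")
except ValueError as err:
    print("rejected:", err)
```

Keeping retired codes in a separate set is one simple way to get the
"cannot be changed to refer to another language" behavior; a real
registry would presumably also record version and date information.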


We have also discussed, but failed to reach consensus on, several topics.
They include:

- Free availability. Some believe all registered codes must be freely
available to all on the Internet; others believe this would be nice,
but is not required.

- Control. If an organization becomes a namespace authority, does it
retain all control of all codes, or is there an oversight authority
that can resolve conflicts? For example, what if the IETF-languages
list/Language-Tag Reviewer wants to review the Ethnologue? What if
IETF creates a competing namespace authority that starts with Ethnologue
data and then makes "fixes" to perceived problems? 

- Cross-referencing. If identifiers exist for the same language in
multiple namespace authorities, is there a hierarchy for which code
from which authority must be used first? Must authorities provide a
cross-reference mapping between their identifiers and "the same"
identifiers in other namespaces? Some believe trying to provide
cross-references would be impossible given the differing ways
authorities define languages, while others believe this is required
because of existing practice and RFC 1766.


Regarding the Ethnologue itself, we agree that the current
three-letter codes are not sufficient to support future growth while
also permanently retiring codes once they are removed from an existing
version. The three-letter space allows for 17,576 (26^3) possible codes,
and a little less than one-half of those are still available. We have been
considering changing the codes to allow digits (0-9) as well as
case-insensitive letters (a-z). A four-letter code would be more
mnemonic, but may not work given RFC 1766's space requirements.
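
As a quick check on the sizes involved, the short Python sketch below
computes the number of possible codes for each option mentioned above.
The function and variable names are made up for illustration; only the
alphabet sizes (26 letters, 10 digits) and code lengths come from the
discussion.

```python
import string

LETTERS = string.ascii_lowercase        # 26 case-insensitive letters (a-z)
ALNUM = LETTERS + string.digits         # 36 symbols if digits (0-9) are allowed


def code_space(alphabet_size, length):
    """Number of distinct fixed-length codes over an alphabet of the given size."""
    return alphabet_size ** length


three_letter = code_space(len(LETTERS), 3)   # current Ethnologue scheme
three_alnum = code_space(len(ALNUM), 3)      # three characters, letters or digits
four_letter = code_space(len(LETTERS), 4)    # four letters, no digits

print(three_letter)   # 17576
print(three_alnum)    # 46656
print(four_letter)    # 456976
```

Allowing digits in a three-character code roughly multiplies the space
by 2.65, while a fourth letter multiplies it by 26.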

We agree that if the Ethnologue became a language registry, the codes
(ENG, FRN, JPN, etc.) would be normative, while all other information
would be informative.

We have not reached agreement on what information should appear in
Ethnologue entries. We generally agree on these items:

- Language name
- Code (normative)
- Where spoken
- Approximate number of speakers

However, there is disagreement about items such as:

- dialects
- alternate names
- Bible availability
- linguistic roots
- miscellaneous information

Some believe all existing information should appear, especially since
all would be informative (other than the code itself). Others believe
some information is either inappropriate for an international standard,
or the available information is so inconsistent that it would confuse
users of the standard. And even though it would be informative, those
who support other international standards know that informative material
can be a maintenance burden.

There also has been debate about whether all existing Ethnologue
entries or a subset should be registered. Most believe all should be
registered; others are concerned about the sometimes-very-sketchy data
available with some entries.