L2/01-056R

From: Sandra O'donnell USG [odonnell@zk3.dec.com]
Sent: Thursday, January 25, 2001 11:31 AM

Subject: Updated version of Ethnologue summary

Attached is the updated summary of discussions regarding requirements
for namespace authorities and the Ethnologue. This incorporates Peter
Constable's valuable suggestions.

We will discuss this at next week's UTC in Mountain View.

		-- Sandra
-----------------------
Sandra Martin O'Donnell
Compaq Computer Corporation
sandra.odonnell@compaq.com
odonnell@zk3.dec.com

***********************************************************************
Over the last few months, a subset of UTC members has been working on
refining the requirements for a language namespace authority, and also
trying to determine what changes (if any) to recommend so that SIL's
Ethnologue can qualify as a language registry. This is in response to
Action Item 85-2.

Peter Constable, Rick McGowan, and I have been the active participants
in this discussion, with Gary Simons, Lisa Moore, and Arnold Winkler also
receiving the messages. Lisa has encouraged us to focus more on refining
generic requirements than on Ethnologue specifics.

Among the requirements for which there is general agreement are:

   * Provides a unique and stable identifier for each entry.
This means, in part, that once an identifier has been assigned, it
cannot be changed to refer to another language. 

   * Contains identifiers for a range of languages, along with
predictable ways to add new identifiers in new versions over time.
No authority would be required to include all languages. One might,
for example, cover living languages while another covers extinct
languages.

   * Lets individual namespace authorities decide how to define
languages, rather than attempting to impose a single definition
across authorities.

   * Clearly documents the type of category (individual language, group of
languages) and the specific speech variety (or varieties, in the case of a
langid for a group of languages) denoted by any given identifier.

   * Provides enough information about each entry to make the identity
clear. A language name alone is not adequate, for two reasons. First,
different languages in different parts of the world may share the same
name. Second, a single language may be referred to by different names,
even by its own speakers (e.g. different names used in different
countries or by different ethnic groups).

   * Is itself assigned a unique identifier as a namespace authority.
For example, "n-sil-" might be the tag that identifies the SIL
Ethnologue codes. 
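
As a rough illustration of the stability and namespace-tag requirements
above, here is a minimal Python sketch. The class and method names are
hypothetical, "n-sil-" is only the example prefix mentioned above, and
folding codes to lower case is an assumption made for the sketch rather
than anything agreed in the discussion.

```python
class StableRegistry:
    """Toy registry: once assigned, an identifier is never rebound."""

    def __init__(self, authority_prefix):
        # Prefix naming the authority, e.g. "n-sil-" (hypothetical).
        self.authority_prefix = authority_prefix
        self._entries = {}     # code -> description of the language
        self._retired = set()  # withdrawn codes, never reusable

    def assign(self, code, description):
        code = code.lower()  # assumption: codes are case-insensitive
        if code in self._entries or code in self._retired:
            raise ValueError(f"{code!r} is already assigned or retired")
        self._entries[code] = description

    def retire(self, code):
        # Remove the entry but remember the code so it cannot be reused.
        code = code.lower()
        del self._entries[code]
        self._retired.add(code)

    def tag(self, code):
        # Build the full tag: authority prefix plus the authority's code.
        code = code.lower()
        if code not in self._entries:
            raise KeyError(code)
        return self.authority_prefix + code


registry = StableRegistry("n-sil-")
registry.assign("ENG", "English")
print(registry.tag("ENG"))  # n-sil-eng
```

Retiring a code and then trying to assign it to a different language
raises ValueError, which is the behavior the first requirement asks for.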


We have also discussed, but failed to reach consensus on, several topics.
They include:

- Cross-referencing and hierarchies. This is easily the most serious issue
we are grappling with, and the two options we have debated represent
significantly different approaches. The issue is two-fold:

  * If identifiers exist for the same language in multiple namespace
  authorities, is there a hierarchy that determines which authority's
  code must be used first?
  -and-
  * Must authorities provide a cross-reference mapping between their
  identifiers and "the same" identifiers in other namespaces?

Some believe trying to provide cross-references would be impossible given
the differing ways authorities define languages, while others believe this
is required because it will otherwise push the problem to application
developers, who will solve it in inconsistent ways. Also, the IETF has so
far rejected, in RFC1766bis, the use of multiple tags for the same
language, even if the tags are defined in separate namespaces.

- Free availability. Some believe all registered codes must be freely
available to all on the Internet; others believe this would be nice,
but is not required.

- Control. If an organization becomes a namespace authority, does it
retain all control of all codes, or is there an oversight authority
that can resolve conflicts? For example, what if the IETF-languages
list/Language-Tag Reviewer wants to review the Ethnologue? What if
IETF creates a competing namespace authority that starts with Ethnologue
data and then makes "fixes" to perceived problems? 


Regarding the Ethnologue itself, we agree that the current
three-letter codes are not sufficient to support future growth while
also permanently retiring codes once they are removed from an existing
version. The three-letter space allows for 17,576 (26^3) distinct codes,
and a little less than one-half of those are still available. We have been
considering changing the codes to allow digits (0-9) as well as
case-insensitive letters (a-z). A four-letter code would be more
mnemonic, but may not work given RFC 1766's space requirements.
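
The code-space figures behind these options can be checked with a short
Python sketch. It assumes that digits, if allowed, could appear in any
position of the code; that detail was not settled in the discussion, and
the function name is hypothetical.

```python
LETTERS = 26  # a-z, case-insensitive
DIGITS = 10   # 0-9


def code_space(alphabet_size, length):
    """Number of distinct fixed-length codes over the given alphabet."""
    return alphabet_size ** length


three_letter = code_space(LETTERS, 3)          # 17576, the current space
three_alnum = code_space(LETTERS + DIGITS, 3)  # 46656, letters and digits
four_letter = code_space(LETTERS, 4)           # 456976, letters only

print(three_letter, three_alnum, four_letter)
```

Under that assumption, allowing digits in a three-character code gives
46,656 codes, a little more than two and a half times the current space,
while a four-letter code gives 456,976.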

We agree that if the Ethnologue became the basis for a language registry,
the codes (ENG, FRN, JPN, etc.) would be normative, while all other
information would be informative.

We have not fully agreed on what information should be extracted from
the Ethnologue to create the registered material SIL would submit. We
generally agree on these items:

- Language name
- Code (normative)
- Where spoken
- Approximate number of speakers
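
One way to picture an entry built from just these agreed items is the
Python sketch below. The field names are hypothetical, the values other
than the code ENG are placeholders rather than Ethnologue data, and
marking the record as frozen is only a way to suggest that the normative
code should not change after registration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistryEntry:
    code: str                              # normative
    language_name: str                     # informative
    where_spoken: str                      # informative
    approx_speakers: Optional[int] = None  # informative; may be unknown


entry = RegistryEntry(
    code="ENG",
    language_name="English",
    where_spoken="(placeholder)",
)
print(entry.code)  # ENG
```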

However, there is disagreement about items such as:

- Dialects
- Alternate names
- Bible availability
- Linguistic roots
- Miscellaneous information

Some believe all existing information should appear, especially since
all would be informative (other than the code itself). Others believe
some information is either inappropriate for an international standard,
or the available information is so inconsistent that it would confuse
users of the standard. And even though this material would be informative,
those who support other international standards know that informative
material can still be a maintenance burden.

There also has been debate about whether all existing Ethnologue
entries or a subset should be registered. Most believe all should be
registered; others are concerned about the sometimes-very-sketchy data
available with some entries.