L2/00-437

From: Sandra O'donnell USG [odonnell@zk3.dec.com]
Sent: Wednesday, December 13, 2000 3:50 PM
Subject: Ethnologue discussion

*********

First, I must apologize for how long it's taken me to send out a
follow-up to the November UTC discussion about the Ethnologue. I have
been embroiled in a never-ending, ever-more-complex debate on the
nature of internationalized regular expressions in the Single Unix
Standard, and that has consumed a you-wouldn't-believe amount of my
time.

Because it has been a while since the UTC, I'll try to summarize where
I think we are, and make suggestions for how we move forward. Feel free
to correct or enhance my memory.

I think we all agree that ISO 639 is not an adequate solution for
language identification. Can the Ethnologue be the best way to refer to
languages not covered in ISO 639, or languages that are imprecisely
defined? Should we use all of the Ethnologue, or some subset of the
entries and/or information?

Here are some issues I wrote down from the discussion:

* If the Ethnologue becomes an IT standard, who maintains it? Among the
  options are:
    + SIL maintains complete control
    + SIL continues updating the entries (i.e., does all the "real
      work") with oversight from a to-be-named review board
    + Another group takes over the standard altogether
    + Other options?
  How would disputes be resolved? How would new versions be issued?
* What pieces of information from the Ethnologue would users see if
  they looked up a given value? The current data includes:
    + three-letter code
    + region/country/countries where spoken
    + estimated number of speakers (also second language speakers, if
      relevant)
    + linguistic roots
    + dialects (if any)
    + alternate names
    + availability of Bible
    + similarity with other languages
    + misc. facts
    + other??
  I believe we all agree dialect information would be omitted. I also
  believe the Bible information is not relevant for IT use, and that
  the alternate names data is problematic. What do others think?
* [Not in my notes, but I just thought of this] Should new categories
  of information be added to entries if this becomes a standard?
  Possibilities might be the script(s) used to write the language;
  directionality; living/dead status; others?
* Should all Ethnologue entries or only a subset be part of the IT
  standard? If we decide on a subset, what should be omitted? Any
  current entry that ends with the phrase "Survey needed"? Something
  else?
* What are the criteria for adding/changing values in the Ethnologue?
  John Fiscella showed me something that said the 12th edition of the
  Ethnologue has 8,571 languages, while the 14th edition has 6,800. I
  can't confirm either number, but if they are correct, why are they so
  different? If a three-letter code is used in one edition, and that
  language is removed in a follow-on edition, can the code be reused?
  If not, how many codes are still available? If codes can be reused,
  how do we resolve version conflicts? In either case, are there enough
  available values in the three-letter code space to allow for
  anticipated growth over the years (remember Unicode thought 16 bits
  would be enough)?
* I think we said during the discussion that the Ethnologue does not
  cover invented (e.g., Klingon) or dead (e.g., Egyptian, Coptic)
  languages. If that's correct, are there plans to expand coverage? Or
  would the position be that another registry would have to handle
  those categories?
* What cross-reference information would be available with each
  Ethnologue entry? (My notes here are vague; does anyone remember what
  this was about?)

Those are the issues I remember. Please add more if you remember them.
Then, I think it would be useful to try to answer some of the questions
I've outlined here. Once we agree, we can use that to bring a report to
the UTC.

-----------------------
Sandra Martin O'Donnell
Compaq Computer Corporation
sandra.odonnell@compaq.com
odonnell@zk3.dec.com