L2/03-261

From: ekeown@student.umass.edu
Sent: Tuesday, August 05, 2003 6:08 PM
To: lrajchel@ansi.org; jkarp@ansi.org
Subject: Fwd: letter, request for instructions

Elaine Keown
TEMPORARILY in Madison WI

Dear Lisa Rajchel and Jessica Karp:

I am resending the letter below. It may have been lost in cyberspace or perhaps you are both on a very delightful vacation. Prof. Kelly has written me an interesting and hopeful response.

In case other Hebraists wish to participate in this protest, where would they write you? I gather that computer standards work comes out of the NYC office. So I assume they could write to Ms. Rajchel with a copy to Ms. Karp. Please inform me what the proper emails are.

I am still working on this issue. I also still don't have an email for Dr. Arnold. I will be re-writing the letter in a somewhat different format for broader use.

Elaine Keown
_________________________________________________

Elaine Keown
Madison WI

Dear Lisa Rajchel and Jessica Karp:

I enclose below a letter of complaint which I have only managed to send to Prof. Kelly so far (via email). I am in the middle of moving and realized that it would not be appropriate to send via email to Dr. Hurwitz (even if I had his email). I haven't managed to fax him yet.

If others wish to join me in this protest, what is the appropriate way to contact you (email, fax, semi-open letter on the Internet with private code) and which people in your organization should be written?

Also, I found a George W. Arnold in Holmdel (via anywho.com), but I am not 100% sure that it is the right person.

Elaine Keown
__________________________________________________

En route to Madison, Wisconsin
ekeown@student.umass.edu
July, 2003

Dr. Mark W. Hurwitz
President and CEO, ANSI

Dr. George W. Arnold
Board of Directors, ANSI

Dr. William E. Kelly
Board of Directors, ANSI

RE: 1) real change for the Unicode Hebrew Block
    2) Linguistic oversight for character set standards development

Gentlemen:

As Dr.
Hurwitz may recall, I have been doing research on the complete character set for Hebrew (all 3,200 years) since summer 1999. It is a very discouraging activity. The two existing Unicode/ISO 10646 Hebrew blocks contain perhaps 45% of possible characters for Hebrew-script materials. As written, the code is not usable for computational Biblical Hebraists or for many Jewish studies scholars.

I was the primary researcher in this work, although I do not (yet) have a Ph.D. However, I was assisted, corrected, and guided by two recognized scholars, one in medieval Italian-Jewish studies and another in Aramaic. I also have been in contact with some of the world's leading computational Hebraists or Aramaists.

1) I am writing to ask you to facilitate extraordinary action on behalf of Hebrew.

Last year the Coptic Patriarch, the recognized representative for Coptic, still used in church liturgy, asked the Unicode Technical Committee to completely change their Coptic block. Such radical change is against UTC policy, but they did agree to move and completely redo the Coptic section of Unicode/ISO 10646.

The Hebrew block needs improvements which cannot be done within Unicode's "no change to the code" policy. As written, it will not even fully represent the Leningrad Codex, the most widely used Biblical Hebrew manuscript. For over a decade, computational Biblical Hebraists have struggled with Unicode Hebrew. We cannot use it and are not permitted to truly fix it. I discuss this issue at length in Attachment 1.

_________________________________________________________

2) I suggest that you greatly improve your system for overseeing character set development. I have two suggestions:

a. Add excellent linguists from four or five linguistic subdisciplines (1-2. languages of the world and/or typology, 3. sociolinguistics, 4. phonetics, and 5. writing systems) to your various directorships.

b.
Facilitate much closer interaction between academia and the standards world. Since the 1960s, there has been far too much distance between serious academics--by which I mean research professors and Ph.D. candidates--and people who actually develop standards. A lot of earlier standards work was done by librarians. However, library software is less demanding than the software needed today by computational linguists.

I discuss a) in Attachment 2.

On a personal level, I have waited over a decade now, in the American version of poverty, for Unicode Hebrew to change. I need to use Unicode for my own future corpus linguistics work. It is not possible for me to use the existing Unicode Hebrew block.

Sincerely,

Elaine Keown

cc: Dr. Seth Jerchower, Center for Advanced Judaic Studies Library, UPenn
    Prof. Paul Flesher, Dept. of Religious Studies, University of Wyoming
    Prof. Michael O'Connor, Dept. of Semitics, Catholic University of America
    Prof. Kirk Lowery, Westminster Hebrew Institute, Westminster Theological Seminary, Philadelphia
    Other interested Hebraists
__________________________________________

ATTACHMENT 1: COMPUTATIONAL HEBREW: HISTORY, DIFFICULTIES

Hebrew was computerized early and was probably the second language to which IR (information retrieval) was applied (early 1960s, Bar Ilan University and Weizmann Institute). However, the Hebrew which was computerized early was the great body of Hebrew literature without extra vowel marks, special punctuation, or text-critical marks.

Most such marked ("pointed") Hebrew consists of siddurim, machzorim, tikkunim, literature for students, and Biblical manuscripts. Religious poetry (piyyutim) and other manuscript genres sometimes also use "pointing."

The marked Hebrew has been computerized by a series of Biblical scholars, starting in France in the 1960s. Most of the computational work, especially by Protestants and Catholics, was done on a single Biblical manuscript, the Leningrad Codex.
However, none of these scholars, whether they worked in Italy, France, Israel, the U.S. or elsewhere, collaborated with a public standards body.

In the late 1980s, when Unicode was starting up, some scholars tried very hard to persuade the UTC to pay attention to their private, academic code for Leningrad. Apparently they were ignored by the UTC. In Israel, the standards people in Tel Aviv were apparently unaware of all such academic work on Hebrew.

So the Unicode Hebrew block was produced in isolation from previous academic Biblical Hebrew research. We have been told repeatedly by UTC members that the block cannot be corrected or changed, that it is set in concrete.

The existing Unicode Hebrew block has at least the following problems:

 1. The block is too small for the complete character set. Now Hebrew is in two separated blocks and it may end up in three.
 2. The basic collation will not sort optimally even for unpointed and pointed (Tiberian) Hebrew, much less for the 30-40 other Hebrew-script languages. Hebrew, like other caseless scripts, collates more easily than the more modern scripts with case (e.g., Greek, Latin, Armenian).
 3. The subsets within the block are badly grouped.
*4. Only 3 Hebrew-script languages are partially covered, out of a possible 30-40.
*5. The block is missing symbols needed for Leningrad and other critical codices (Aleppo, London, Cairo, the Qumran Isaiah).
*6. Some symbols are conflated and need semantic differentiation for optimal usage in IR.

*The starred items are easily fixable, at least compared to the unstarred ones.
_____________________________

ATTACHMENT 2: BETTER OVERSIGHT FOR CHARACTER SET DEVELOPMENT

a. Add excellent linguists from four or five linguistic subdisciplines (1-2. languages of the world and/or typology, 3. sociolinguistics, 4. phonetics, and 5. writing systems) to your character set directorships.
DISCUSSION: The best-known research agency on linguistic demographics and emerging literacy, the SIL (Summer Institute of Linguistics, Texas, http://www.sil.org), suggests that there are 6,809* languages or mutually unintelligible dialects in the world. Today probably only 40-50% of these languages are written in any script (personal email from SIL staffer, April 2003).

Of these languages or major dialects, possibly 18-20 use CJKV-type** character sets. That leaves ~6,780 which use (or will use when written) an alphabetic or syllabic script, whether Roman, Arabic, Indic, Ethiopic, Slavic, N'ko*** or other.

However, existing Unicode documentation lists only 19 Arabic-script languages out of a total of possibly 200. It contains a completely full Ethiopic section (no blank spaces) listing only a few potential Ethiopic-script languages. The entire Unicode documentation (the 3.0 book) lists fewer than 150 language names.

In addition, leading phoneticians (UCLA, Berkeley) suggest that the world's languages contain 900 different consonants and 300 different vowels. How many of these sounds are covered by variant letter forms within Unicode, in any script?

Does Unicode cover the "tip of the iceberg" linguistically? Does it cover 1/3 of languages now? No one knows.

*Statistics are from about 1999. They are compiling newer statistics now.
**Unicode / ISO acronym for "Chinese, Japanese, Korean and Vietnamese" characters.
***N'ko is an emerging, mostly Arabic-based West African script.
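
For readers who wish to check the arithmetic behind the figures quoted in the discussion above, the following is a minimal sketch in Python. It uses only the numbers stated in this attachment (6,809 languages, 18-20 CJKV-type languages, 40-50% written, 19 of a possible 200 Arabic-script languages). The constant and function names are illustrative assumptions and do not come from SIL or Unicode documentation.

```python
# Rough check of the figures quoted in Attachment 2.
# All inputs are the letter's own estimates (circa 1999-2003);
# the names below are illustrative, not from any standard.

TOTAL_LANGUAGES = 6809        # SIL estimate cited in the letter
CJKV_LANGUAGES = (18, 20)     # "possibly 18-20" use CJKV-type character sets
WRITTEN_PERCENT = (40, 50)    # "probably only 40-50%" are written in any script
ARABIC_LISTED = 19            # Arabic-script languages listed by Unicode
ARABIC_POSSIBLE = 200         # "a total of possibly 200"


def non_cjkv_range(total, cjkv):
    """Languages left after removing the CJKV ones, as (low, high)."""
    low_cjkv, high_cjkv = cjkv
    return total - high_cjkv, total - low_cjkv


def written_range(total, percent):
    """Written languages as (low, high), rounded down to whole languages."""
    low_pct, high_pct = percent
    return total * low_pct // 100, total * high_pct // 100


def coverage_percent(listed, possible):
    """Share of possible languages that are actually listed, in percent."""
    return 100 * listed / possible


if __name__ == "__main__":
    print("Non-CJKV languages:", non_cjkv_range(TOTAL_LANGUAGES, CJKV_LANGUAGES))
    print("Written languages:", written_range(TOTAL_LANGUAGES, WRITTEN_PERCENT))
    print("Arabic-script coverage (%):",
          coverage_percent(ARABIC_LISTED, ARABIC_POSSIBLE))
```

Under these inputs the non-CJKV count comes out just under 6,800, in line with the rounded "~6,780" above, and the listed Arabic-script languages amount to under a tenth of the possible total.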