L2/03-261

   From: ekeown@student.umass.edu
   Sent: Tuesday, August 05, 2003 6:08 PM
   To: lrajchel@ansi.org; jkarp@ansi.org
   Subject: Fwd: letter, request for instructions
  
  
             Elaine Keown
             TEMPORARILY in Madison WI
  
   Dear Lisa Rajchel and Jessica Karp:
  
   I am resending the letter below.  It may have been lost in 
   cyberspace or perhaps you are both on a very delightful vacation.
  
   Prof. Kelly has written me an interesting and hopeful response.
  
   In case other Hebraists wish to participate in this protest, 
   where would they write you?
  
   I gather that computer standards work comes out of the NYC 
   office.  So I assume they could write to Ms. Rajchel with a copy 
   to Ms. Karp.
  
   Please inform me what the proper emails are.
   I am still working on this issue.  I also still don't have an 
   email for Dr. Arnold.
  
   I will be re-writing the letter in a somewhat different format 
   for broader use.
  
   Elaine Keown
   _________________________________________________
  
             Elaine Keown
             Madison WI
  
   Dear Lisa Rajchel and Jessica Karp:
  
   I enclose below a letter of complaint which I have only managed 
   to send to Prof. Kelly so far (via email).
  
   I am in the middle of moving and realized that it would not be 
   appropriate to send via email to Dr. Hurwitz (even if I had his 
   email).  I haven't managed to fax him yet.
  
   If others wish to join me in this protest, what is the 
   appropriate way to contact you (email, fax, semi-open letter on 
   the Internet with private code) and which people in your 
   organization should be written?
  
   Also, I found a George W. Arnold in Holmdel (via anywho.com), but 
   I am not 100% sure that it is the right person.
  
   Elaine Keown
   __________________________________________________
  
             En route to Madison, Wisconsin
             ekeown@student.umass.edu
             July, 2003
  
  
   Dr. Mark W. Hurwitz
   President and CEO, ANSI
  
   Dr. George W. Arnold
   Board of Directors, ANSI
  
   Dr. William E. Kelly
   Board of Directors, ANSI
  
   RE:  1) Real change for the Unicode Hebrew Block
        2) Linguistic oversight for character set standards development
  
   Gentlemen:
  
   As Dr. Hurwitz may recall, I have been doing research on the complete
   character set for Hebrew (all 3,200 years) since summer 1999.  It is a very 
   discouraging activity.
  
   The two existing Unicode/ISO 10646 Hebrew blocks contain perhaps
   45% of the possible characters for Hebrew-script materials.  As
   written, the code is not usable for computational Biblical Hebraists
   or for many Jewish studies scholars.  I was the primary researcher in
   this work, although I do not (yet) have a Ph.D.  However, I was
   assisted, corrected, and guided by two recognized scholars, one in
   medieval Italian-Jewish studies and another in Aramaic.  I have also
   been in contact with some of the world's leading computational
   Hebraists and Aramaists.
  
   1) I am writing to ask you to facilitate extraordinary action on
   behalf of Hebrew.  Last year the Coptic Patriarch, the recognized
   representative for Coptic (which is still used in church liturgy),
   asked the Unicode Technical Committee to completely change its
   Coptic block.
  
   Such radical change is against UTC policy, but they did agree to move
   and completely redo the Coptic section of Unicode/ISO 10646.
  
   The Hebrew block needs improvements which cannot be done within 
   Unicode's "no change to the code" policy.  As written, it will 
   not even fully represent the Leningrad Codex, the most widely 
   used Biblical Hebrew manuscript. For over a decade, computational Biblical 
   Hebraists have struggled with Unicode Hebrew.
  
   We cannot use it and are not permitted to truly fix it.
   I discuss this issue at length in Attachment 1.
  
   2) I suggest that you greatly improve your system for overseeing
   character set development.  I have two suggestions:

   a.  Add excellent linguists from four or five linguistic
   subdisciplines (1-2. languages of the world and/or typology,
   3. sociolinguistics, 4. phonetics, and 5. writing systems) to your
   various directorships.
  
   b.  Facilitate much closer interaction between academia and the 
   standards world.  Since the 1960s, there has been far too much 
   distance between serious academics--by which I mean research professors and Ph.D. 
   candidates--and people who actually develop standards.  A lot of earlier standards 
   work was done by librarians.  However, library software is less demanding than the 
   software needed today by computational linguists.
  
   I discuss a) in Attachment 2.
  
   On a personal level, I have now waited over a decade, in the American
   version of poverty, for Unicode Hebrew to change.  I need to use
   Unicode for my own future corpus linguistics work.  It is not possible
   for me to use the existing Unicode Hebrew block.
  
   Sincerely,
  
  
   Elaine Keown
  
   cc:  Dr. Seth Jerchower, Center for Advanced Judaic Studies Library, UPenn
        Prof. Paul Flesher, Dept. of Religious Studies, University of Wyoming
        Prof. Michael O'Connor, Dept. of Semitics, Catholic University of America
        Prof. Kirk Lowery, Westminster Hebrew Institute, Westminster
        Theological Seminary, Philadelphia
        Other interested Hebraists
   __________________________________________
  
   ATTACHMENT 1:  COMPUTATIONAL HEBREW:  HISTORY, DIFFICULTIES
  
   Hebrew was computerized early and was probably the second
   language to which information retrieval (IR) was applied (early
   1960s, Bar Ilan University and Weizmann Institute).

   However, the Hebrew which was computerized early was the great
   body of Hebrew literature written without the extra vowel, special
   punctuation, and text-critical marks.
  
   Most such marked ("pointed") Hebrew consists of siddurim, machzorim,
   tikkunim, literature for students, and Biblical manuscripts.  Religious poetry
   (piyyutim) and other manuscript genres sometimes also use "pointing."
  
   Marked Hebrew has been computerized by a series of Biblical
   scholars, beginning in France in the 1960s.  Most of the
   computational work, especially by Protestants and Catholics, was done
   on a single Biblical manuscript, the Leningrad Codex.  However, none
   of these scholars, whether they worked in Italy, France, Israel, the
   U.S. or elsewhere, collaborated with a public standards body.
  
   In the late 1980s, when Unicode was starting up, some scholars tried
   very hard to persuade the UTC to pay attention to their private, academic
   code for Leningrad.  Apparently they were ignored by the UTC.  In Israel, the
   standards people in Tel Aviv were apparently unaware of all such academic work
   on Hebrew.
  
   So the Unicode Hebrew block was produced in isolation from 
   previous academic Biblical Hebrew research.  We have been told repeatedly by UTC 
   members that the block cannot be corrected or changed, that it is set in concrete.
  
   The existing Unicode Hebrew block has at least the following problems:
  
     1.  The block is too small for the complete character set.  Hebrew
         is now in two separate blocks and it may end up in three.
     2.  The basic collation will not sort optimally even for unpointed
         and pointed (Tiberian) Hebrew, much less for the 30-40 other
         Hebrew-script languages.  Hebrew, like other caseless scripts,
         collates more easily than the more modern scripts with case
         (e.g., Greek, Latin, Armenian).
     3.  The subsets within the block are badly grouped.
    *4.  Only 3 Hebrew-script languages are partially covered, out of a
         possible 30-40.
    *5.  The block is missing symbols needed for Leningrad and other
         critical codices (Aleppo, London, Cairo, the Qumran Isaiah).
    *6.  Some symbols are conflated and need semantic differentiation
         for optimal usage in IR.
   *The starred items are easily fixable, at least compared to the 
   unstarred ones.
   _____________________________
  
   ATTACHMENT 2:   BETTER OVERSIGHT FOR CHARACTER SET DEVELOPMENT
  
   a.  Add excellent linguists from four or five linguistic 
   subdisciplines (1-2. languages of the world and/or typology, 3. sociolinguistics, 4. 
   phonetics, and 5. writing systems) to your character set directorships.
  
   DISCUSSION:   The best-known research agency on linguistic
   demographics and emerging literacy, the SIL (Summer Institute of
   Linguistics, Texas, http://www.sil.org), suggests that there are
   6,809* languages or mutually unintelligible dialects in the world.
   Today probably only 40-50% of these languages are written in any
   script (personal email from SIL staffer, April 2003).
  
   Of these languages or major dialects, possibly 18-20 use
   CJKV-type** character sets.  That leaves ~6,790 which use (or will
   use when written) an alphabetic or syllabic script, whether Roman,
   Arabic, Indic, Ethiopic, Slavic, N'ko*** or other.
  
   However, existing Unicode documentation lists only 19 Arabic-script
   languages out of a total of possibly 200.  It contains a completely
   full Ethiopic section (no blank spaces) listing only a few potential
   Ethiopic-script languages.  The entire Unicode documentation (the 3.0
   book) lists fewer than 150 language names.
  
   In addition, leading phoneticians (UCLA, Berkeley) suggest that 
   the world's languages contain 900 different consonants and 300 different
   vowels. How many of these sounds are covered by variant letter forms within 
   Unicode, in any script?
  
   Does Unicode cover the "tip of the iceberg" linguistically?  Does it cover 1/3
   of languages now?
  
   No one knows.
  
   *Statistics are from about 1999.  SIL is compiling newer
   statistics now.
   **Unicode / ISO acronym for "Chinese, Japanese, Korean and
   Vietnamese" characters.
   ***N'ko is an emerging, mostly Arabic-based West African script.