Tibetan Unicode / ISO10646 BMP comments (fwd)

From: Asian Classics Input Project (acip@well.com)
Date: Tue Jul 23 1996 - 16:50:33 EDT

Forwarded message follows:

                             (July 1996)
  Prepared by: Robert Chilton, Technical Manager
                 The Asian Classics Input Project (ACIP)
  INTRODUCTION. Since its inception in 1987, ACIP has input over 1000
  titles of classical Tibetan literature, currently totaling some 45,000
  pages of text. ACIP's computerized Tibetan language database is by far
  the largest in the world. ACIP has also created catalogs of Tibetan-
  language materials, most notably a catalog of the Russian Academy of
  Science's massive collection in St Petersburg--which, at 34,000 entries,
  is now about one-fifth completed. ACIP's director, Mr. Michael Roach,
  has served as a consultant to the U.S. Library of Congress on matters
  concerning Tibetan language materials.
  Overall, ACIP is pleased with the Unicode proposal of September 1995
  (N1255, PDAM-6). Given the number and scope of changes made and
  proposed during the lead-up to the JTC1/SC2/WG2 meeting in June 1995,
  congratulations are due to all participants for settling on a thoroughly
  reasonable proposal.
  ACIP wonders whether this proposal is being rushed. Many of the Tibetan
  experts we know have only recently seen the current proposal, and they
  have no obvious means by which to make their views known. Some period
  of public comment by experts in the field seems appropriate. Perhaps a
  second PDAM for Tibetan would be prudent?
  That said, ACIP reaffirms our view that Tibetan as presented in document
  N1255 is generally adequate for our purposes of encoding and processing
  classical Tibetan (Choegay), but with reservations as noted below.
  1. That the glyph registry be complete enough to encode all of our
  2. That the structure of the code table support lexical processing,
  e.g., conversion and sorting, of our materials. Although sorting is not
  a Unicode concern, it is of vital importance to indexing tools and to
  the work of librarians and bibliographers. Where simple steps can be
  taken to support sorting, such measures should be adopted.
  a. The non-abbreviated forms of subscribed WA, YA, and RA should be
  encoded, but separate from the sequence of normal subscribed forms.
  Within the sequence of subscribed letters, subscribed WA, YA, and RA
  should appear in their normal abbreviated forms (wazur, yata, rata).
  Rationale: Both abbreviated and non-abbreviated forms of these
  subscribed letters can appear in the same document (ACIP can provide
  examples). When subscribed to RA, ACIP encodes these pairs as RVA, RYA,
  RRA (abbreviated forms) and RWA, R+YA, R+RA (non-abbreviated forms). We
  note that R+Y+YA appears with some frequency.
  b. Other glyphs: In comparing the current proposal (PDAM 6, SEPT 95)
  with past proposals, it appears that two glyphs were (inadvertently?)
  omitted: the triple-x ("TIBETAN SIGN THREE DENA") and the large-X
  ("TIBETAN MARK KURUKA"). Two additional glyphs are candidates for
  encoding: the dachey (crescent moon) and the nada (flame). These are
  explained as distinct lexical elements of, for instance, the Mongolian
  national symbol and the Kalachakra symbol, and might well be written
  separately in such explanations. These glyphs are well known and thus
  no illustration is necessary.
  2. LEXICAL PROCESSING CONCERNS. Conversion of Tibetan materials from
  existing formats to Unicode Tibetan, and sorting of Tibetan in both
  Tibetan and Sanskrit sort order, can be achieved with a minimum of
  difficulty if the following provisions are met:
  ACIP strongly recommends:
  a. A code position must be reserved for the invisible inherent vowel A
  just prior to the lengthened vowel A (position 0F70 in the SEPT 95
  b. One code position each must be reserved in both full and subscribed
  letter sequences between JA and NYA (positions 0F48 and 0F98 in the SEPT
  95 proposal) for Sanskritic recode from DZHA.
  c. The consonant and vowel series should remain in (mostly) Sanskritic
  order, as in the SEPT 95 proposal, since such ordering greatly
  facilitates sorting in Sanskrit order and has no effect on sorting in
  Tibetan. It is very helpful to have most or all non-alphabetic (non-
  lexical) glyphs encoded in code positions prior to KA. The vowels
  should maintain their current position--following the full letters and
  prior to the subscribed letters. A constant offset between the full
  letters and their subscribed counterparts should be maintained.
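
  As an illustration of why a constant offset (item c) helps, the sketch
  below converts between full and subscribed consonants with plain
  arithmetic. It assumes the code points of the Tibetan block as later
  published in the Unicode Standard (KA at U+0F40, subjoined KA at
  U+0F90), which may differ from the SEPT 95 draft; the function names
  are illustrative only. The positions 0F48 and 0F98 cited in item b are
  consistent with the same offset of 0x50.

```python
# Assumption: code points are those of the Tibetan block as later published
# in the Unicode Standard; the SEPT 95 draft discussed here may differ.
FULL_FIRST, FULL_LAST = 0x0F40, 0x0F69  # full letters KA .. KSSA
OFFSET = 0x50                           # subjoined KA (0x0F90) - KA (0x0F40)


def to_subjoined(ch: str) -> str:
    """Return the subscribed (subjoined) counterpart of a full consonant."""
    cp = ord(ch)
    if not FULL_FIRST <= cp <= FULL_LAST:
        raise ValueError(f"U+{cp:04X} is outside the assumed full-letter range")
    return chr(cp + OFFSET)


def to_full(ch: str) -> str:
    """Return the full consonant corresponding to a subscribed one."""
    cp = ord(ch)
    if not FULL_FIRST + OFFSET <= cp <= FULL_LAST + OFFSET:
        raise ValueError(f"U+{cp:04X} is outside the assumed subjoined range")
    return chr(cp - OFFSET)
```

  With a constant offset, a sort routine can fold a subscribed letter
  onto its full form without a lookup table, which is the practical
  benefit the recommendation points to.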
  ACIP observes and suggests:
  d. Lexical processing of Dzongkha (Bhutanese) will be greatly
  facilitated by the addition of an invisible TSEG (to mark the end of a
  lexical unit).
  e. An invisible tag marking the boundary between the lexical prescript
  and the lexical root will likely be inserted during lexical processing;
  Unicode may wish to define this code position explicitly rather than
  leaving it to individual application developers to define, each perhaps
  in their own way.
  f. Given the likelihood of additional glyphs appearing after adoption of
  the current proposal, Unicode may wish to leave empty code positions
  prior to the full letter KA (or else prior to the first encoded Tibetan
  character). ACIP does not understand why the empty code positions
  follow full letter KSHRA since it is not likely that many new quasi-
  alphabetical (lexical) glyphs will be proposed for inclusion. It seems
  sensible to shift the entire letter sequence of KA through KSHRA down
  six code positions, thus freeing up code positions prior to KA.
  Similarly, it may be preferable to shift the entire alphabetic (lexical)
  section--consisting of the two consonant sequences and the intervening
  vowel & sundry sequence--to the end of the reserved code space, thus
  freeing up more open code positions prior to the lexical characters.
  a. Some of the character names, such as those of the reversed letters,
  need editing. Note that where Wylie transliteration (lowercase) uses
  tsa and tsha, ACIP transliteration (uppercase) uses TZA and TSA.
  b. For ease of processing during rendering, marks that apply to an
  entire syllable such as 0F35, 0F37, 0F86, 0F87 should be grouped
  together, in order to support range checking.
  c. ACIP does not understand the rationale behind encoding the
  precomposed characters at positions 0F00, 0F02, and 0F03.
  d. Blank space in Tibetan obeys very different conventions from blank
  space in most Roman-script text. ACIP wonders if it would be useful and
  appropriate to define a Tibetan version of blank space--which is more
  properly BLANK or GAP or HORIZONTAL SEPARATION--since, unlike
  conventional <SPACE>, it is not really unitary nor additive.
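
  The range-checking point in item b can be sketched as follows. The
  four scattered positions are the ones named in item b (SEPT 95 draft
  numbering). The contiguous range is purely hypothetical, invented here
  to show what a regrouping would allow; it is not a proposed
  assignment.

```python
# Positions named in item b (SEPT 95 draft numbering). Because they are
# scattered, a renderer needs a set or table lookup to recognize them.
SCATTERED_SYLLABLE_MARKS = frozenset({0x0F35, 0x0F37, 0x0F86, 0x0F87})


def is_syllable_mark_scattered(cp: int) -> bool:
    """Membership test against an explicit set of code points."""
    return cp in SCATTERED_SYLLABLE_MARKS


# HYPOTHETICAL: suppose the same four marks were regrouped into one
# contiguous run. These bounds are invented for illustration only.
GROUP_FIRST = 0x0F84
GROUP_LAST = 0x0F87


def is_syllable_mark_grouped(cp: int) -> bool:
    """Range check: two comparisons, no table needed."""
    return GROUP_FIRST <= cp <= GROUP_LAST
```

  Both tests are cheap on modern hardware; the grouped form mainly
  simplifies the code and keeps it correct if further marks are later
  added inside the same run.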
  4. APPENDICES (available immediately upon request)
  Abstract: Candidate sort orders for Tibetan include: Choegay,
  Sanskritic, and Dzongkha (the national language of Bhutan). Choegay
  sort order can be accomplished for mixed standard and non-standard
  orthographies by identifying the prescript, if any, and then applying a
  conventional three level sort. The three levels are: alphabetic,
  diacritical variations, and case variations.
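
  A conventional three-level sort of the kind this abstract mentions can
  be sketched with a tuple sort key. The example uses Latin toy data,
  not Tibetan, and omits the prescript-identification step entirely; how
  the "diacritical" and "case" levels map onto Tibetan is not specified
  here, so this only shows the general mechanism and is not ACIP's
  algorithm.

```python
import unicodedata


def three_level_key(word: str):
    """Build a (primary, secondary, tertiary) key for a multi-level sort.

    Assumption: any prescript handling has already been done upstream.
    """
    decomposed = unicodedata.normalize("NFD", word)
    # Level 1 (alphabetic): base letters only, case ignored.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    primary = base.casefold()
    # Level 2 (diacritical): combining marks kept, case still ignored.
    secondary = decomposed.casefold()
    # Level 3 (case): the decomposed form with its original case.
    tertiary = decomposed
    return (primary, secondary, tertiary)


words = ["rose", "r\u00e9sum\u00e9", "resume", "Resume"]
ordered = sorted(words, key=three_level_key)
```

  Tuples compare element by element, so the later levels are consulted
  only when the earlier ones tie: here the three "resume" variants are
  grouped ahead of "rose", and diacritics and case only break ties
  within that group.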
  Abstract: A sketch algorithm is presented for ordering both standard
  and non-standard (i.e., Sanskritic and other foreign-origin)
  orthographies within a single sort sequence that follows Choegay sort
  Robert R. Chilton, Technical Manager
  The Asian Classics Input Project (ACIP)
  New York Area Office: 47 East Fifth Street, Howell, NJ 07731
  Tel: 908-364-1824 Fax: 908-901-5940 Email: acip@well.com

Christopher J Fynn                   <cfynn@sahaja.demon.co.uk>
