Re: Tibetan Unicode / ISO10646 BMP comments (fwd)

From: Asian Classics Input Project (acip@well.com)
Date: Thu Jul 25 1996 - 05:09:50 EDT


Forwarded message follows:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

 APPENDICES TO COMMENTS BY ACIP ON UNICODE / ISO 10646 BMP
 ENCODING OF TIBETAN (July 1996)
 
 Prepared by: Robert Chilton, Technical Manager
                The Asian Classics Input Project (ACIP)
 
 Note: In this document, Tibetan is transliterated as per
       ACIP convention.
 
 APPENDIX A: COMMENTS ON SORTING UNICODE / ISO 10646 TIBETAN
 
 Abstract: Candidate sort orders for Tibetan include:
 Choegay, Sanskritic, and Dzongkha (the national language of
 Bhutan). Choegay sort order can be accomplished for mixed
 standard and non-standard orthographies by identifying the
 prescript, if any, and then applying a conventional three
 level sort. The three levels are: alphabetic, diacritical
 variations, and case variations.
 
 There are a number of alternative sort orders that might be
 applied to materials written in Tibetan script. The most
 common order is standard Tibetan (Choegay) sort sequence.
 Sort sequences of standard Tibetan in existing dictionaries
 follow a common set of rules with minor variations (mostly
 seen in the treatment of non-standard orthographies, e.g.,
 foreign loan words). Another likely sort order is
 Sanskritic as in, for example, a Sanskrit-Tibetan
 dictionary. There are two main variations here: (1) strict
 Sanskritic order (vowels first) and (2) modified Sanskritic
 order (vowels last). A third possible sort order is for
 the national language of Bhutan: Dzongkha.
 
 The focus here will be on standard Tibetan (Choegay). It
 is important to distinguish the different sort orders from
 each other and not presume that "correct" Tibetan sort
 order will somehow combine all possible sorting sequences
 within a single ordering scheme. For example, since all
 Sanskritic retroflex consonants and vowels can be
 represented in Tibetan script, one might think that the
 consonant and vowel sequence in sorted Tibetan should
 follow the Sanskrit order. Further analysis, however, will
 show that such an assumption is unfounded.
 
 As described here, standard Tibetan is seen to consist of
 30 consonants and 5 vowels (one inherent vowel plus four
 marked vowels). Consequently, the reversed letters and
 long vowels are treated as variants of their usual forms.
 In particular, with the exception of the anomaly of
 prescript letters, for sorting purposes, Tibetan can be
 treated in the same manner as other languages, such as
 those written in roman script.
 
 In roman script languages, the sorting must be done at (a
 minimum of) three levels. The highest level is the basic
 alphabetic order: A, B, C, etc. At the next level, the
 diacritic letters must be sorted vis-a-vis the
 non-diacritic letters: A, A-umlaut, A-macron, etc. At the
 third level, the sequences which are the same except for
 the case of the letters must be sorted relative to each
 other: A, a and BE, be etc.
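
 The three roman-script levels can be illustrated with a short
 Python sketch. It is not part of the ACIP proposal: the function
 name three_level_key is hypothetical, upper case is placed before
 lower case to match the "A, a" example above, and the relative
 order of the diacritics simply follows their code points.

```python
import unicodedata

def three_level_key(word):
    """Return an (alphabetic, diacritic, case) sort key for a roman word."""
    letters = []  # level 1: base letters, case folded
    marks = []    # level 2: combining marks attached to each base letter
    cases = []    # level 3: 0 = upper case, 1 = lower case
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if marks:
                marks[-1] += ch
        else:
            letters.append(ch.casefold())
            marks.append("")
            cases.append(0 if ch.isupper() else 1)
    return ("".join(letters), tuple(marks), tuple(cases))

# be, a, a-umlaut, BE, A, A-umlaut, a-macron
words = ["be", "a", "\u00e4", "BE", "A", "\u00c4", "\u0101"]
ordered = sorted(words, key=three_level_key)
```

 Because the key is a tuple, a lower level is consulted only
 when every higher level compares equal, which is the behaviour
 described above.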
 
 Applying this schema to standard Tibetan (Choegay), the
 alphabetic order is the usual Tibetan order: KA, KHA, GA,
 NGA, etc. At the second, diacritical, level, there are at
 least 4 diacritical variations: (1) long vowels;
 (2) standard wazur*; (3) tsamdru--fricative mark; (4)
 variants of the bindu--candra+bindu and candra+bindu+nada.
 At the third level, analogous to variation in case, 4
 categories have so far been identified: (1) the reverse
 letters and vowels; (2) non-abbreviated forms of subscribed
 letters WA, YA, RA; (3) the bindu as an alternate form of
 MA; (4) the virama (SROG MED) as an alternate form of the
 inherent vowel.
 
 [*note: standard wazur refers to a wazur on a standard Tibetan
 initial; not to be confused with the abbreviated subscribed
 WA (wazur) seen in non-standard Tibetan, which is identical
 in form, but carries a different lexical meaning.]
 
 Because standard Tibetan consists of only 30 letters plus 5
 vowels, certain Sanskritic single letters must be
 decomposed to their conjunct Tibetan form prior to sorting.
 Thus, GHA becomes GA plus subscribed HA and similarly for
 dHA, DHA, BHA, DZHA, and KshA. In the same way, vocalic Ri
 and Li are decomposed to RA-subscript or LA-subscript,
 respectively, plus vowel i, and similarly for R'i and L'i.
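
 A minimal table-driven sketch of this decomposition, in Python,
 follows. The token spellings are assumptions loosely modelled on
 the transliteration used in this document: "+X" stands for a
 subscribed letter, and the entries for KshA, R'i and L'i are
 inferred from "similarly" rather than spelled out in the text.

```python
# Hypothetical token table: single Sanskritic letter -> conjunct form.
DECOMPOSITION = {
    "GH": ["G", "+H"],
    "dH": ["d", "+H"],
    "DH": ["D", "+H"],
    "BH": ["B", "+H"],
    "DZH": ["DZ", "+H"],
    "Ksh": ["K", "+sh"],   # assumed: KA plus subscribed reverse SHA
    "Ri": ["+R", "i"],     # RA-subscript plus vowel i
    "Li": ["+L", "i"],     # LA-subscript plus vowel i
    "R'i": ["+R", "'i"],   # assumed: same, with the long vowel
    "L'i": ["+L", "'i"],   # assumed: same, with the long vowel
}

def decompose(tokens):
    """Expand every complex token; leave all other tokens untouched."""
    out = []
    for tok in tokens:
        out.extend(DECOMPOSITION.get(tok, [tok]))
    return out
```

 For example, decompose(["GH", "A"]) gives ["G", "+H", "A"], so
 that GHA is subsequently sorted as GA with a subscribed HA.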
 
 Since Tibetan, even non-standard Tibetan, can be handled in
 a manner that is similar to other languages, it can easily
 be sorted once the prescript has been properly handled.
 Unfortunately, handling the prescript can be a rather
 complex matter. There are two major tasks that must be
 accomplished: (1) the inherent vowel must be inserted where
 needed and (2) the prescripts must be identified, revalued
 (higher than the subscripts), and relocated to a position
 in the datastream following the root. Both of these steps
 require checking against a list of valid standard Tibetan
 initials and, perhaps, finals.
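
 The relocation of prescripts might be sketched in Python as
 below. Everything named here is an assumption made for
 illustration: the table holds only three initials (a real one
 would list every valid standard Tibetan initial), "+X" marks a
 subscribed letter, and the "^" flag stands in for whatever
 recoding makes a prescript weigh more than the subscripts.

```python
HIGH = "^"  # hypothetical flag: token recoded to sort above subscripts

# Full consonants of an initial -> number of leading prescripts.
# Illustrative subset only.
PRESCRIPT_COUNT = {
    ("B", "K"): 1,       # BA prefix + root KA (BKA)
    ("R", "K"): 1,       # RA head + root KA (RKA)
    ("B", "R", "K"): 2,  # BA prefix + RA head + root KA
}

def relocate_prescripts(initial):
    """Move prescripts to just after the root and flag them high."""
    full = tuple(t for t in initial if not t.startswith("+"))
    count = PRESCRIPT_COUNT.get(full, 0)
    prescripts, rest = initial[:count], initial[count:]
    return rest[:1] + [HIGH + p for p in prescripts] + rest[1:]
```

 With this table, ["B", "K"] becomes ["K", "^B"], so BKA is
 filed under the root KA; an initial that is not in the table,
 such as ["K", "+Y"], is returned unchanged.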
 
 There may be some benefit in leaving the invisible inherent
 vowel in the datastream, once inserted. There may also be
 some benefit in providing for an invisible prescript-mark
 that would appear following the prescripts to identify them
 as such, although this identification can also be made using
 the inherent vowel and referring to a list of standard
 Tibetan initials. Having the prescripts marked explicitly
 would eliminate the need for further reference to the list
 of standard Tibetan initials.
 
 
 APPENDIX B:
 ALGORITHM FOR CONVENTIONAL TIBETAN (CHOEGAY) SORT ORDER
 
 Abstract: A sketch algorithm is presented for ordering
 both standard and non-standard (i.e., Sanskritic and other
 foreign-origin) orthographies within a single sort sequence
 that follows Choegay sort order.
 
 This sketch assumes that the data stream has been normalized
 with respect to the relative order of consonants and modifiers
 (e.g., explicit vowels, bindus).
 
 Steps required in sorting Unicode Tibetan in standard
 (Choegay) order:
 
 1. Identify the lexical unit: range-check for non-lexical
     codes, i.e., codes that act as lexical unit delimiters.
     Examples: TSEG, blank, SHAD, etc.
 
 2. Decompose complex consonants and vowels (GH dH DH BH
     DZH Ri R'i Li L'i) into their component parts.
 
 3. Identify the primary initial of the lexical unit:
 
      a. match the beginning of the lexical sequence against
         a list of standard Tibetan initials (including standard
         vowel); if non-standard, then decompose to the closest
         standard initial, sans prefix. The initial is either the
         1st column or, for standard orthographies that include a
         prefix, the 1st and 2nd column. This step requires
         insertion of the inherent vowel A where needed.*
 
      b. if the initial is standard Tibetan then:
         -- if there is a simple prescript, exchange places
         between the root and the prescript and recode the
         prescript high (above the subscripts).
         -- if there is a complex prescript, place the root
         before both prescripts and recode the prescripts as
         special BA prefix plus head prescript, both coded high.
         -- else if no prescript, no recoding necessary at
         this step.
 
      c. if the initial is non-standard then:
         -- if RA-, LA-, SA-head prescript, then exchange
         places between root and prescript and recode the
         prescript high.
         -- else if no prescript, no recoding necessary at
         this step.
 
 4. Ignore/conflate diacritics but maintain
      presence/distinction at 2nd level:
      a. ignore tsamdru (fricative flag)
      b. ignore wazur on standard initial
      c. conflate long vowels to their standard counterpart
      d. conflate complex bindus to simple bindu
 
 5. Conflate alternate forms (treat as change of case) but
      maintain distinction at the 3rd level:
      a. conflate reverse letters and vowels to their
          non-reversed counterparts
      b. conflate alternate forms of subscripts WA YA RA
      c. conflate bindu to MA
      d. conflate virama to inherent vowel A (perhaps delete
          adjacent inherent vowel A, if one is present?)
 
 6. Identify secondary vocal units, if any --indicated by
     any of the following:
          vowel-marked suffix,
          subscribed suffix,
          non-final "a-chung",
          multiple suffixes other than standard finals.
     Insert the inherent vowel A where needed.
 
 7. Sort in three levels (passes):
      1st. alphabetical order: A B C
           => KA KHA GA and KA BKA RKA
      2nd. diacritics: A A: (where A: is an A with an umlaut)
           => KA K'A and KA KVA, etc.
      3rd. variance in case: A a
           => TA tA and KRI KRi and YYA Y+YA, etc.
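
 Steps 4, 5 and 7 can be combined into a single three-part sort
 key, sketched here in Python. The tables are small illustrative
 subsets with hypothetical token names ("+V" for the wazur, "'A"
 for long A, lower-case "t" and "i" for the reverse forms); the
 ranking of vowels relative to consonants and the relative weight
 of the individual diacritics are assumptions of the sketch, not
 part of the proposal.

```python
# Level-1 rank of each base token (fragment of the alphabet only).
RANK = {"A": 0, "I": 1, "K": 2, "KH": 3, "G": 4, "NG": 5, "T": 6}

# Tokens ignored at level 1 but counted at level 2 (step 4a/4b).
IGNORED = {"+V": 1}  # wazur on a standard initial

# token -> (base token, level-2 weight, level-3 weight)
CONFLATE = {
    "'A": ("A", 1, 0),  # long vowel -> standard vowel (step 4c)
    "t": ("T", 0, 1),   # reverse letter -> non-reversed (step 5a)
    "i": ("I", 0, 1),   # reverse vowel -> non-reversed (step 5a)
}

def choegay_key(tokens):
    """Build an (alphabetic, diacritic, case-like) key for one unit."""
    level1, level2, level3 = [], [], []
    for tok in tokens:
        if tok in IGNORED:
            if level2:
                level2[-1] += IGNORED[tok]
            continue
        base, diacritic, case = CONFLATE.get(tok, (tok, 0, 0))
        level1.append(RANK[base])
        level2.append(diacritic)
        level3.append(case)
    return (tuple(level1), tuple(level2), tuple(level3))

WORDS = {
    "tA": ["t", "A"], "KVA": ["K", "+V", "A"], "KHA": ["KH", "A"],
    "K'A": ["K", "'A"], "TA": ["T", "A"], "KA": ["K", "A"],
}
ordered = sorted(WORDS, key=lambda name: choegay_key(WORDS[name]))
```

 Here KA, K'A and KVA share the same first-level key and are
 separated only at the second level, while TA and tA differ only
 at the third. Comparing the three parts as one tuple gives the
 same result as three successive passes.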
 
  *Note that the inherent vowel is easily inserted in cases
 of a complex 1st column--composed of a full consonant
 subscribed by one or more subscribed letters--or a complex
 2nd column: the inherent vowel is inserted immediately
 after the last subscribed letter. When a complex 2nd
 column appears in a non-standard initial, an additional
 inherent vowel is inserted after the 1st column full
 consonant. Another simple case is seen when the entire
 lexical unit consists of only 2 full consonants: the
 inherent vowel is inserted between them. When a lexical
 unit begins with 3 or more full consonants, the analysis
 becomes more complex since a truly robust system must
 handle both standard Tibetan orthographies and non-standard
 orthographies. In these latter cases, lists of standard
 initials and finals must be referenced.
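
 The simple cases in this note might be sketched in Python as
 follows. The token conventions are assumptions ("+X" for a
 subscribed letter, "A" for the inherent vowel, I U E O for the
 marked vowels); units that already carry a marked vowel are
 returned unchanged, the single-consonant case is added for
 completeness, and anything that needs the lists of standard
 initials and finals is deliberately left unresolved.

```python
MARKED_VOWELS = {"I", "U", "E", "O"}

def insert_inherent_vowel(tokens):
    """Insert the inherent vowel A in the simple cases; else None."""
    if any(t in MARKED_VOWELS for t in tokens):
        return list(tokens)  # assumed: no inherent vowel needed
    subscripts = [i for i, t in enumerate(tokens) if t.startswith("+")]
    if subscripts:
        # Complex 1st or 2nd column: after the last subscribed letter.
        pos = subscripts[-1] + 1
        return tokens[:pos] + ["A"] + tokens[pos:]
    if len(tokens) == 2:
        # Exactly two full consonants: vowel goes between them.
        return [tokens[0], "A", tokens[1]]
    if len(tokens) == 1:
        # Trivial case, not spelled out in the note.
        return [tokens[0], "A"]
    return None  # 3+ full consonants: needs the lists of initials/finals
```

 For instance, ["K", "+Y"] becomes ["K", "+Y", "A"] and
 ["G", "NG"] becomes ["G", "A", "NG"], while ["B", "S", "G"]
 yields None because it cannot be resolved without the lists.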
 
 ------------------------------------------------------------------
 Robert R. Chilton, Technical Manager
 The Asian Classics Input Project (ACIP)
 New York Area Office: 47 East Fifth Street Howell, NJ 07731
 Tel: 908-364-1824 Fax: 908-901-5940 Email: acip@well.com
 -------------------------------------------------------------------
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

-- 
Christopher J Fynn                   <cfynn@sahaja.demon.co.uk>
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 


