Re: Tibetan Unicode / ISO10646 BMP comments (fwd)

From: Asian Classics Input Project (acip@well.com)
Date: Thu Jul 25 1996 - 05:09:50 EDT

Next message: Smita Desai: "RE: UC data"
Previous message: Asian Classics Input Project: "Tibetan Unicode / ISO10646 BMP comments (fwd)"
Maybe in reply to: Asian Classics Input Project: "Tibetan Unicode / ISO10646 BMP comments (fwd)"
Next in thread: Glenn Adams: "Re: Tibetan Unicode / ISO10646 BMP comments (fwd)"

Forwarded message follows:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

APPENDICES TO COMMENTS BY ACIP ON UNICODE / ISO 10646 BMP
ENCODING OF TIBETAN (July 1996)

Prepared by: Robert Chilton, Technical Manager
                The Asian Classics Input Project (ACIP)

Note: In this document, Tibetan is transliterated as per
       ACIP convention.

APPENDIX A: COMMENTS ON SORTING UNICODE / ISO 10646 TIBETAN

Abstract: Candidate sort orders for Tibetan include:
Choegay, Sanskritic, and Dzongkha (the national language of
Bhutan). Choegay sort order can be accomplished for mixed
standard and non-standard orthographies by identifying the
prescript, if any, and then applying a conventional three
level sort. The three levels are: alphabetic, diacritical
variations, and case variations.

There are a number of alternative sort orders that might be
applied to materials written in Tibetan script. The most
common order is standard Tibetan (Choegay) sort sequence.
Sort sequences of standard Tibetan in existing dictionaries
follow a common set of rules with minor variations (mostly
seen in the treatment of non-standard orthographies, e.g.,
foreign loan words). Another likely sort order is
Sanskritic, as found in, for example, a Sanskrit-Tibetan
dictionary. There are two main variations here: (1) strict
Sanskritic order (vowels first) and (2) modified Sanskritic
order (vowels last). A third possible sort order is for
the national language of Bhutan: Dzongkha.

The focus here will be on standard Tibetan (Choegay). It
is important to distinguish the different sort orders from
each other and not presume that "correct" Tibetan sort
order will somehow combine all possible sorting sequences
within a single ordering scheme. For example, since all
Sanskritic retroflex consonants and vowels can be
represented in Tibetan script, one might think that the
consonant and vowel sequence in sorted Tibetan should
follow the Sanskrit order. Further analysis, however, will
show that such an assumption is unfounded.

As described here, standard Tibetan is seen to consist of
30 consonants and 5 vowels (one inherent vowel plus four
marked vowels). Consequently, the reversed letters and
long vowels are treated as variants of their usual
forms. In particular, apart from the anomaly of
prescript letters, Tibetan can be treated for sorting
purposes in the same manner as other languages, such as
those written in roman script.

In roman script languages, the sorting must be done at (a
minimum of) three levels. The highest level is the basic
alphabetic order: A, B, C, etc. At the next level, the
diacritic letters must be sorted vis-a-vis the
non-diacritic letters: A, A-umlaut, A-macron, etc. At the
third level, the sequences which are the same except for
the case of the letters must be sorted relative to each
other: A, a and BE, be etc.
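
As a rough illustration of the three levels for roman script, the
following Python sketch builds a sort key of three parts. The
function name three_level_key and the DIACRITIC_RANK table are
assumptions made for this example only; the uppercase-before-
lowercase tie-break simply follows the "A, a" order given above.

```python
import unicodedata

# Hypothetical secondary ranks, following the order "A, A-umlaut, A-macron".
DIACRITIC_RANK = {"\u0308": 1, "\u0304": 2}  # combining diaeresis, combining macron


def three_level_key(word):
    """Return (primary, secondary, tertiary) for a roman-script word."""
    decomposed = unicodedata.normalize("NFD", word)
    base = [c for c in decomposed if not unicodedata.combining(c)]
    marks = [c for c in decomposed if unicodedata.combining(c)]
    primary = "".join(base).lower()                           # level 1: alphabetic
    secondary = tuple(DIACRITIC_RANK.get(m, 99) for m in marks)  # level 2: diacritics
    tertiary = tuple(c.islower() for c in base)               # level 3: case, upper first
    return (primary, secondary, tertiary)


words = ["be", "\u00e4", "a", "BE", "A"]  # "\u00e4" is a-umlaut
print(" ".join(sorted(words, key=three_level_key)))
```

Comparing the tuples left to right gives the same result as three
successive passes: a lower level is consulted only when all higher
levels are equal.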

Applying this schema to standard Tibetan (Choegay), the
alphabetic order is the usual Tibetan order: KA, KHA, GA,
NGA, etc. At the second, diacritical, level, there are at
least 4 diacritical variations: (1) long vowels;
(2) standard wazur*; (3) tsamdru (the fricative mark); (4)
variants of the bindu: candra+bindu and candra+bindu+nada.
At the third level, analogous to variation in case, 4
categories have so far been identified: (1) the reverse
letters and vowels; (2) non-abbreviated forms of subscribed
letters WA, YA, RA; (3) the bindu as an alternate form of
MA; (4) the virama (SROG MED) as an alternate form of the
inherent vowel.

[*note: standard wazur refers to a wazur on a standard Tibetan
initial; not to be confused with the abbreviated subscribed
WA (wazur) seen in non-standard Tibetan, which is identical
in form, but carries a different lexical meaning.]

Because standard Tibetan consists of only 30 letters plus 5
vowels, certain Sanskritic single letters must be
decomposed to their conjunct Tibetan form prior to sorting.
Thus, GHA becomes GA plus subscribed HA and similarly for
dHA, DHA, BHA, DZHA, and KshA. In the same way, vocal Ri
and Li are decomposed to RA-subscript or LA-subscript plus
vowel i, and similarly for R'i and L'i.
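
The decomposition can be pictured as a simple table lookup. The
sketch below works on ACIP-style transliteration tokens rather than
on character codes; the token list format, the "+" prefix used to
mark a subscribed letter, and the function name decompose are all
assumptions for illustration. The entries for Ksh, R'i and L'i are
one reading of "similarly" above, not a statement of the final
decomposition.

```python
# Hypothetical decomposition table. "+X" marks a subscribed letter X.
DECOMPOSITION = {
    "GH": ["G", "+H"],
    "dH": ["d", "+H"],
    "DH": ["D", "+H"],
    "BH": ["B", "+H"],
    "DZH": ["DZ", "+H"],
    "Ksh": ["K", "+sh"],    # assumed: KA plus subscribed reversed SHA
    "Ri": ["+R", "i"],
    "Li": ["+L", "i"],
    "R'i": ["+R", "'i"],    # assumed: long counterpart of the vowel
    "L'i": ["+L", "'i"],    # assumed: long counterpart of the vowel
}


def decompose(tokens):
    """Replace each complex token by its component parts; leave others as-is."""
    out = []
    for tok in tokens:
        out.extend(DECOMPOSITION.get(tok, [tok]))
    return out


print(decompose(["GH", "A"]))   # GHA
print(decompose(["K", "Ri"]))   # KA with vocalic Ri
```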

Since Tibetan, even non-standard Tibetan, can be handled in
a manner that is similar to other languages, it can easily
be sorted once the prescript has been properly handled.
Unfortunately, handling the prescript can be a rather
complex matter. There are two major tasks that must be
accomplished: (1) the inherent vowel must be inserted where
needed and (2) the prescripts must be identified, revalued
(higher than the subscripts), and relocated to a position
in the datastream following the root. Both of these steps
require checking against a list of valid standard Tibetan
initials and, perhaps, finals.
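
A minimal sketch of the second task, assuming the inherent vowel
has already been inserted and that the list of standard initials is
available as a lookup table. The table contents, the "^" prefix
used to mark a prescript as recoded high, and the function name
relocate_prescripts are hypothetical; a real table would cover all
standard Tibetan initials.

```python
# Hypothetical excerpt: standard initial -> number of leading prescripts.
PRESCRIPTS_BY_INITIAL = {
    ("B", "R", "K"): 2,   # BA prefix plus RA head on root KA
    ("B", "K"): 1,        # BA prefix on root KA
    ("R", "K"): 1,        # RA head on root KA
}


def relocate_prescripts(tokens):
    """Move prescripts to just after the root and mark them with '^'."""
    tokens = list(tokens)
    # Try longer initials first so that B+R+K is not mistaken for B+...
    for initial in sorted(PRESCRIPTS_BY_INITIAL, key=len, reverse=True):
        if tuple(tokens[:len(initial)]) == initial:
            n = PRESCRIPTS_BY_INITIAL[initial]
            root = tokens[n]
            marked = ["^" + p for p in tokens[:n]]
            return [root] + marked + tokens[n + 1:]
    return tokens  # no prescript found: nothing to relocate


print(relocate_prescripts(["B", "K", "A"]))       # BKA
print(relocate_prescripts(["B", "R", "K", "A"]))  # BRKA
print(relocate_prescripts(["B", "A", "K"]))       # BAK: B is the root here
```

The last call shows why the inherent vowel has to be in place
first: without it, BAK and BKA would present the same consonant
sequence to the lookup.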

There may be some benefit in leaving the invisible inherent
vowel in the datastream, once inserted. There may also be
some benefit in providing for an invisible prescript-mark
that would appear following the prescripts to identify them
as such, although this identification can also be made using
the inherent vowel and referring to a list of standard
Tibetan initials. Having the prescripts marked explicitly
would eliminate the need for further reference to the list
of standard Tibetan initials.

APPENDIX B:
ALGORITHM FOR CONVENTIONAL TIBETAN (CHOEGAY) SORT ORDER

Abstract: A sketch algorithm is presented for ordering
both standard and non-standard (i.e., Sanskritic and other
foreign-origin) orthographies within a single sort sequence
that follows Choegay sort order.

The algorithm assumes that the data stream is normalized with
respect to the relative order of consonants and modifiers
(e.g., explicit vowels, bindus).

Steps required in sorting Unicode Tibetan in standard
(Choegay) order:

1. Identify the lexical unit: range check for non-lexical
     codes, i.e., codes that act as lexical unit delimiters.
     Examples: TSEG, blank, SHAD, etc.

2. Decompose complex consonants and vowels (GH dH DH BH
     DZH Ri R'i Li L'i) into their component parts.

3. Identify the primary initial of the lexical unit

      a. match the beginning of the lexical sequence against
         a list of standard Tibetan initials (including standard
         vowel); if non-standard, then decompose to the closest
         standard initial, sans prefix. The initial is either the
         1st column or, for standard orthographies that include a
         prefix, the 1st and 2nd column. This step requires
         insertion of the inherent vowel A where needed.*

      b. if the initial is standard Tibetan then:
         -- if there is a simple prescript, exchange places
         between the root and the prescript and recode the
         prescript high (above the subscripts).
         -- if there is a complex prescript, place the root
         before both prescripts and recode the prescripts as
         special BA prefix plus head prescript, both coded high.
         -- else if no prescript, no recoding necessary at
         this step.

      c. if the initial is non-standard then:
         -- if RA-, LA-, SA-head prescript, then exchange
         places between root and prescript and recode the
         prescript high.
         -- else if no prescript, no recoding necessary at
         this step.

4. Ignore/conflate diacritics but maintain
      presence/distinction at 2nd level:
      a. ignore tsamdru (fricative flag)
      b. ignore wazur on standard initial
      c. conflate long vowels to their standard counterpart
      d. conflate complex bindus to simple bindu

5. Conflate alternate forms (treat as change of case) but
      maintain distinction at the 3rd level:
      a. conflate reverse letters and vowels to their
          non-reversed counterpart
      b. conflate alternate forms of subscripts WA YA RA
      c. conflate bindu to MA
      d. conflate virama to inherent vowel A (perhaps delete
          adjacent inherent vowel A, if one is present?)

6. Identify secondary vocal units, if any --indicated by
     any of the following:
          vowel-marked suffix,
          subscribed suffix,
          non-final "a-chung",
          multiple suffixes other than standard finals.
     Insert the inherent vowel A where needed.

7. Sort in three levels (passes):
      1st. alphabetical order: A B C
           => KA KHA GA and KA BKA RKA
      2nd. diacritics: A A: (where A: is an A with an umlaut)
           => KA K'A and KA KVA, etc.
      3rd. variance in case: A a
           => TA tA and KRI KRi and YYA Y+YA, etc.
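
The way steps 3 and 7 fit together can be sketched in Python as a
single three-part sort key. Everything named here is hypothetical:
the Unit record stands for a lexical unit that has already been
analysed (root and prescripts identified), and the rank tables are
small excerpts with made-up values. Placing the recoded prescripts
ahead of the subscripts inside the key is also an assumption; the
steps above only require that prescripts follow the root and rank
higher than subscripts.

```python
from typing import NamedTuple, Tuple

# Partial, hypothetical rank tables (ACIP-style transliteration).
ROOT = {"K": 1, "KH": 2, "G": 3, "NG": 4}
VOWEL = {"A": 0, "I": 1, "U": 2, "E": 3, "O": 4}
SUBSCRIPT = {"Y": 1, "R": 2, "L": 3}
# Prescripts are "coded high": every rank exceeds every subscript rank.
PRESCRIPT = {"G": 101, "D": 102, "B": 103, "M": 104, "'": 105,
             "R": 106, "L": 107, "S": 108}
SPECIAL_BA = 200  # BA prefix standing before a head prescript (step 3b)
DIACRITIC = {"long": 1, "wazur": 2, "tsamdru": 3}
VARIANT = {"reversed": 1, "full-subscript": 2}


class Unit(NamedTuple):
    label: str                          # transliteration, for display only
    root: str
    prescripts: Tuple[str, ...] = ()
    subscripts: Tuple[str, ...] = ()
    vowel: str = "A"
    diacritics: Tuple[str, ...] = ()    # conflated in step 4
    variants: Tuple[str, ...] = ()      # conflated in step 5


def recode_prescripts(pres):
    """Recode prescripts high; a complex prescript gets the special BA."""
    if len(pres) == 2:
        return (SPECIAL_BA, PRESCRIPT[pres[1]])
    return tuple(PRESCRIPT[p] for p in pres)


def sort_key(u):
    stack = recode_prescripts(u.prescripts) + tuple(SUBSCRIPT[s] for s in u.subscripts)
    primary = (ROOT[u.root], stack, VOWEL[u.vowel])                # 1st level
    secondary = tuple(sorted(DIACRITIC[d] for d in u.diacritics))  # 2nd level
    tertiary = tuple(sorted(VARIANT[v] for v in u.variants))       # 3rd level
    return (primary, secondary, tertiary)


units = [
    Unit("KHA", "KH"),
    Unit("RKA", "K", prescripts=("R",)),
    Unit("K'A", "K", diacritics=("long",)),
    Unit("BKA", "K", prescripts=("B",)),
    Unit("KA", "K"),
]
print(" ".join(u.label for u in sorted(units, key=sort_key)))
# prints: KA K'A BKA RKA KHA
```

Sorting once on the tuple is equivalent to the three passes listed
in step 7, since the 2nd and 3rd parts are compared only when the
parts before them are equal.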

  *Note that the inherent vowel is easily inserted in cases
of a complex 1st column--composed of a full consonant
subscribed by one or more subscribed letters--or a complex
2nd column: the inherent vowel is inserted immediately
after the last subscribed letter. When a complex 2nd
column appears in a non-standard initial, an additional
inherent vowel is inserted after the 1st column full
consonant. Another simple case is seen when the entire
lexical unit consists of only 2 full consonants: the
inherent vowel is inserted between them. When a lexical
unit begins with 3 or more full consonants, the analysis
becomes more complex since a truly robust system must
handle both standard Tibetan orthographies and non-standard
orthographies. In these latter cases, lists of standard
initials and finals must be referenced.
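
The two simple cases in the note can be sketched as follows. The
token format (ACIP-style letters, with "+" marking a subscribed
letter) and the function name insert_inherent_vowel are assumptions
for this example. The function deliberately returns None for
anything else, including units that begin with 3 or more full
consonants, since those need the lists of standard initials and
finals.

```python
VOWELS = {"A", "I", "U", "E", "O"}


def is_subscript(tok):
    return tok.startswith("+")


def insert_inherent_vowel(tokens):
    """Insert the inherent vowel A in the two simple cases; else return None."""
    tokens = list(tokens)
    # Case 1: complex 1st column, i.e. a full consonant with subscribed letters.
    if len(tokens) >= 2 and not is_subscript(tokens[0]) and is_subscript(tokens[1]):
        end = 1
        while end < len(tokens) and is_subscript(tokens[end]):
            end += 1
        if end < len(tokens) and tokens[end] in VOWELS:
            return tokens  # an explicit vowel is already present
        return tokens[:end] + ["A"] + tokens[end:]
    # Case 2: the whole unit is exactly two full consonants.
    if len(tokens) == 2 and not any(t in VOWELS for t in tokens):
        return [tokens[0], "A", tokens[1]]
    return None  # needs the lists of standard initials and finals


print(insert_inherent_vowel(["K", "+Y"]))       # A goes after the subscript
print(insert_inherent_vowel(["D", "NG"]))       # A goes between the two
print(insert_inherent_vowel(["B", "S", "G"]))   # not a simple case
```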

------------------------------------------------------------------
Robert R. Chilton, Technical Manager
The Asian Classics Input Project (ACIP)
New York Area Office: 47 East Fifth Street Howell, NJ 07731
Tel: 908-364-1824 Fax: 908-901-5940 Email: acip@well.com
-------------------------------------------------------------------
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

-- 
Christopher J Fynn                   <cfynn@sahaja.demon.co.uk>
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:31 EDT