Re: Tibetan Unicode / ISO10646 BMP comments (fwd)

From: Martin J Duerst (mduerst@ifi.unizh.ch)
Date: Thu Jul 25 1996 - 11:09:56 EDT


Some comments on the following comments:

> COMMENTS BY ACIP ON UNICODE /ISO10646 BMP ENCODING OF TIBETAN
> (July 1996)
>
> Prepared by: Robert Chilton, Technical Manager
> The Asian Classics Input Project (ACIP)

(large parts deleted)
> ACIP HAS TWO MAIN CONCERNS REGARDING UNICODE TIBETAN:
> 1. That the glyph registry be complete enough to encode all of our
> database.

[Just as an aside, this should obviously read "character repertoire"
and not "glyph registry".]

> 2. That the structure of the code table support lexical processing,
> e.g., conversion and sorting, of our materials. Although sorting is not
> a Unicode concern, it is of vital importance to indexing tools and to
> the work of librarians and bibliographers. Where simple steps can be
> taken to support sorting, such measures should be adopted.

Given the overall complexity of the sorting task, as evident from the
description of the sorting algorithm in an accompanying mail,
the simplification that can be achieved by any changes at
the character encoding level seems to be relatively minimal.

> ACIP strongly recommends:
>
> a. A code position must be reserved for the invisible inherent vowel A
> just prior to the lengthened vowel A (position 0F70 in the SEPT 95
> proposal).
>
> b. One code position each must be reserved in both full and subscribed
> letter sequences between JA and NYA (positions 0F48 and 0F98 in the SEPT
> 95 proposal) for Sanskritic recode from DZHA.
>
> c. The consonant and vowel series should remain in (mostly) Sanskritic
> order, as in the SEPT 95 proposal, since such ordering greatly
> facilitates sorting in Sanskrit order and has no affect on sorting in
> Tibetan. It is very helpful to have most or all non-alphabetic (non-
> lexical) glyphs encoded in code positions prior to KA. The vowels
> should maintain their current position--following the full letters and
> prior to the subscribed letters. A constant offset between the full
> letters and their subscribed counterparts should be maintained.

These three steps would only eliminate a single one-to-one mapping
from Unicode values to some other list of values.
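
To make the point concrete, here is a minimal sketch of such a
one-to-one mapping. The collating order in DESIRED_ORDER is purely
hypothetical (it is deliberately not code point order, and it is not
a real Tibetan or Sanskrit order), and the three code points are only
placeholders; nothing here is taken from the proposal itself.

```python
# Hypothetical collating order, listed first to last. It is chosen to
# differ from code point order so the effect of the table is visible.
DESIRED_ORDER = "\u0f42\u0f40\u0f41"

# The one-to-one map: character -> rank in the desired order.
RANK = {ch: i for i, ch in enumerate(DESIRED_ORDER)}


def sort_key(text):
    """Replace each character by its rank; unlisted characters sort
    after all listed ones, in code point order."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in text]


words = ["\u0f41", "\u0f40\u0f42", "\u0f42\u0f40"]
ordered = sorted(words, key=sort_key)
```

With a table like this in place, where the characters sit in the code
table no longer affects the result; rearranging the code table only
saves this one lookup step.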

> ACIP observes and suggests:
>
> d. Lexical processing of Dzongkha (Bhutanese) will be greatly
> facilitated by the addition of an invisible TSEG (to mark the end of a
> lexical unit).

If a "lexical unit" is something like a word, probably the zero-width space
should/could be used here.
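
As a rough sketch of what that would mean for processing, assuming
the zero-width space (U+200B) really is used as the invisible
delimiter (which is exactly the open question here), splitting text
into lexical units becomes a plain string split. The function name
lexical_units is made up for this example.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, assumed here as the invisible delimiter


def lexical_units(text):
    """Split text at zero-width spaces, dropping empty pieces."""
    return [unit for unit in text.split(ZWSP) if unit]


units = lexical_units("ab" + ZWSP + "cd" + ZWSP)
```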

> e. An invisible tag marking the boundary between the lexical prescript
> and the lexical root will likely be inserted during lexical processing;
> Unicode may wish to define this code position explicitly rather than
> leaving it up to the various applications developers to define, each in
> their own way perhaps.

Is this something that remains in the text (such as SHY and the like)?
If so, this proposal makes sense. If not, this code point does not
seem necessary.

> f. Given the likelihood of additional glyphs appearing after adoption of
> the current proposal, Unicode may wish to leave empty code positions
> prior to the full letter KA (or else prior to the first encoded Tibetan
> character). ACIP does not understand why the empty code positions
> follow full letter KSHRA since it is not likely that many new quasi-
> alphabetical (lexical) glyphs will be proposed for inclusion. It seems
> sensible to shift the entire letter sequence of KA through KSHRA down
> six code positions, thus freeing up code positions prior to KA.
> Similarly, it may be preferable to shift the entire alphabetic (lexical)
> section--consisting of the two consonant sequences and the intervening
> vowel & sundry sequence--to the end of the reserved code space, thus
> freeing up more open code positions prior to the lexical characters.

This is again a single one-to-one map prior to sorting. That and much
more is done for most other scripts.

> 3. MINOR ISSUES.

> d. Blank space in Tibetan obeys very different conventions from blank
> space in most roman scripts. ACIP wonders if it would be useful and
> appropriate to define a Tibetan version of blank space--which is more
> properly BLANK or GAP or HORIZONTAL SEPARATION--since, unlike
> conventional <SPACE>, it is not really unitary nor additive.

There are already quite a few spaces :-). Maybe there are other languages
and occasions where a Tibetan space-like phenomenon occurs.
But I don't understand the exact difference; maybe somebody can
give some more information?

Regards, Martin.

----
Dr.sc.  Martin J. Du"rst			    ' , . p y f g c R l / =
Institut fu"r Informatik			     a o e U i D h T n S -
der Universita"t Zu"rich			      ; q j k x b m w v z
Winterthurerstrasse  190			     (the Dvorak keyboard)
CH-8057   Zu"rich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d	   Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch
----


