Mongolian Encoding

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Mon Dec 16 2002 - 08:40:22 EST

    As promised, here are some questions on the encoding of Mongolian that have
    arisen whilst writing an input method for the Mongolian script (the questions
    are relevant to the Todo, Manchu and Sibe scripts as well, but I'll restrict
    myself to Mongolian for the moment). I don't know if anyone is able to answer
    all of my questions, but I hope that someone on the list will be able to give me
    some much-needed advice.

    1. Documentation
    Section 11.4 of the Unicode Standard notes that a group of experts from
    Mongolia, China and the West are to publish a document called "User's Convention
    for System Implementation of the International Standard on Mongolian Encoding"
    which will explicitly define Mongolian character shaping behaviour in full. WG2
    document N1980 (http://std.dkuug.dk/jtc1/sc2/WG2/docs/n1980.doc) also states
    that Mongolian, Chinese and English versions of the "User's Convention" will be
    prepared by Mongolia and China. I have been unable to locate this document on
    the internet. Does it exist, and if so, can it be made publicly available?
    Without the aid of such a document it seems almost impossible to correctly
    implement the Unicode encoding of Mongolian.
    In its stead I have been using the document "Traditional Mongolian Script in the
    ISO/IEC 10646 and Unicode Standards" (UNU/IIST Report No. 170, August 1999)
    written by Myatav Erdenechimeg, Richard Moore and Yumbayar Namsrai as a guide to
    Mongolian character shaping behaviour. It seems to provide all the information I
    would expect to see in the "User's Convention", but I am not sure how
    authoritative this paper is, or what its relationship is to the "User's
    Convention" (if any).

    2. Free Variation Selectors
    The Mongolian Free Variation Selectors (U+180B, U+180C and U+180D) are used to
    distinguish variant graphic forms of the same positional forms of a character. I
    would say that there are three categories of variant forms governed by the
    variation selectors:
    A. Non-contextual variants, such as variant forms of letters that are used in
    foreign words (e.g. the use of a "reclining" letter D -- U+1833 + FVS1 -- in
    foreign words), and graphic variations that are due to differences between
    traditional and modern orthography. Such variants must be explicitly encoded by
    use of the appropriate variation selector in order for the correct form to be
    selected by the rendering engine.
    B. Contextual variants that are determined by the overall composition of the
    word in which they are found, such as the use of the long-toothed forms of the
    letters OE and UE (U+1825/1826 + FVS1) in the first syllable of a word only, or
    the use of the feminine form of the letter G (U+182D + FVS3) between consonants
    or the letter I (which is neutral) in a feminine word. In these cases I would
    imagine that it is too much to ask the rendering engine to work out the correct
    variant form, and the correct variant should be explicitly encoded using the
    appropriate variation selector.
    C. Contextual variants that can be determined from their neighbouring letters,
    such as the medial form of the letter G with two dots that is used before a
    vowel (U+182D + FVS2), or the form of the letter A that is written with a
    forward tail when occurring finally after the letters B, P, F and K (U+1820 +
    FVS1). In these cases is it necessary to explicitly encode the variant form with
    the appropriate variation selector? The Standard says "For cases in which the
    contextual sequence of basic letters is not sufficient for a rendering engine to
    uniquely determine the appropriate glyph for a particular letter, additional
    format characters are provided so that the typist may specify the desired
    rendering". Should we assume that the rendering engine will correctly select the
    dotted form of medial G before a vowel and the dotless form before a consonant,
    or would it be wiser to explicitly encode the appropriate variation selector
    anyway?
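
    To make the "encode it anyway" option for category C concrete, here is a
    minimal Python sketch of what an input method could do. It follows only the
    simplified rule stated above (medial G directly before a vowel takes FVS2).
    The function names and the vowel set (U+1820 to U+1826) are my own
    assumptions for illustration, not anything defined by the Standard.

```python
FVS2 = "\u180C"   # MONGOLIAN FREE VARIATION SELECTOR TWO
G = "\u182D"      # MONGOLIAN LETTER GA

# ASSUMPTION: treat the seven basic vowels A, E, I, O, U, OE, UE as "vowels".
VOWELS = {chr(cp) for cp in range(0x1820, 0x1827)}


def mark_medial_g(word: str) -> str:
    """Insert FVS2 after every medial G that is directly followed by a vowel.

    A G that is already followed by a variation selector is left alone,
    because the next character is then not a vowel.
    """
    out = []
    last = len(word) - 1
    for i, ch in enumerate(word):
        out.append(ch)
        # Medial means neither the first nor the last character of the word.
        if ch == G and 0 < i < last and word[i + 1] in VOWELS:
            out.append(FVS2)
    return "".join(out)


def code_points(text: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    # A + G + A: the G is medial and precedes a vowel.
    print(code_points(mark_medial_g("\u1820\u182D\u1820")))
```

    If rendering engines can be relied upon to make this choice themselves, a
    pass like this would be redundant, which is exactly the point in question.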

    3. Mongolian Vowel Selector
    The Mongolian Vowel Selector (U+180E) is used to separate the vowels A and E
    from certain preceding consonants (e.g. ...N + MVS + A = U+1828 U+180E U+1820).
    After MVS the vowels A and E use the forward tail variant which is physically
    offset from the preceding consonant by narrow whitespace. These variant forms of
    A and E are selected by the presence of a preceding MVS, and there appears to be
    no need to select the variant A or E separately by means of a variation
    selector.
    However, not only does the MVS affect the following A or E, but the preceding
    consonant may also take a variant form when followed by an offset A or E. This
    is the case for the letters N, Q, G, J, Y and W. The variant forms of these
    letters when preceding an offset A or E are given in Unicode's Standardized
    Variants document (N, Q, G, J and Y are given as medial variants, but W is given
    as a final variant which is perhaps wrong). My question is, should the variant
    form of the consonant preceding the offset A or E be explicitly encoded using
    the appropriate variation selector, or is the presence of the following MVS
    sufficient for the rendering engine to select the correct variant form?
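
    The two candidate encodings can be written out side by side. This is only a
    sketch: the helper name is hypothetical, and the consonant_selector
    parameter is a placeholder, since which selector (if any) belongs on the
    consonant is precisely what I am asking.

```python
MVS = "\u180E"            # MONGOLIAN VOWEL SEPARATOR
A, E = "\u1820", "\u1821"  # MONGOLIAN LETTER A, MONGOLIAN LETTER E


def offset_vowel_sequence(consonant: str, vowel: str,
                          consonant_selector: str = "") -> str:
    """Build consonant (+ optional selector) + MVS + vowel.

    consonant_selector is a placeholder for an explicit variation selector
    on the consonant; leave it empty to rely on the MVS alone.
    """
    if vowel not in (A, E):
        raise ValueError("only A and E are described as following MVS")
    return consonant + consonant_selector + MVS + vowel


if __name__ == "__main__":
    n = "\u1828"  # MONGOLIAN LETTER NA
    implicit = offset_vowel_sequence(n, A)
    # "\u180B" (FVS1) is used here purely as an example selector.
    explicit = offset_vowel_sequence(n, A, consonant_selector="\u180B")
    for seq in (implicit, explicit):
        print(" ".join(f"U+{ord(ch):04X}" for ch in seq))
```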

    4. Variant forms of the Mongolian Birga
    Appendix A of "Traditional Mongolian Script in the ISO/IEC 10646 and Unicode
    Standards" lists four variant forms of the Mongolian Birga (U+1800):
    1st variant form = U+1800 + FVS1
    2nd variant form = U+1800 + FVS2
    3rd variant form = U+1800 + FVS3
    4th variant form = U+1800 + ZWJ

    Unicode's Standardized Variants document
    (http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html) does not list
    any variants for the Mongolian Birga. Moreover, it warns "All combinations not
    listed here are unspecified and are reserved for future standardization; no
    conformant process may interpret them as standardized variants." This clearly
    means that these Birga variants should not currently be recognised. But given
    that the Birga does occur in a number of forms, Unicode should either define
    standardized variants for them or add new characters to represent them.
    Nevertheless, assuming that Appendix A of "Traditional Mongolian Script" is
    correct in providing a mechanism for distinguishing four variant forms of the
    Mongolian Birga, is it acceptable to use the ZWJ as a variant selector (as is
    the case for the 4th variant Birga)? Its use here seems a little suspect to
    me.
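
    In practice a conformant process would presumably check each
    base-plus-selector pair against the list of standardized variants, roughly
    as in the sketch below. The LISTED set is a hand-made stand-in containing
    only pairs mentioned in this message; I have not verified it against the
    real Standardized Variants document, and the function names are
    hypothetical.

```python
FVS1, FVS2, FVS3 = "\u180B", "\u180C", "\u180D"
ZWJ = "\u200D"    # ZERO WIDTH JOINER
BIRGA = "\u1800"  # MONGOLIAN BIRGA

# ASSUMPTION: illustrative stand-in for the standardized variants list,
# built only from sequences mentioned earlier in this message.
LISTED = {
    ("\u1833", FVS1),  # D + FVS1
    ("\u182D", FVS2),  # G + FVS2
    ("\u182D", FVS3),  # G + FVS3
    ("\u1820", FVS1),  # A + FVS1
}

# The four Birga variants given in Appendix A of the UNU/IIST report.
BIRGA_VARIANTS = [(BIRGA, FVS1), (BIRGA, FVS2), (BIRGA, FVS3), (BIRGA, ZWJ)]


def is_listed(base: str, selector: str, listed: set) -> bool:
    """True if the (base, selector) pair appears in the supplied list."""
    return (base, selector) in listed


def unlisted(sequences: list, listed: set) -> list:
    """Return the pairs that a conformant process should not interpret."""
    return [pair for pair in sequences if not is_listed(*pair, listed)]


if __name__ == "__main__":
    for base, selector in unlisted(BIRGA_VARIANTS, LISTED):
        print(f"not listed: U+{ord(base):04X} U+{ord(selector):04X}")
```

    With this stand-in list all four Birga sequences are reported as unlisted,
    including the one that uses ZWJ.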

    Andrew


