Re: Mongolian Unicoding (was Re: Cuneiform Free Variation Selectors)

From: Asmus Freytag (
Date: Tue Jan 20 2004 - 03:36:54 EST


    Just a few comments on Andrew's note:

    At 06:43 AM 1/19/2004, Andrew C. West wrote:
    >An analogy for those not familiar with the Mongolian script is the much
    >long s, which is a positional glyph variant of the ordinary letter s for some
    >languages at some periods of time. The long s does not need to be encoded as a
    >separate character as there are well-known rules for when an s should be
    >long and when it should be written short (although these rules may vary from
    >locale to locale and from time to time).

    There are 'rules' and 'rules'. In general, if there are simplistic rules,
    like the ones Andrew suggests below, software can (and should) be used to
    make the glyph selection. However, some languages have rules that are based
    not on the character, but on the meaning and use of the word in question.
    That raises the complexity of the task to the level of hyphenation or worse,
    and it's better to let the user make the decision from the start.

    > If, for example, the rule for a given locale is short s finally and
    >medially after another s, and long s initially and medially except after
    >another s, then the user could type in a word using the ordinary letter s
    >throughout, and the rendering system would select the long or short s
    >glyph as appropriate depending on its position within the word. But say
    >that the user wanted to go against the rendering rules, and write a long s
    >in a position that is normally rendered as a short s, or if he wanted to
    >refer to the long s in isolation, then this is where an FVS would come in.
    >The FVS could be applied to the letter s to override its normal glyph
    >shape, and force a long s even where the rules state that it should be a
    >short s (and vice versa for

    Currently, Variation Selectors work only one way. You can 'force' one
    shape. Leaving the VS off gives you no restriction, leaving the software
    free to give you either shape. Without defining the use of two VSs you
    cannot 'force' the 'regular' shape. Also, the way most VSs are defined,
    their use does not depend on context the same way as the example suggests.
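
    To make the one-way point concrete, here is a minimal sketch of the
    rule Andrew describes, with a single override that can only force the
    long shape. It is an illustration, not a description of any real
    renderer: the function name shape_s is made up, and U+FE00 (VS1) is
    used purely as a hypothetical stand-in, since no variation sequence
    for the letter s is defined.

```python
LONG_S = "\u017F"      # LATIN SMALL LETTER LONG S, used here to show the glyph choice
FORCE_LONG = "\uFE00"  # VS1 as a hypothetical "force long s" marker (assumption)

def shape_s(text):
    """Apply the quoted rule: short s finally and after another s,
    long s initially and medially otherwise. An s followed by
    FORCE_LONG is always shown long."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == "s":
            if nxt == FORCE_LONG:
                out.append(LONG_S)   # user override wins over the rule
                i += 2               # consume the selector as well
                continue
            is_final = not nxt.isalpha()
            after_s = i > 0 and text[i - 1] == "s"
            out.append("s" if (is_final or after_s) else LONG_S)
        elif ch != FORCE_LONG:       # a stray selector is simply not displayed
            out.append(ch)
        i += 1
    return "".join(out)

print(shape_s("possess"))            # rule-driven shapes
print(shape_s("s" + FORCE_LONG))     # long s forced in isolation
```

    With only one selector there is no way to ask for a short s in a word
    like "sea", where the rule picks the long shape; that would take a
    second, separately defined selector.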

    >Now the Latin alphabet only has this one example (as far as I know) of a
    >letter that has positional or contextual variant forms, and so it is
    >simpler to just encode the long s separately. However, almost every letter
    >in Mongolian and its related scripts has at least two positional and/or
    >contextual forms, and some letters have up to four or five glyph forms.
    >Encoding all the various glyph forms of each letter separately would be an
    >unnecessary burden on the user, who would have to manually select the
    >correct glyph form for each letter even though they are conceived of as
    >the same letter. It is far simpler (for the end-user at least) to let the
    >rendering engine apply a set of rules to determine which glyph form is
    >required in which position (isolate, initial, medial or final) or in which
    >context (e.g. in "feminine" or "masculine" words). As Asmus pointed out
    >the Mongolian FVSs would normally only be needed to override the rules,
    >for example to display a particular glyph form in isolation (e.g. in
    >metalanguage descriptions of the Mongolian script), or to write foreign
    >words (which in Mongolian typically use unexpected glyph forms for certain
    >letters); and so in normal running text with no foreign words the user
    >would rarely need to use an FVS (and with a good IME the user probably
    >wouldn't even need to know of their

    The main difference between Latin and (positional) shaping in Arabic and
    (positional and gender context) shaping in Mongolian is the fact that the
    rules are deterministic and based on (nearby) context. All cases that
    don't follow the deterministic rules must be marked by the user with
    appropriate characters. For example, the use of ZWNJ to interrupt cursive
    connectedness in Arabic.
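
    As a small illustration of such a user-supplied mark, the sketch below
    builds the same Arabic-script letter sequence with and without ZWNJ
    (U+200C). The sample is a Persian word that is commonly written with a
    ZWNJ after its prefix; the helper name without_zwnj is made up for this
    example, and nothing here performs actual shaping.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# The same letters, stored once without and once with the user-entered mark.
joined = "\u0645\u06CC\u062E\u0648\u0627\u0647\u0645"
marked = "\u0645\u06CC" + ZWNJ + "\u062E\u0648\u0627\u0647\u0645"

def without_zwnj(text):
    """Drop the joining control, leaving only the letters."""
    return text.replace(ZWNJ, "")

print(unicodedata.name(ZWNJ))
print(len(joined), len(marked))        # the mark is one extra code point
print(without_zwnj(marked) == joined)  # the letters themselves are identical
```

    The letters are the same code points in both strings; only the extra
    control character records the decision that the deterministic joining
    rule should not apply at that spot.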

    A VS approach is potentially indicated when it's necessary to manually
    select non-deterministic variants (or to override deterministic ones) and
    at the same time it's desired to use the same base character code to carry
    the same base meaning all the time (which the long s does not do). Long s
    and final sigma in Greek can be handled as exceptions by software, like
    text to speech, that needs to know about the 's'-ness of the character
    independent of its shape. It's possible, since there is only a single
    exception in the script.
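
    Handling these as exceptions can be as small as a two-entry table. The
    sketch below is one way a process such as text to speech might fold the
    separately encoded shapes back to their base letters before looking at
    the text; the names SHAPE_EXCEPTIONS and fold_shape are made up for this
    example.

```python
# One exception per script: the shape-specific code point and its base letter.
SHAPE_EXCEPTIONS = {
    "\u017F": "s",       # LATIN SMALL LETTER LONG S  -> s
    "\u03C2": "\u03C3",  # GREEK SMALL LETTER FINAL SIGMA -> sigma
}

def fold_shape(text):
    """Replace shape-specific characters with their base letters."""
    return "".join(SHAPE_EXCEPTIONS.get(ch, ch) for ch in text)

print(fold_shape("Congre\u017Fs"))
print(fold_shape("\u03BB\u03CC\u03B3\u03BF\u03C2"))
```

    A table like this stays manageable only because each script contributes
    a single entry; it would not scale to a script where nearly every letter
    has several shapes.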

    One requirement for the use of variation selectors is that the script
    otherwise is a 'complex' one. Complex scripts have specific layout rules
    that software needs to support, and different software packages already
    need to agree on these rules in a closely similar way; otherwise no
    documents could be interchanged (this is true for the pseudo-script
    Mathematical Notation, which also uses VS).

    Without such agreements, the results would be unpredictable. For Andrew's
    example, since most Latin-based software does not support shaping rules
    for the long s, documents that relied on exchanging it would have to
    either mark up *every* occurrence with a VS or risk being
    non-interchangeable. At that point, coding a separate character makes
    more sense.

    Chinese ideographs don't quite fit either Andrew's example or my reply -
    the nature of the problem is different due to both the large set of base
    characters and the (potentially) large number of (non-deterministic)
    variations -- together with the fact that ignoring the variation in
    display and processing while retaining information about it in the code
    might be the thing to do. (None of the other scripts have those sorts of
    issues).
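
    Ignoring the variation in processing while keeping it in the stored text
    could look like the sketch below, which compares strings with variation
    selectors skipped. The function names are made up, the list of ranges is
    illustrative rather than exhaustive, and the sample sequence is only a
    base ideograph followed by VS17; nothing is assumed about whether that
    particular sequence is registered anywhere.

```python
def is_variation_selector(ch):
    """True for the variation selector ranges considered in this sketch."""
    cp = ord(ch)
    return (
        0xFE00 <= cp <= 0xFE0F         # VS1..VS16
        or 0xE0100 <= cp <= 0xE01EF    # VS17..VS256
        or 0x180B <= cp <= 0x180D      # Mongolian FVS1..FVS3
    )

def strip_vs(text):
    """Return the text with variation selectors removed."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

def same_base(a, b):
    """Compare two strings while ignoring variation selectors."""
    return strip_vs(a) == strip_vs(b)

plain = "\u8FBB"
with_variant = "\u8FBB\U000E0100"      # same ideograph plus VS17

print(same_base(plain, with_variant))  # equal for searching or sorting
print(plain == with_variant)           # still distinct as stored
```

    The stored string keeps the selector, so a system that does care about
    the variant can still honour it.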

