Re: Printing and Displaying Dependent Vowels

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Mar 26 2004 - 13:12:01 EST

    From: "Antoine Leca" <Antoine10646@leca-marti.org>
    > It seems many are thinking about the section in 2.10, titled "Spacing Clones
    > of European Diacritical Marks". I read it as applying to diacritical marks
    > (and perhaps only European ones, but the distinction looks like blurry to
    > me). Beginning of 2.10 makes quite clear that diacritics is only one class
    > (the most important, though) of combining characters. Indic dependent vowels
    > are another.

    I answered you by saying "diacritics or vowel signs", but yes, it also
    includes dependent vowels when they are used to create what is more generally
    called "default grapheme clusters", which form a larger set than the set of
    "combining sequences" (made of a base character followed by combining
    characters).
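
    To make the distinction between these classes of combining characters more
    concrete, here is a small sketch of my own (assuming Python and its standard
    unicodedata module; the helper name profile is purely illustrative). It
    prints the general category and canonical combining class of a European
    diacritic, a Devanagari dependent vowel, and a Devanagari consonant:

```python
import unicodedata

def profile(ch):
    """Return (name, general category, canonical combining class) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

samples = [
    "\u0301",  # European diacritic: nonspacing mark
    "\u093E",  # Devanagari dependent vowel: spacing combining mark
    "\u0902",  # Devanagari anusvara: nonspacing mark
    "\u0915",  # Devanagari consonant KA: a base letter
]

for ch in samples:
    name, category, ccc = profile(ch)
    print(f"U+{ord(ch):04X} {name}: category={category}, combining class={ccc}")
```

    The dependent vowel is a mark like the diacritic, but it is a spacing mark
    rather than a nonspacing one.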

    Indic scripts are somewhat unusual in that their syllabic structure is
    decomposed into separate letters: a base consonant followed by a "combining"
    (this is not the proper Unicode term) vowel modifier. This differs from
    European alphabets (Latin, Greek, Cyrillic) and even from some Asian or
    African syllabaries (notably Hiragana/Katakana), where these grapheme clusters
    are (almost always) combining sequences coded with a base character and
    diacritics.

    But if one wants to show the isolated form of an Indic vowel, there's an
    orthographic convention to use a sort of "vowel order", i.e. a default
    consonant, in a way that also occurs in the Arabic and Hebrew scripts, where
    the default base vowel is coded with a base letter.

    Indic scripts offer several variations here because there are also half-forms
    for these vowels, which are not meant to be used in isolation but to
    complement a preceding syllable in the same grapheme cluster. It's hard to say
    which of these forms an author would like to present for these isolated
    dependent vowels because, as their name suggests, they normally depend on a
    preceding consonant.

    So the best way to represent these isolated dependent vowels would be to encode
    an empty/null base consonant to force the presentation of the dependent vowel.
    An Indic text would more probably use one base consonant and present all
    dependent vowels with that consonant. Trying to represent the isolated vowel
    creates a theoretical grapheme cluster, which is not part of the normal
    orthography of Indic-written words where these vowels are used.

    Another solution would be to code these Indic dependent vowels after the Indic
    letter A (for example after U+0905 DEVANAGARI LETTER A), because this letter
    also represents the default vowel implied by all other consonants.

    A sample with Devanagari could be: <अा> (U+0905 LETTER A, U+093E VOWEL SIGN AA)
    which should normally be presented like the precomposed: <आ> (U+0906 LETTER AA),
    but which incorrectly displays the dotted circle with the "Mangal" font.
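
    As a quick check (a sketch of my own, assuming Python's standard unicodedata
    module and the normalization data it ships with), one can compare the
    two-character sequence with the precomposed letter under each normalization
    form. The variable names are only illustrative:

```python
import unicodedata

sequence = "\u0905\u093E"   # LETTER A followed by VOWEL SIGN AA
precomposed = "\u0906"      # LETTER AA

# Compare the two encodings under every normalization form.
results = {
    form: unicodedata.normalize(form, sequence) == unicodedata.normalize(form, precomposed)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, same in results.items():
    print(f"{form}: {'equal' if same else 'different'}")

equivalent = any(results.values())
```

    No normalization form maps one encoding to the other, so the two remain
    distinct strings, and whether they look alike on screen is left entirely to
    the font and the renderer.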

    So an author has to make some notational compromises here. But still, I do think
    that using NBSP as this empty/null base consonant before the dependent vowel
    will create the intended Unicode default grapheme cluster. Then it's up to the
    font or renderer to show the NBSP+vowel cluster properly, without the dotted
    circle, but it's not a problem of Unicode itself.
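
    The idea of NBSP acting as a carrier can be sketched as follows. This is a
    deliberately simplified grouping of my own (assuming Python's unicodedata
    module): every character whose general category starts with "M" is attached
    to the character before it. It is not the full default grapheme cluster
    algorithm, and the function name simple_clusters is hypothetical:

```python
import unicodedata

def simple_clusters(text):
    """Attach each mark (general category M*) to the preceding character.

    Simplified illustration only, not the complete Unicode segmentation rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # mark joins the current cluster
        else:
            clusters.append(ch)     # any other character starts a new cluster
    return clusters

NBSP = "\u00A0"
VOWEL_SIGN_AA = "\u093E"

isolated = simple_clusters(NBSP + VOWEL_SIGN_AA)
print(len(isolated), [f"U+{ord(c):04X}" for c in isolated[0]])
```

    Under this simplification the NBSP and the vowel sign form one unit, which
    is what the renderer is expected to display without a dotted circle.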

    With NBSP, you get this result: < ा> (U+00A0 NBSP, U+093E VOWEL SIGN AA)
    which often shows a square, probably because many fonts don't have a glyph for
    the isolated form of the vowel sign.

    It is true that this looks like a problem because the dotted circle should not
    appear here after showing the NBSP character (because it creates a single
    grapheme cluster that should be recognized as such, even if this cluster
    contains two combining sequences as it contains two base characters), but the
    problem is in the Mangal font itself (or in the UniScribe engine in Windows),
    not in Unicode.

    In fact, you could just as well ask how to represent an isolated form of other
    Indic combining characters like an anusvara or candrabindu, but here too
    Unicode specifies that they should be coded after a space or preferably an NBSP:
    < > (NBSP), < ं> (NBSP, ANUSVARA), < ँ> (NBSP, CANDRABINDU), < ः> (NBSP,
    VISARGA)

    If dotted circles appear before the symbol, or if the symbol is shown with a
    square box for a missing glyph, it's not the fault of Unicode. So the best way
    would be to use a "normal" Indic base character, such as in:
    <अ> (LETTER A), <अं> (LETTER A, ANUSVARA), <अँ> (LETTER A, CANDRABINDU), <अः>
    (LETTER A, VISARGA)
    where the sequences look more familiar with the "normal" Devanagari orthographic
    and calligraphic rendering rules implemented in usual fonts.
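
    To compare the two conventions side by side, here is a small sketch of my
    own (assuming Python's unicodedata module; the names CARRIERS, SIGNS and
    describe are only illustrative) that builds each sign on both carriers and
    lists the code points involved:

```python
import unicodedata

# The two candidate carriers discussed above.
CARRIERS = {"NBSP": "\u00A0", "LETTER A": "\u0905"}

# Anusvara, candrabindu and visarga.
SIGNS = ["\u0902", "\u0901", "\u0903"]

def describe(text):
    """List the code points of a string with their Unicode character names."""
    return ", ".join(f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text)

for label, base in CARRIERS.items():
    for sign in SIGNS:
        sequence = base + sign
        print(f"{label}: <{sequence}> = {describe(sequence)}")
```

    How each sequence is actually drawn still depends on the font and the
    rendering engine in use.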

    > Also, something which is probably very relevant to Avarangal, fact is the
    > implementation from a major vendor in the field, Microsoft Uniscribe, does
    > retain the dotted circle (if present in the font; if not, you would probably
    > get the .missing glyph instead).

    I'm not sure that UniScribe is the cause of this problem. There just appears to
    be no GSUB rule in some fonts like Mangal to handle the case of NBSP followed
    by an Indic vowel sign or combining character and map them to a single glyph
    without the default dotted circle. So UniScribe renders the glyphs it can find
    for the separate codepoints, without detecting a "ligature" in that font which
    would have allowed this dotted circle to be omitted.
    I'm not an expert in UniScribe programming, but there may be some Indic
    features in Indic fonts which can be enabled in UniScribe to change the
    rendering behavior by including some additional (optional) GSUB/GPOS tables
    found in the OpenType font in the rendering process.

    What I can see in the Mangal font shipped with Windows XP for example is that it
    contains many OpenType features for the Devanagari script in the default language
    system: nukt, akhn, rphf, blwf, half, vatu, pres, abvs, blws, psts, haln, abvm,
    blwm.
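
    For reference, here is a small lookup table for these tags, written as a
    Python sketch. The friendly names are given as I understand them from the
    OpenType feature tag registry, and the GSUB/GPOS split is my own reading
    (only the two mark positioning features are treated as GPOS), so both
    should be checked against the registry and the font itself:

```python
# Feature tags listed for Mangal, in the order given above.
MANGAL_FEATURES = ["nukt", "akhn", "rphf", "blwf", "half", "vatu", "pres",
                   "abvs", "blws", "psts", "haln", "abvm", "blwm"]

# Friendly names (assumed from the OpenType feature tag registry).
FEATURE_NAMES = {
    "nukt": "Nukta Forms",
    "akhn": "Akhand",
    "rphf": "Reph Form",
    "blwf": "Below-base Forms",
    "half": "Half Forms",
    "vatu": "Vattu Variants",
    "pres": "Pre-base Substitutions",
    "abvs": "Above-base Substitutions",
    "blws": "Below-base Substitutions",
    "psts": "Post-base Substitutions",
    "haln": "Halant Forms",
    "abvm": "Above-base Mark Positioning",
    "blwm": "Below-base Mark Positioning",
}

# Positioning features live in GPOS; the others are substitutions in GSUB.
GPOS_FEATURES = {"abvm", "blwm"}

def table_for(tag):
    """Return the OpenType table a feature tag is assumed to belong to."""
    return "GPOS" if tag in GPOS_FEATURES else "GSUB"

for tag in MANGAL_FEATURES:
    print(f"{tag}: {FEATURE_NAMES[tag]} ({table_for(tag)})")
```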

    There isn't much detail about what these feature IDs mean on the help link that
    is provided in the font properties help. But some tools may exist to explain
    what they mean and how they are enabled, and whether there's a reference
    repository of these optional "features" for use in applications.
    What I know about them is that they allow changing the rendering with optional
    substitution and positioning tables (from codepoints to glyph IDs), for nuktas,
    half forms, and alternate presentation forms such as the reph and vattu forms
    of the letter RA.

    But simple applications that don't enable these features by default will not use
    these additional tables, and so will use a "reasonable" default rendering.
    Maybe one of these features needs to be enabled explicitly by applications to
    remove these dotted circles (this probably requires specific GUI options in
    applications like text editors or word processors to record their use in the
    rich-text document, but I don't know how to enable these features in rich text
    formats like XHTML, even with CSS stylesheets).

    What is clear is that there's no way to enable these features explicitly in
    plain-text files, as long as there's no standard format control in Unicode to
    enable these OpenType font features. Maybe these could become new "characters"
    to allocate in plane 14?


