RE: Generic base characters

From: Kent Karlsson (kent.karlsson14@comhem.se)
Date: Mon Jul 16 2007 - 16:54:12 CDT

    Kenneth Whistler wrote:
    > > Writing <DEVANAGARI LETTER MA, DEVANAGARI
    > > VOWEL SIGN O, DEVANAGARI VOWEL SIGN O, DEVANAGARI VOWEL SIGN O>
    > > may be an error (or maybe not), but the rendering engine+font
    > > should not care.
    >
    > Ah, but it might, precisely because the layout of Devanagari
    > is not as straightforward as English. And if the rendering is
    > making assumptions about aksara construction (as it might, in
    > order to do rendering correctly and to map into ligatures in

    I would not expect ligatures to work beyond the substrings that
    are commonly occurring in the script (and sometimes it does not
    even work for that, see the fj example).

    > fonts, and so on), it should detect a defective boundary in
    > that sequence and flag it for special treatment.

    And that happens also for plain ASCII text like this:
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    (that was a SINGLE, but rather long, row of x's, with no line
    break inside when I wrote it)

    > Now it is debatable exactly *what* that special treatment
    > should be.
    >
    > One reasonable position to take is that aksaras in a complex
    > Indic script outside the canonical structuring rules for an aksara
    > should be rendered with fallbacks that treat each combining
    > mark that doesn't "fit" as a separate layout unit:
    >
    > {MA + O}, {O}, {O}
    >
    > And then you have to decide how to display those two extra
    > matras that don't have an effective base (even though
    > formally, of course, MA is the base character of the
    > combining character sequence here).

    Much like <a, combining diaeresis, combining diaeresis,
    combining diaeresis>, though horizontal instead of vertical.

    > One option is to display each on a dotted circle.

    No, why?

    > Another option is to display each on a blank.

    No, why?

    > Another reasonable position to take would be that extra
    > matras for an aksara are intentional "misspellings" that
    > users might introduce for effect, and the rendering engine
    > ought to attempt to render them as part of an aksara,
    > either by joining them in sequence, or by default stacking
    > rules (depending on their placement, of course). But doing

    Indeed.

    > that requires extension to layout engines and an assessment
    > of whether the tradeoffs involved are worthwhile and in
    > the end are what users expect and require for their text layout.
    >
    > Also, the Devanagari example doesn't express the entire
    > complication here, because it limits itself to an example
    > where the solution involving extending the layout engine
    > (as opposed to fallback display of individual matras
    > on dotted circles or blanks) is easy to visualize.
    > What if the sequence were, instead:
    >
    > MA + I + I + I
    >
    > Then what? In Devanagari, the I-matra reorders to the left
    > around the MA (and possibly other units as well, if present).
    > So is the "reasonable" position now to treat this for
    > display as:
    >
    > {I + MA} + {I} + {I}

    No, why?

    > and use fallback display for the two extra matras?
    >
    > Or is the "reasonable" position to require indefinite
    > leftward reordering of the layout engine, to get:
    >
    > {I + I + I + MA}

    Surely. They should work like any character of combining
    class 224, i.e. stack to the left. As long as this stays within
    a line (with some not-too-small preset max), there should
    be no problem. (It would have been better to just give
    the reordrant vowels cc 224 rather than 0!)
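
    A minimal sketch of that left-stacking rule, for illustration
    only. The set LEFT_MATRAS and the function visual_order are
    hypothetical names, not part of any layout engine. The table has
    to be hand-made precisely because the real combining class of
    DEVANAGARI VOWEL SIGN I is 0:

```python
import unicodedata

# Assumption: a hand-made table of left-side (reordrant) matras.
# Only DEVANAGARI VOWEL SIGN I (U+093F) is listed here.
LEFT_MATRAS = {"\u093F"}

def visual_order(cluster: str) -> str:
    """Return one combining sequence in left-to-right display order."""
    out = []
    for ch in cluster:
        if ch in LEFT_MATRAS:
            out.insert(0, ch)  # stack to the left of everything so far
        else:
            out.append(ch)     # base and right-side marks keep their order
    return "".join(out)

MA, I = "\u092E", "\u093F"
for ch in visual_order(MA + I + I + I):
    print(unicodedata.name(ch))
```

    With <MA, I, I, I> as input this yields the order {I + I + I + MA}.
    A real engine would also need the preset maximum mentioned above.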

    > Or take the Devanagari example MA + O + O + O, but try it
    > in Bengali, instead. The Bengali -O matra is a two-part
    > vowel, with a left side *and* a right side piece. So
    > what is the "reasonable" display of Bengali:
    >
    > MA + O + O + O

    This O has a canonical decomposition into a left reordrant
    character and a right-side character. The left part would, as
    above, reorder like a combining class 224 character
    to the left of the combining sequence so far.
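
    As a rough sketch of this, assuming Python's standard unicodedata
    module: NFD splits BENGALI VOWEL SIGN O (U+09CB) into VOWEL SIGN E
    (U+09C7) and VOWEL SIGN AA (U+09BE), and the left part is then
    moved to the front. LEFT_PARTS and display_order are hypothetical
    names chosen for the example:

```python
import unicodedata

BENGALI_MA = "\u09AE"
BENGALI_O = "\u09CB"    # two-part vowel sign
BENGALI_E = "\u09C7"    # left part after canonical decomposition
BENGALI_AA = "\u09BE"   # right part after canonical decomposition

# Assumption: a minimal hand-made table of left-side parts.
LEFT_PARTS = {BENGALI_E}

def display_order(cluster: str) -> str:
    """Decompose two-part vowels, then stack left parts to the left."""
    out = []
    for ch in unicodedata.normalize("NFD", cluster):
        if ch in LEFT_PARTS:
            out.insert(0, ch)  # left of the combining sequence so far
        else:
            out.append(ch)
    return "".join(out)

result = display_order(BENGALI_MA + BENGALI_O * 3)
print([unicodedata.name(ch) for ch in result])
```

    For <MA, O, O, O> the result is three left parts, then MA, then
    three right parts.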

    Unfortunately, some of the later-encoded scripts with
    two-side vowels lack a decomposition into left-side and
    right-side characters for those vowels, so one will then
    need some other mechanism to represent the left
    and right parts (PUA code points or extra bits somewhere).
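
    One way the PUA route might look, purely as an illustration.
    KHMER VOWEL SIGN OO (U+17C4) is used as an example of a two-part
    vowel without a canonical decomposition. The table SPLIT, the
    PUA values U+E000 and U+E001, and the function split_two_part are
    all assumptions made up for this sketch:

```python
import unicodedata

KHMER_KA = "\u1780"
KHMER_OO = "\u17C4"  # two-part vowel sign, no canonical decomposition

# Hypothetical internal mapping: two-part vowel -> (left, right) stand-ins.
# The PUA code points are arbitrary and private to this sketch.
SPLIT = {KHMER_OO: ("\uE000", "\uE001")}

def split_two_part(text: str) -> str:
    """Apply NFD, then split the remaining two-part vowels via SPLIT."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        out.extend(SPLIT.get(ch, (ch,)))
    return "".join(out)

pieces = split_two_part(KHMER_KA + KHMER_OO)
print([f"U+{ord(ch):04X}" for ch in pieces])
```

    After this step the left stand-in can be reordered in the same
    way as a decomposed left part.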

    ...
    > Nor is the business of faithful rendering of Bengali two-part vowels
    > around Tibetan consonant stacks,

    Why should that be a problem in principle? Ignoring ligatures,
    which I would think should not happen cross-script, the
    reordering per se would depend only on the combining class
    (or the should-have-been combining class, for the Indic vowels
    that have cc 0), not on the script. Font boundaries may be a
    problem, though: the issue is keeping track of and getting the
    right font, not that fonts should handle the reordering. But we
    have the same problem for bidi.

    > I suspect what is really going on here, for Indic scripts,
    > at least, is that there are some nonstandard orthographic
    > extensions (that Sinnathurai Srivas, for example, has alluded
    > to for Tamil), in which users make use of a convention of
    > simply adding more right-side vowel matras to indicate
    > prosodic prolongation. This is the same kind of thing you
    > see all the time in Japanese manga, for example, or informally
    > in English and other alphabetic languages, to indicate
    > reeeeeeeally loooooong vowels.
    >
    > I think the reasonable thing for developers of layout engines
    > and fonts to do is to investigate this kind of orthographic
    > convention and user expectations about it, and then (perhaps)
    > adjust their implementations, so that if a user types
    > in MA + O + O + O + O and expects to just get a long stretched
    > out display of a sequence of O's to indicate this is "moooooooo",
    > instead of "mo", that this will just display correctly without
    > any particular fallbacks required. For the rightside combining marks,
    > at least, which both users and implementations treat a little
    > more like ordinary letters than other combining marks, that
    > might be a good extension to make.

    At the very least. But I don't think there is any real obstacle
    to going further than that.

            /kent k