RE: Generic base characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 16 2007 - 15:29:35 CDT

    O.k., I was going to stay out of this, but...

    > > Authors should not have
    > > an expectation of portably exchanging buggy text with perfect
    >
    > What is "buggy text" (from a rendering engine+font point of view)?
    > Can you give me an example in English?

    No. Because English, per se, doesn't have a complex orthography.
    It makes little use of combining marks, and the basic rendering
    is just left-to-right spacing characters in a row. There is very
    little you can do in English itself that would confuse a
    rendering engine. You can veer off into total gibberish
    in ASCII, but still the rendering won't meet any edge conditions.

    > Writing moooose may be an
    > error (or maybe not), but the rendering engine+font should not care.
    > It may be a spelling error, but it is in no way "buggy text" at
    > the rendering level.

    Correct.

    > Writing <DEVANAGARI LETTER MA, DEVANAGARI
    > VOWEL SIGN O, DEVANAGARI VOWEL SIGN O, DEVANAGARI VOWEL SIGN O>
    > may be an error (or maybe not), but the rendering engine+font
    > should not care.

    Ah, but it might, precisely because the layout of Devanagari
    is not as straightforward as English. And if the rendering is
    making assumptions about aksara construction (as it might, in
    order to do rendering correctly and to map into ligatures in
    fonts, and so on), it should detect a defective boundary in
    that sequence and flag it for special treatment.

    Now it is debatable exactly *what* that special treatment
    should be.

    One reasonable position to take is that sequences in a complex
    Indic script that fall outside the canonical structuring rules
    for an aksara should be rendered with fallbacks that treat each
    combining mark that doesn't "fit" as a separate layout unit:

    {MA + O}, {O}, {O}

    And then you have to decide how to display those two extra
    matras that don't have an effective base (even though
    formally, of course, MA is the base character of the
    combining character sequence here). One option is to
    display each on a dotted circle. Another option is to
    display each on a blank.
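
    As a rough illustration of that first position, here is a minimal
    Python sketch. It is not how any real layout engine works: the
    function names are made up for the example, and it assumes that an
    aksara is just one consonant (U+0915..U+0939) plus at most one
    dependent vowel sign (U+093E..U+094C), ignoring viramas, nuktas,
    and every other part of real Devanagari syllable structure.

```python
DOTTED_CIRCLE = "\u25CC"

def is_consonant(ch):
    # Assumption: only the basic consonants KA..HA count as bases.
    return 0x0915 <= ord(ch) <= 0x0939

def is_matra(ch):
    # Assumption: only the dependent vowel signs AA..AU count as matras.
    return 0x093E <= ord(ch) <= 0x094C

def layout_units(text):
    """Split text into layout units; a matra that does not fit the
    preceding unit becomes a unit of its own."""
    units = []
    accepts_matra = False
    for ch in text:
        if is_matra(ch) and accepts_matra:
            units[-1] += ch          # the one matra the aksara can take
            accepts_matra = False
        else:
            units.append(ch)         # new base, or an orphaned matra
            accepts_matra = is_consonant(ch)
    return units

def fallback_display(units, placeholder=DOTTED_CIRCLE):
    """Give each orphaned matra a visible placeholder base."""
    return [placeholder + u if is_matra(u[0]) else u for u in units]

MA, SIGN_O = "\u092E", "\u094B"
units = layout_units(MA + SIGN_O + SIGN_O + SIGN_O)
shown = fallback_display(units)
```

    Here units comes out as {MA + O}, {O}, {O}. Passing a space as
    the placeholder instead of U+25CC gives the "display on a blank"
    option.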

    Another reasonable position to take would be that extra
    matras for an aksara are intentional "misspellings" that
    users might introduce for effect, and the rendering engine
    ought to attempt to render them as part of an aksara,
    either by joining them in sequence, or by default stacking
    rules (depending on their placement, of course). But doing
    that requires extension to layout engines and an assessment
    of whether the tradeoffs involved are worthwhile and in
    the end are what users expect and require for their text layout.

    Also, the Devanagari example doesn't express the entire
    complication here, because it limits itself to an example
    where the solution involving extending the layout engine
    (as opposed to fallback display of individual matras
    on dotted circles or blanks) is easy to visualize.
    What if the sequence were, instead:

    MA + I + I + I

    Then what? In Devanagari, the I-matra reorders to the left
    around the MA (and possibly other units as well, if present).
    So is the "reasonable" position now to treat this for
    display as:

    {I + MA} + {I} + {I}

    and use fallback display for the two extra matras?

    Or is the "reasonable" position to require indefinite
    leftward reordering by the layout engine, to get:

    {I + I + I + MA}

    I think in that case that reworking the rendering engine to
    keep reordering as many left-side matras as it encounters
    would in fact be unreasonable. It would be complicated and
    messy, *and* it likely would not match user expectations
    in any case.
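
    The alternative, {I + MA} + {I} + {I}, reorders only the one
    matra the aksara can actually take and falls back for the rest.
    A small Python sketch of that idea follows. The names and the
    tuple representation (glyph labels in left-to-right visual
    order) are invented for the example, and the set of left-side
    matras is cut down to just "I".

```python
LEFT_SIDE = {"I"}              # assumption: matras drawn left of the base
PLACEHOLDER = "DOTTED CIRCLE"  # fallback base for matras that don't fit

def place(base, matra):
    """Visual order for one base and one matra."""
    return (matra, base) if matra in LEFT_SIDE else (base, matra)

def visual_units(base, matras):
    """The first matra attaches to the real base (reordered if it is
    a left-side matra); every further matra gets a placeholder base."""
    if not matras:
        return [(base,)]
    units = [place(base, matras[0])]
    units += [place(PLACEHOLDER, m) for m in matras[1:]]
    return units

units = visual_units("MA", ["I", "I", "I"])
```

    No unit ever holds more than one reordered matra, so the engine
    never needs open-ended leftward reordering.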

    Or take the Devanagari example MA + O + O + O, but try it
    in Bengali, instead. The Bengali -O matra is a two-part
    vowel, with a left side *and* a right side piece. So
    what is the "reasonable" display of Bengali:

    MA + O + O + O

    Hmmm?
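
    The two-part nature of the Bengali vowel shows up directly in
    the character data: U+09CB BENGALI VOWEL SIGN O canonically
    decomposes into U+09C7 (the left piece) followed by U+09BE (the
    right piece). A short Python check using the standard
    unicodedata module follows; the helper name is just for
    illustration.

```python
import unicodedata

BENGALI_MA = "\u09AE"
BENGALI_SIGN_O = "\u09CB"

def decomposed_names(text):
    """Character names of the canonical decomposition (NFD) of text."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", text)]

pieces = decomposed_names(BENGALI_SIGN_O)
sequence = decomposed_names(BENGALI_MA + BENGALI_SIGN_O * 3)
```

    For the MA + O + O + O sequence, that gives one base followed by
    three left pieces and three right pieces, interleaved, which is
    exactly what the layout engine would then have to arrange
    around the MA.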

    > It may be a spelling error, but it is in no way
    > "buggy text" at the rendering level. There is a problem with
    > above/below combining characters in that proper stacking will
    > quickly go outside of the line or even page boundary. But that
    > is a problem of a different kind, and not buggy text per se.

    I think you are confusing Asmus' use of the term "buggy text"
    with some notion of "illegal text".

    In Unicode, I think we are all in agreement, there basically
    is no such thing as illegal text sequences (as long as you
    stick to valid Unicode graphic code points, and stay away from
    noncharacters and surrogate code points). The result may
    be utter gibberish, but it isn't "illegal".
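
    The two exclusions mentioned there are easy to test for. A
    minimal Python sketch, with function names invented for the
    example, follows. It checks only for surrogates (U+D800..U+DFFF)
    and noncharacters (U+FDD0..U+FDEF, plus the last two code
    points of every plane); it does not check whether a code point
    is actually assigned or graphic.

```python
def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp):
    # U+FDD0..U+FDEF, and U+nFFFE / U+nFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def avoids_excluded_code_points(code_points):
    """True if every value is in the Unicode code space and is neither
    a surrogate nor a noncharacter."""
    return all(
        0 <= cp <= 0x10FFFF and not is_surrogate(cp) and not is_noncharacter(cp)
        for cp in code_points
    )

# MA + O + O + O passes: odd, perhaps, but not "illegal".
ok = avoids_excluded_code_points([0x092E, 0x094B, 0x094B, 0x094B])
```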

    But even marginally comprehensible departures from standard
    orthographic rules can create serious problems for layout
    engines in complex scripts. (See examples above.) So it
    is relatively easy to come up with sequences in Indic
    languages, for example, that will trigger fallback behavior
    in even the most sophisticated of layout engines -- and in
    that sense could be considered "buggy text". Either the text
    sequence will hit some unexpected edge condition in the
    layout engine, exercising an actual bug in the layout engine
    itself, or it will be handled by *some* kind of fallback as
    the only reasonable alternative when some limit is hit --
    and at that point, the outcome is likely to be viewed as
    "buggy display" by users of the script in question.

    And much earlier in this thread (or the preceding threads that
    led to this thread) I pointed out that expecting layout engines
    to gracefully handle *cross*-script combinations of bases
    and combining marks in complex scripts was an unreasonable
    expectation. So such sequences, while in no sense illegal
    in Unicode, would also constitute "buggy text" for which
    layout engines won't do much other than fallback
    display of the combining marks.

    > > fidelity, so making them aware of the problem leads to more
    > > robust interchange.
    >
    > Which problem. Indicating such things as spelling errors is not
    > the business of the low level text renderer.

    Nor is it the renderer's business to faithfully render Bengali
    two-part vowels around Tibetan consonant stacks, if that happens
    to be the "spelling error" in question.

    I suspect what is really going on here, for Indic scripts,
    at least, is that there are some nonstandard orthographic
    extensions (that Sinnathurai Srivas, for example, has alluded
    to for Tamil), in which users make use of a convention of
    simply adding more right-side vowel matras to indicate
    prosodic prolongation. This is the same kind of thing you
    see all the time in Japanese manga, for example, or informally
    in English and other alphabetic languages, to indicate
    reeeeeeeally loooooong vowels.

    I think the reasonable thing for developers of layout engines
    and fonts to do is to investigate this kind of orthographic
    convention and user expectations about it, and then (perhaps)
    adjust their implementations, so that if a user types
    in MA + O + O + O + O and expects to just get a long stretched
    out display of a sequence of O's to indicate this is "moooooooo",
    instead of "mo", it will just display correctly without
    any particular fallbacks required. For the right-side combining marks,
    at least, which both users and implementations treat a little
    more like ordinary letters than other combining marks, that
    might be a good extension to make.

    --Ken


