RE: Generic base characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 16 2007 - 15:29:35 CDT

Next message: Kent Karlsson: "RE: Generic base characters"

Previous message: Michael Maxwell: "RE: Generic base characters"
Maybe in reply to: Peter Constable: "RE: Generic base characters"
Next in thread: Raymond Mercier: "Re: Generic base characters"
Reply: Raymond Mercier: "Re: Generic base characters"
Reply: Kent Karlsson: "RE: Generic base characters"

O.k., I was going to stay out of this, but...

> > Authors should not have
> > an expectation of portably exchanging buggy text with perfect
>
> What is "buggy text" (from a rendering engine+font point of view)?
> Can you give me an example in English?

No. Because English, per se, doesn't have a complex orthography.
It makes little use of combining marks, and the basic rendering
is just left-to-right spacing characters in a row. There is very
little you can do in English itself that would confuse a
rendering engine. You can veer off into total gibberish
in ASCII, but still the rendering won't meet any edge conditions.

> Writing moooose may be an
> error (or maybe not), but the rendering engine+font should not care.
> It may be a spelling error, but it is in no way "buggy text" at
> the rendering level.

Correct.

> Writing <DEVANAGARI LETTER MA, DEVANAGARI
> VOWEL SIGN O, DEVANAGARI VOWEL SIGN O, DEVANAGARI VOWEL SIGN O>
> may be an error (or maybe not), but the rendering engine+font
> should not care.

Ah, but it might, precisely because the layout of Devanagari
is not as straightforward as English. And if the rendering is
making assumptions about aksara construction (as it might, in
order to do rendering correctly and to map into ligatures in
fonts, and so on), it should detect a defective boundary in
that sequence and flag it for special treatment.

Now it is debatable exactly *what* that special treatment
should be.

One reasonable position to take is that aksaras in a complex
Indic script outside the canonical structuring rules for an aksara
should be rendered with fallbacks that treat each combining
mark that doesn't "fit" as a separate layout unit:

{MA + O}, {O}, {O}

And then you have to decide how to display those two extra
matras that don't have an effective base (even though
formally, of course, MA is the base character of the
combining character sequence here). One option is to
display each on a dotted circle. Another option is to
display each on a blank.
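
The first fallback position can be sketched in a few lines of Python.
This is only an illustration under loud assumptions, not how any real
layout engine works: it pretends an aksara is one letter followed by at
most one vowel sign, it recognizes vowel signs by their character names,
and the helper names (is_vowel_sign, fallback_clusters) are made up for
this example.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def is_vowel_sign(ch):
    # Assumption of this sketch: a matra is a combining mark (Mc or Mn)
    # whose Unicode character name contains "VOWEL SIGN".
    return (unicodedata.category(ch) in ("Mc", "Mn")
            and "VOWEL SIGN" in unicodedata.name(ch, ""))

def fallback_clusters(text):
    """Split text into layout units; a matra that does not "fit" the
    preceding unit becomes its own unit on a dotted circle."""
    clusters = []
    accepts_matra = False  # True only while the last unit is a bare letter
    for ch in text:
        if is_vowel_sign(ch):
            if accepts_matra:
                clusters[-1] += ch                    # {MA + O}
            else:
                clusters.append(DOTTED_CIRCLE + ch)   # {O} on a dotted circle
            accepts_matra = False
        else:
            clusters.append(ch)
            accepts_matra = unicodedata.category(ch) == "Lo"
    return clusters

# DEVANAGARI LETTER MA followed by three DEVANAGARI VOWEL SIGN O
units = fallback_clusters("\u092E\u094B\u094B\u094B")
print(len(units))  # 3 units: {MA + O}, {O}, {O}
```

Swapping DOTTED_CIRCLE for a no-break space would give the "display
each on a blank" option instead.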

Another reasonable position to take would be that extra
matras for an aksara are intentional "misspellings" that
users might introduce for effect, and the rendering engine
ought to attempt to render them as part of an aksara,
either by joining them in sequence, or by default stacking
rules (depending on their placement, of course). But doing
that requires extension to layout engines and an assessment
of whether the tradeoffs involved are worthwhile and in
the end are what users expect and require for their text layout.

Also, the Devanagari example doesn't express the entire
complication here, because it limits itself to an example
where the solution involving extending the layout engine
(as opposed to fallback display of individual matras
on dotted circles or blanks) is easy to visualize.
What if the sequence were, instead:

MA + I + I + I

Then what? In Devanagari, the I-matra reorders to the left
around the MA (and possibly other units as well, if present).
So is the "reasonable" position now to treat this for
display as:

{I + MA} + {I} + {I}

and use fallback display for the two extra matras?

Or is the "reasonable" position to require indefinite
leftward reordering of the layout engine, to get:

{I + I + I + MA}

I think in that case that reworking the rendering engine to
keep reordering as many left-side matras as it encounters
would in fact be unreasonable. It would be complicated and
messy, *and* it likely would not match user expectations
in any case.
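
To make the two candidate treatments concrete, here is a small Python
sketch that produces both glyph orders for MA + I + I + I. It is a toy
under stated assumptions: the input is a single base followed only by
matras, the only left-side matra it knows about is DEVANAGARI VOWEL
SIGN I, the fallback base is a dotted circle, and the function name
visual_units is hypothetical. The returned strings are in visual
(glyph) order for illustration only; they are not meant to be stored
as text.

```python
LEFT_SIDE_MATRAS = {"\u093F"}   # DEVANAGARI VOWEL SIGN I (the only one handled here)
DOTTED_CIRCLE = "\u25CC"

def place(base, matras):
    # Put left-side matras before the base and everything else after it.
    left = [m for m in matras if m in LEFT_SIDE_MATRAS]
    right = [m for m in matras if m not in LEFT_SIDE_MATRAS]
    return "".join(left) + base + "".join(right)

def visual_units(text, reorder_all=False):
    """text is one base followed by matras, in logical order."""
    base, matras = text[0], list(text[1:])
    if reorder_all:
        # Indefinite leftward reordering: {I + I + I + MA}
        return [place(base, matras)]
    # Reorder only the first matra; the extras get fallback display:
    # {I + MA} + {I} + {I}
    units = [place(base, matras[:1])]
    for extra in matras[1:]:
        units.append(place(DOTTED_CIRCLE, [extra]))
    return units

ma_iii = "\u092E\u093F\u093F\u093F"
print(visual_units(ma_iii))                    # three units
print(visual_units(ma_iii, reorder_all=True))  # one unit
```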

Or take the Devanagari example MA + O + O + O, but try it
in Bengali, instead. The Bengali -O matra is a two-part
vowel, with a left side *and* a right side piece. So
what is the "reasonable" display of Bengali:

MA + O + O + O

Hmmm?
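
The two-part nature of the Bengali vowel is visible in the character
data itself: BENGALI VOWEL SIGN O (U+09CB) has a canonical decomposition
into the left-side sign E plus the right-side sign AA. A quick way to
see this, using only Python's standard unicodedata module (the helper
name decompose is just for this example):

```python
import unicodedata

BENGALI_MA = "\u09AE"       # BENGALI LETTER MA
BENGALI_SIGN_O = "\u09CB"   # BENGALI VOWEL SIGN O

def decompose(text):
    # Canonical decomposition (NFD) splits the two-part vowel sign.
    return unicodedata.normalize("NFD", text)

for ch in decompose(BENGALI_SIGN_O):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+09C7 BENGALI VOWEL SIGN E
# U+09BE BENGALI VOWEL SIGN AA

# MA + O + O + O is canonically equivalent to MA followed by three
# (left piece, right piece) pairs, so a layout engine has three
# left-side pieces and three right-side pieces to place somewhere.
print(len(decompose(BENGALI_MA + BENGALI_SIGN_O * 3)))  # 7
```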

> It may be a spelling error, but it is in no way
> "buggy text" at the rendering level. There is a problem with
> above/below combining characters in that proper stacking will
> quickly go outside of the line or even page boundary. But that
> is a problem of a different kind, and not buggy text per se.

I think you are confusing Asmus' use of the term "buggy text"
with some notion of "illegal text".

In Unicode, I think we are all in agreement, there basically
is no such thing as illegal text sequences (as long as you
stick to valid Unicode graphic code points, and stay away from
noncharacters and surrogate code points). The result may
be utter gibberish, but it isn't "illegal".
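
For what it is worth, the two exceptions mentioned above are simple to
test for. The following Python sketch checks only for surrogate code
points and noncharacters; it does not check whether a code point is an
assigned graphic character, and the function names are invented for
this example.

```python
def is_surrogate(cp):
    # Surrogate code points: U+D800..U+DFFF
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp):
    # Noncharacters: U+FDD0..U+FDEF, plus the last two code points of
    # every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def problem_code_points(text):
    """Return the code points in text that are surrogates or noncharacters."""
    return [ord(ch) for ch in text
            if is_surrogate(ord(ch)) or is_noncharacter(ord(ch))]

# MA + O + O + O in Devanagari: odd, perhaps, but nothing to report.
print(problem_code_points("\u092E\u094B\u094B\u094B"))  # []
```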

But even marginally comprehensible departures from standard
orthographic rules can create serious problems for layout
engines in complex scripts. (See examples above.) So it
is relatively easy to come up with sequences in Indic
languages, for example, that will trigger fallback behavior
in even the most sophisticated of layout engines -- and in
that sense could be considered "buggy text". Either the text
sequence will hit some unexpected edge condition in the
layout engine, exercising an actual bug in the layout engine
itself, or it will be handled by *some* kind of fallback as
the only reasonable alternative when some limit is hit --
and at that point, the outcome is likely to be viewed as
"buggy display" by users of the script in question.

And much earlier in this thread (or the preceding threads that
led to this thread) I pointed out that expecting layout engines
to gracefully handle *cross*-script combinations of bases
and combining marks in complex scripts was an unreasonable
expectation. So such sequences, while in no sense illegal
in Unicode, would also constitute "buggy text" for which
layout engines won't do much other than fallback
display of the combining marks.
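
A tool that wanted to warn authors about such cross-script sequences
could do something along the following lines. This is a rough sketch
only. Python's standard unicodedata module does not expose the Unicode
Script property, so the code uses the first word of the character name
as a crude stand-in for the script (an assumption that is wrong for
plenty of characters), and it ignores generic marks whose names start
with "COMBINING". The function names are hypothetical.

```python
import unicodedata

def script_guess(ch):
    # Crude proxy for the Script property: first word of the character
    # name, e.g. "TIBETAN" for TIBETAN LETTER KA.
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def cross_script_marks(text):
    """Return (index, base script, mark script) for each combining mark
    whose guessed script differs from that of the preceding base."""
    mismatches = []
    base_script = None
    for i, ch in enumerate(text):
        if unicodedata.category(ch).startswith("M"):
            mark_script = script_guess(ch)
            if base_script and mark_script not in (base_script, "COMBINING"):
                mismatches.append((i, base_script, mark_script))
        else:
            base_script = script_guess(ch)
    return mismatches

# TIBETAN LETTER KA followed by BENGALI VOWEL SIGN O
print(cross_script_marks("\u0F40\u09CB"))  # [(1, 'TIBETAN', 'BENGALI')]
```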

> > fidelity, so making them aware of the problem leads to more
> > robust interchange.
>
> Which problem. Indicating such things as spelling errors is not
> the business of the low level text renderer.

Nor is the faithful rendering of Bengali two-part vowels
around Tibetan consonant stacks, if that happens to be
the "spelling error" in question.

I suspect what is really going on here, for Indic scripts,
at least, is that there are some nonstandard orthographic
extensions (that Sinnathurai Srivas, for example, has alluded
to for Tamil), in which users make use of a convention of
simply adding more right-side vowel matras to indicate
prosodic prolongation. This is the same kind of thing you
see all the time in Japanese manga, for example, or informally
in English and other alphabetic languages, to indicate
reeeeeeeally loooooong vowels.

I think the reasonable thing for developers of layout engines
and fonts to do is to investigate this kind of orthographic
convention and user expectations about it, and then (perhaps)
adjust their implementations, so that if a user types
in MA + O + O + O + O and expects to just get a long stretched-out
display of a sequence of O's to indicate that this is "moooooooo"
instead of "mo", it will just display correctly without
any particular fallbacks required. For the right-side combining marks,
at least, which both users and implementations treat a little
more like ordinary letters than other combining marks, that
might be a good extension to make.

--Ken

