RE: Generic base characters

From: Kent Karlsson (kent.karlsson14@comhem.se)
Date: Mon Jul 16 2007 - 16:54:12 CDT

Next message: Kenneth Whistler: "Re: Generic base characters"

Previous message: Raymond Mercier: "Re: Generic base characters"
In reply to: Kenneth Whistler: "RE: Generic base characters"
Next in thread: Kenneth Whistler: "RE: Generic base characters"

Kenneth Whistler wrote:
> > Writing <DEVANAGARI LETTER MA, DEVANAGARI
> > VOWEL SIGN O, DEVANAGARI VOWEL SIGN O, DEVANAGARI VOWEL SIGN O>
> > may be an error (or maybe not), but the rendering engine+font
> > should not care.
>
> Ah, but it might, precisely because the layout of Devanagari
> is not as straightforward as English. And if the rendering is
> making assumptions about aksara construction (as it might, in
> order to do rendering correctly and to map into ligatures in

I would not expect ligatures to work beyond the substrings that
commonly occur in the script (and sometimes they do not
even work for those; see the fj example).

> fonts, and so on), it should detect a defective boundary in
> that sequence and flag it for special treatment.

And that happens also for plain ASCII text like this:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

(That was a SINGLE, though rather long, row of x's; there was no
line break inside it when I wrote it.)

> Now it is debatable exactly *what* that special treatment
> should be.
>
> One reasonable position to take is that aksaras in a complex
> Indic script outside the canonical structuring rules for an aksara
> should be rendered with fallbacks that treat each combining
> mark that doesn't "fit" as a separate layout unit:
>
> {MA + O}, {O}, {O}
>
> And then you have to decide how to display those two extra
> matras that don't have an effective base (even though
> formally, of course, MA is the base character of the
> combining character sequence here).

Much like <a, combining diaeresis, combining diaeresis, combining
diaeresis>, though horizontal instead of vertical.
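
To make the fallback segmentation quoted above concrete, here is a minimal
sketch in Python. It is only an illustration of the {MA + O}, {O}, {O}
grouping, not how any actual layout engine does it. The function name
fallback_units and the max_marks parameter are made up for this example, and
treating every character of general category M* as a "matra" is a
simplifying assumption.

```python
import unicodedata

def fallback_units(text, max_marks=1):
    """Split text into layout units with at most max_marks marks per base.

    Hypothetical helper: any mark beyond the limit becomes a unit of its own,
    which a renderer could then show on a dotted circle, on a blank, or
    however it chooses.
    """
    units = []
    marks_in_unit = 0
    for ch in text:
        # Assumption: general category Mn/Mc/Me stands in for "combining mark".
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and units and marks_in_unit < max_marks:
            units[-1] += ch            # mark still "fits" on the current base
            marks_in_unit += 1
        elif is_mark:
            units.append(ch)           # extra (or leading) mark: separate unit
            marks_in_unit = max_marks
        else:
            units.append(ch)           # a new base starts a new unit
            marks_in_unit = 0
    return units

MA, O = "\u092E", "\u094B"             # DEVANAGARI LETTER MA, VOWEL SIGN O
devanagari_units = fallback_units(MA + O + O + O)
latin_units = fallback_units("a\u0308\u0308\u0308")   # a + 3 diaereses
print([ascii(u) for u in devanagari_units])
```

The same function splits the Latin sequence in the same way, which is the
point of the comparison: nothing in the segmentation itself is specific to
Devanagari.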

> One option is to display each on a dotted circle.

No, why?

> Another option is to display each on a blank.

No, why?

> Another reasonable position to take would be that extra
> matras for an aksara are intentional "misspellings" that
> users might introduce for effect, and the rendering engine
> ought to attempt to rendering them as part of an aksara,
> either by joining them in sequence, or by default stacking
> rules (depending on their placement, of course). But doing

Indeed.

> that requires extension to layout engines and an assessment
> of whether the tradeoffs involved are worthwhile and in
> the end are what users expect and require for their text layout.
>
> Also, the Devanagari example doesn't express the entire
> complication here, because it limits itself to an example
> where the solution involving extending the layout engine
> (as opposed to fallback display of individual matras
> on dotted circles or blanks) is easy to visualize.
> What if the sequence were, instead:
>
> MA + I + I + I
>
> Then what? In Devanagari, the I-matra reorders to the left
> around the MA (and possibly other units as well, if present).
> So is the "reasonable" position now to treat this for
> display as:
>
> {I + MA} + {I} + {I}

No, why?

> and use fallback display for the two extra matras?
>
> Or is the "reasonable" position to require indefinite
> leftward reordering of the layout engine, to get:
>
> {I + I + I + MA}

Surely. They should work like any character of combining class 224
(Left), i.e. stack to the left. As long as this stays within
a line (with some not-too-small preset maximum), there should
be no problem. (It would have been better to just give
the reordrant vowels combining class 224 rather than 0!)
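
A rough sketch of what "stack to the left" could look like, in Python. This
is an illustration under stated assumptions, not a description of any
existing engine: the function visual_order and the set LEFT_REORDRANT are
invented for the example, the set is a small hand-picked sample, and the
input is assumed to be one simple base-plus-marks sequence (no conjuncts,
no virama handling). The hand-picked set is needed precisely because these
vowel signs have combining class 0, so the character database alone does not
identify them.

```python
import unicodedata

# Hand-picked sample of left-side vowel signs (an assumption of this sketch):
# DEVANAGARI VOWEL SIGN I, BENGALI VOWEL SIGN I, BENGALI VOWEL SIGN E,
# BENGALI VOWEL SIGN AI.
LEFT_REORDRANT = {"\u093F", "\u09BF", "\u09C7", "\u09C8"}

def visual_order(sequence):
    """Return the characters of one combining sequence in left-to-right
    visual order, moving every left-side mark to the left of everything
    placed so far."""
    visual = []
    for ch in sequence:
        if ch in LEFT_REORDRANT or unicodedata.combining(ch) == 224:
            visual.insert(0, ch)   # goes left of the sequence so far
        else:
            visual.append(ch)      # base and other marks keep logical order
    return visual

MA, I = "\u092E", "\u093F"         # DEVANAGARI LETTER MA, VOWEL SIGN I
ordered = visual_order(MA + I + I + I)
print([f"U+{ord(c):04X}" for c in ordered])

# The vowel sign itself carries combining class 0 in the character database.
i_matra_class = unicodedata.combining(I)
```

For MA + I + I + I this gives the {I + I + I + MA} arrangement; a real
implementation would also need the preset maximum mentioned above.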

> Or take the Devanagari example MA + O + O + O, but try it
> in Bengali, instead. The Bengali -O matra is a two-part
> vowel, with a left side *and* a right side piece. So
> what is the "reasonable" display of Bengali:
>
> MA + O + O + O

This O has a canonical decomposition into a left reordrant
character and a right-side character. The left part would, as
above, reorder like a combining class 224 character
to the left of the combining sequence so far.
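
The decomposition can be checked with Python's unicodedata module, and the
reordering applied to the decomposed form. This is a sketch only: the names
render_order and LEFT_PARTS are made up for the example, LEFT_PARTS lists
just the one left-side piece needed here, and the input is assumed to be a
single simple combining sequence.

```python
import unicodedata

BENGALI_MA = "\u09AE"   # BENGALI LETTER MA
BENGALI_O = "\u09CB"    # BENGALI VOWEL SIGN O (two-part vowel)

# Canonical decomposition: VOWEL SIGN E (left) + VOWEL SIGN AA (right).
o_parts = unicodedata.normalize("NFD", BENGALI_O)

# Assumption for this sketch: only the left piece of O needs listing.
LEFT_PARTS = {"\u09C7"}  # BENGALI VOWEL SIGN E

def render_order(sequence):
    """Decompose, then place each left piece to the left of everything
    placed so far; all other characters stay in logical order."""
    visual = []
    for ch in unicodedata.normalize("NFD", sequence):
        if ch in LEFT_PARTS:
            visual.insert(0, ch)
        else:
            visual.append(ch)
    return visual

moo = render_order(BENGALI_MA + BENGALI_O * 3)
print([f"U+{ord(c):04X}" for c in moo])
```

For MA + O + O + O this yields three left pieces, then MA, then three
right pieces.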

Unfortunately, some of the later-encoded scripts with
two-side vowels lack a decomposition into left- and right-side
characters for those vowels, so one will then need some
other mechanism to represent the left and right parts
(PUA code points or extra bits somewhere).

...
> Nor is the business of faithful rendering of Bengali two-part vowels
> around Tibetan consonant stacks,

Why should that be a problem in principle? Ignoring ligatures,
which I would think should not happen cross-script, the
reordering per se would depend only on the combining class
(or the should-have-been combining class, for the Indic vowels
that have cc 0), not on the script. Font boundaries per se
may be a problem, though: keeping track of and getting the right
font (not that fonts should handle the reordering). But we have
the same problem for bidi.

> I suspect what is really going on here, for Indic scripts,
> at least, is that there are some nonstandard orthographic
> extensions (that Sinnathurai Srivas, for example, has alluded
> to for Tamil), in which users make use of a convention of
> simply adding more right-side vowel matras to indicate
> prosodic prolongation. This is the same kind of thing you
> see all the time in Japanese manga, for example, or informally
> in English and other alphabetic languages, to indicate
> reeeeeeeally loooooong vowels.
>
> I think the reasonable thing for developers of layout engines
> and fonts to do is to investigate this kind of orthographic
> convention and user expectations about it, and then (perhaps)
> adjust their implementations, so that if a user types
> in MA + O + O + O + O and expects to just get a long stretched
> out display of a sequences of O's to indicate this is "moooooooo",
> instead of "mo", that this will just display correctly without
> any particular fallbacks required. For the rightside combining marks,
> at least, which both users and implementations treat a little
> more like ordinary letters than other combining marks, that
> might be a good extension to make.

At the very least. But I don't think there is any real obstacle
to going further than that.

/kent k

