RE: Generic base characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 16 2007 - 19:04:30 CDT

    Kent Karlsson wrote:

    > > > Writing <DEVANAGARI LETTER MA, DEVANAGARI
    > > > VOWEL SIGN O, DEVANAGARI VOWEL SIGN O, DEVANAGARI VOWEL SIGN O>
    > > > may be an error (or maybe not), but the rendering engine+font
    > > > should not care.
    > >
    > > Ah, but it might, precisely because the layout of Devanagari
    > > is not as straightforward as English. And if the rendering is
    > > making assumptions about aksara construction (as it might, in
    > > order to do rendering correctly and to map into ligatures in
    >
    > I would not expect ligatures to work beyond the substrings that
    > are commonly occurring in the script (and sometimes it does not
    > even work for that, see the fj example).

    I think you are missing the point here. The domain of
    conjunct ligatures in Devanagari is the aksara. You parse for
    aksara boundaries and don't attempt to map into ligature space
    across those boundaries. That is rather different from how
    ligatures work for Latin or for Arabic, for that matter.

    The "substrings" that matter for Devanagari are the aksaras,
    and that is one of the reasons why the rendering is concerned
    with the valid (and defective) aksara boundaries.
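
    A minimal sketch of what "parsing for aksara boundaries" could look like,
    offered only as an illustration. The character classes below are a
    deliberately simplified assumption (consonants, virama, and dependent
    vowel signs only), and `split_units` is a hypothetical helper, not any
    real shaping engine's API. A matra that finds no free base is flagged as
    a defective unit.

```python
# Simplified Devanagari classes (an assumption; real shapers use fuller tables).
VIRAMA = "\u094D"

def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"      # KA..HA

def is_matra(ch):
    return "\u093E" <= ch <= "\u094C"      # vowel signs AA..AU

def split_units(text):
    """Split text into (unit, defective) pairs; a unit takes at most one matra."""
    units = []                 # each entry: [string, defective flag]
    accepts_matra = False      # can the last unit still take a matra?
    after_virama = False       # did the last unit end in consonant + virama?
    for ch in text:
        if is_consonant(ch):
            if after_virama:
                units[-1][0] += ch          # conjunct stays in the same aksara
            else:
                units.append([ch, False])
            accepts_matra, after_virama = True, False
        elif ch == VIRAMA and accepts_matra:
            units[-1][0] += ch
            accepts_matra, after_virama = False, True
        elif is_matra(ch) and accepts_matra:
            units[-1][0] += ch
            accepts_matra = False
        else:
            # Anything that does not fit starts its own unit; a stray matra
            # or virama is marked defective so the renderer can fall back.
            units.append([ch, is_matra(ch) or ch == VIRAMA])
            accepts_matra = after_virama = False
    return [(s, d) for s, d in units]

# MA + O + O + O
print(split_units("\u092E\u094B\u094B\u094B"))
```

    Under these assumptions the MA + O + O + O sequence comes out as one
    well-formed unit followed by two defective ones.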

    >
    > > fonts, and so on), it should detect a defective boundary in
    > > that sequence and flag it for special treatment.
    >
    > And that happens also for plain ASCII text like this:
          ^^^^
          ???
    > xxxx...
    ...
    > ...xxxx
    >
    > (that was a SINGLE, but rather long row of x:es, no line break
    > inside when I wrote it)

    I think your analogy is a stretch here. Aksara boundary detection
    is rather different from linebreaking. And the Devanagari MA + O + O
    case is one in which a defective aksara boundary occurs that
    you think should be ignored in rendering and where you disagree
    that fallback rendering should occur. The "xxxxx" example is a
    line (which could be any script for that matter -- there is
    nothing inherently Latin about the example) which has no natural
    line break opportunity, and which therefore triggers whatever
    the fallback emergency linebreak routine determines is required
    so as not to overrun the allotted display line.
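
    For contrast, an emergency linebreak fallback might be sketched like
    this. It is an illustration only: `break_line` is a hypothetical helper,
    it treats the space as the only break opportunity, and it measures width
    in characters rather than display cells.

```python
def break_line(text, width):
    """Break at the last space that fits; with no such space, cut at `width`."""
    lines = []
    while len(text) > width:
        cut = text.rfind(" ", 0, width + 1)
        if cut <= 0:
            # No natural break opportunity: emergency fallback.
            lines.append(text[:width])
            text = text[width:]
        else:
            lines.append(text[:cut])
            text = text[cut + 1:]      # drop the space used as the break
    lines.append(text)
    return lines

print(break_line("x" * 25, 10))
```

    Nothing here inspects combining sequences or script behavior, which is
    the sense in which the two kinds of "fallback" differ.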

    So both examples include concepts of "boundary" and of "fallback",
    but other than that are completely different in their implications.

    > > Now it is debatable exactly *what* that special treatment
    > > should be.
    > >
    > > One reasonable position to take is that aksaras in a complex
    > > Indic script outside the canonical structuring rules for an aksara
    > > should be rendered with fallbacks that treat each combining
    > > mark that doesn't "fit" as a separate layout unit:
    > >
    > > {MA + O}, {O}, {O}
    > >
    > > And then you have to decide how to display those two extra
    > > matras that don't have an effective base (even though
    > > formally, of course, MA is the base character of the
    > > combining character sequence here).
    >
    > Much similar to <a, c. diaeresis, c. diaeresis, c. diaeresis>,
    > though horizontal instead of vertical.
    >
    > > One option is to display each on a dotted circle.
    >
    > No, why?
    >
    > > Another option is to display each on a blank.
    >
    > No, why?

    Why not? Reasonable people disagree. And when reasonable people
    disagree about cases like this, the usual compromise solution
    is to give them choices, so they can get things to display
    the way they want to.

    But I certainly don't think that the Unicode Standard is
    ever going to mandate that such options as displaying dotted circles
    for combining marks that don't fit into canonical aksara
    structure must be avoided.
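
    Giving people a choice could be as simple as the following sketch. The
    function name and the style names are assumptions made up for this
    illustration; only the two code points (U+25CC DOTTED CIRCLE and
    U+00A0 NO-BREAK SPACE) are real.

```python
DOTTED_CIRCLE = "\u25CC"
NO_BREAK_SPACE = "\u00A0"

def with_fallback_base(mark, style="dotted_circle"):
    """Give a stray combining mark a visible or blank base, per user choice."""
    bases = {"dotted_circle": DOTTED_CIRCLE, "blank": NO_BREAK_SPACE}
    return bases[style] + mark

# A stray DEVANAGARI VOWEL SIGN O, shown both ways.
print(repr(with_fallback_base("\u094B")))
print(repr(with_fallback_base("\u094B", style="blank")))
```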

    > > Another reasonable position to take would be that extra
    > > matras for an aksara are intentional "misspellings" that
    > > users might introduce for effect, and the rendering engine
    > > ought to attempt to render them as part of an aksara,
    > > either by joining them in sequence, or by default stacking
    > > rules (depending on their placement, of course). But doing
    >
    > Indeed.
    >

    > > What if the sequence were, instead:
    > >
    > > MA + I + I + I
    > >
    > > Then what? In Devanagari, the I-matra reorders to the left
    > > around the MA (and possibly other units as well, if present).
    > > So is the "reasonable" position now to treat this for
    > > display as:
    > >
    > > {I + MA} + {I} + {I}
    >
    > No, why?

    Why not?
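
    The {I + MA} + {I} + {I} option can be sketched in glyph order as
    follows. This is an illustration under simplifying assumptions: only the
    I-matra reorders, the base is a single character, and `glyph_order` is a
    hypothetical helper rather than part of any layout engine.

```python
PREBASE_I = "\u093F"       # DEVANAGARI VOWEL SIGN I
DOTTED_CIRCLE = "\u25CC"

def glyph_order(base, marks):
    """First mark joins the base; each extra mark falls back to a dotted circle."""
    def place(b, m):
        # The I-matra is drawn to the left of its base; other marks follow it.
        return [m, b] if m == PREBASE_I else [b, m]
    if not marks:
        return [[base]]
    units = [place(base, marks[0])]
    units += [place(DOTTED_CIRCLE, m) for m in marks[1:]]
    return units

# MA + I + I + I
print(glyph_order("\u092E", [PREBASE_I] * 3))
```

    The reordering is bounded to one position per unit, so no indefinite
    leftward movement is ever needed.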

    > > and use fallback display for the two extra matras?
    > >
    > > Or is the "reasonable" position to require indefinite
    > > leftward reordering of the layout engine, to get:
    > >
    > > {I + I + I + MA}
    >
    > Surely. They should work like any combining category 224
    > character, i.e. stack to the left.

    Combining category 224 characters *don't* "stack to the left".
    I defy you to find any part of the Unicode Standard that
    does now or ever has required that left-side combining
    marks stack leftward. That is just an unreasonable thing to require
    of rendering engines.

    There are only two of them, by the way, that have ever been
    defined in the standard.

    302E..302F ; 224 # Mn [2] HANGUL SINGLE DOT TONE MARK..HANGUL DOUBLE DOT TONE MARK

    And just as for left side Indic matra combining marks
    (combining class 0), there isn't any reasonable, meaningful
    text reason to stack these. If I encounter a Hangul
    syllable followed by 6 single dot tone marks, I would
    fully expect a renderer to bail after the first of those
    was displayed to the left of the Hangul syllable, and have
    the next 5 be displayed with fallback on a dotted circle
    (or otherwise).
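
    The combining classes can be checked with Python's standard
    `unicodedata` module, and the "bail after the first" behavior sketched
    on top of that. `split_left_marks` and its `limit` parameter are
    assumptions invented for this illustration.

```python
import unicodedata

def split_left_marks(cluster, limit=1):
    """Return (base, attached, fallback) for the ccc=224 marks in a cluster."""
    base, marks = cluster[0], cluster[1:]
    left = [m for m in marks if unicodedata.combining(m) == 224]
    # Attach at most `limit` marks; the rest get fallback display.
    return base, left[:limit], left[limit:]

print(unicodedata.combining("\u302E"))   # HANGUL SINGLE DOT TONE MARK
print(unicodedata.combining("\u093F"))   # DEVANAGARI VOWEL SIGN I

# A Hangul syllable followed by six single dot tone marks.
base, attached, fallback = split_left_marks("\uD55C" + "\u302E" * 6)
print(len(attached), len(fallback))
```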

    > As long as this stays within
    > a line (with some not-too-small preset max), there should
    > be no problem. (It would have been better to just give
    > the reordrant vowels cc 224 rather than 0!)

    Mistaken premise. I'm willing to bet that there is indeed
    a problem with expecting rendering engines to stack
    ccc=224 marks indefinitely.

    > > Or take the Devanagari example MA + O + O + O, but try it
    > > in Bengali, instead. The Bengali -O matra is a two-part
    > > vowel, with a left side *and* a right side piece. So
    > > what is the "reasonable" display of Bengali:
    > >
    > > MA + O + O + O
    >
    > This O has a canonical decomposition into a left reordrant
    > and right-side character pair. The left part would, as
    > above, reorder as a combining category 224 character
    > to the left of the combining sequence so far.

    Nope.

    > Unfortunately, some of the later encoded scripts with
    > two-side vowels lack a decomposition to left and right
    > side characters for those two-side vowels, so then one
    > will need some other mechanism to represent the left
    > and right parts (PUA code points or extra bits somewhere).

    You're talking about Khmer, presumably. But it shouldn't
    matter one way or the other whether there is a decomposition.
    The canonical equivalences in the other Indic cases mean that
    the reordering on display occurs *whether or not* the
    character backing store is decomposed, and the reordering
    happens in glyph space, anyway, not in character space.
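
    The equivalence is easy to see with Python's standard `unicodedata`
    module. This is only an illustration: `canonical_form` is a hypothetical
    stand-in for whatever a renderer does before mapping to glyphs, and
    U+17BE is picked as one example of a Khmer two-part vowel.

```python
import unicodedata

composed = "\u09AE\u09CB"            # BENGALI MA + VOWEL SIGN O
decomposed = "\u09AE\u09C7\u09BE"    # BENGALI MA + VOWEL SIGN E + VOWEL SIGN AA

def canonical_form(text):
    """Normalize so both spellings reach the glyph stage identically."""
    return unicodedata.normalize("NFD", text)

print(canonical_form(composed) == canonical_form(decomposed))

# Bengali O carries a canonical decomposition; Khmer OE (U+17BE) has none.
print(repr(unicodedata.decomposition("\u09CB")))
print(repr(unicodedata.decomposition("\u17BE")))
```

    Either way the left and right pieces are positioned as glyphs, so the
    absence of a decomposition changes nothing about where they are drawn.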

    > ...
    > > Nor is the business of faithful rendering of Bengali two-part vowels
    > > around Tibetan consonant stacks,
    >
    > Why should that be a problem in principle? Ignoring ligatures,
    > which I would think should not happen cross-script, treating
    > the reordering per se would depend only on the combining category,

    That is your basic mistake, I think. Reordering depends on
    the context of script behavior. And if you mix script boundaries
    across what is otherwise complex rendering, it is perfectly
    valid for a rendering engine to raise an exception and say,
    effectively, I can't do that -- just as reasonable as saying
    it doesn't know how to ligate Bengali to Tibetan.

    > or should-have-been combining categories (for the Indic vowels
    > that have cc 0), not on the script, though font boundaries per
    > se may be a problem (keeping track of and getting the right font,
    > not that fonts should handle the reordering; but we have the
    > same problem for bidi).
    >

    > > I think the reasonable thing for developers of layout engines
    > > and fonts to do is to investigate this kind of orthographic
    > > convention and user expectations about it, and then (perhaps)
    > > adjust their implementations, so that if a user types
    > > in MA + O + O + O + O and expects to just get a long stretched
    > > out display of a sequence of O's to indicate this is "moooooooo",
    > > instead of "mo", that this will just display correctly without
    > > any particular fallbacks required. For the rightside combining marks,
    > > at least, which both users and implementations treat a little
    > > more like ordinary letters than other combining marks, that
    > > might be a good extension to make.
    >
    > At the very least. But I don't think there is any real obstacle
    > for going further than that.

    And I rather suspect that there are significant obstacles.

    --Ken



    This archive was generated by hypermail 2.1.5 : Mon Jul 16 2007 - 19:06:10 CDT