Re: Generic base characters

From: Asmus Freytag
Date: Mon Jul 16 2007 - 12:14:21 CDT

    On 7/16/2007 9:18 AM, Kent Karlsson wrote:
    > Asmus Freytag wrote:
    >> Such a missing base character is a bug in the text.
    > Yes, but...
    >> Despite
    >> the recommended fallback that you describe, the policy of
    >> making that visible to the author by inserting a dotted
    >> circle is, in principle, reasonable.
    > 1) I'm not so sure about that. It's better to have a single defined
    > behaviour (assuming the characters in question are at all supported).
    In cases like this, you not only have the question of which *characters*
    are supported, but also which *character sequences* are supported. Just
    as a font designed for some language other than Swedish might have
    glyphs for f and j and support the fi and fl ligatures, yet not support
    an fj ligature, other parts of the layout system may legitimately not
    support some sequences even if they support
    each letter and similar sequences. This is not a conformance issue but
    one of quality and scope of an implementation.
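
    The distinction between supporting characters and supporting sequences
    can be sketched with a toy shaper. Everything here is hypothetical: the
    GLYPHS and LIGATURES tables and the shape() function are invented for
    illustration and do not describe any real font or layout engine.

```python
# Hypothetical font data, invented for illustration only.
GLYPHS = {"f", "i", "j", "l"}   # individual letters the font covers
LIGATURES = {"fi", "fl"}        # sequences the font covers; note: no "fj"


def shape(text):
    """Greedy left-to-right shaping: use a ligature if one exists,
    otherwise fall back to single glyphs, otherwise to .notdef."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in LIGATURES:
            out.append(pair)        # sequence-level support
            i += 2
        elif text[i] in GLYPHS:
            out.append(text[i])     # character-level support only
            i += 1
        else:
            out.append(".notdef")   # character not supported at all
            i += 1
    return out


print(shape("fi"))  # one ligature glyph
print(shape("fj"))  # two separate glyphs: each letter is supported, the sequence is not
```

    In this sketch "fj" still renders, just without a ligature, which is
    a matter of quality and scope rather than conformance.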
    > 2) NBSP base is for sequences of combining characters preceded
    > by beginning of string or by a control char. I think using NBSP
    > as the implicit base in such cases is a reasonable behaviour.
    > (Inserting a dotted circle is not.)
    I've always understood that recommendation to be aimed at preventing the
    combining mark from being handled in completely weird ways, e.g. by
    trying to overhang it into empty space at the beginning of a line. I see
    nothing in the standard that prevents a higher-level protocol, such as
    Uniscribe, from overriding this behavior.
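
    As a rough illustration of the two policies under discussion, the
    sketch below supplies an implicit base for combining marks that have
    none. It simplifies along the lines of Kent's point 2: a mark counts as
    lacking a base only at the beginning of the string or directly after a
    control character (general category Cc). The function names and the
    base parameter are assumptions made for this example, not an API of any
    real renderer.

```python
import unicodedata

NBSP = "\u00A0"           # NO-BREAK SPACE, the recommended implicit base
DOTTED_CIRCLE = "\u25CC"  # DOTTED CIRCLE, the visible alternative


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def lacks_base(text, i):
    """True if text[i] is a mark at the start of the string or directly
    after a control character (simplified rule, see lead-in)."""
    if not is_mark(text[i]):
        return False
    if i == 0:
        return True
    return unicodedata.category(text[i - 1]) == "Cc"


def with_fallback_base(text, base=NBSP):
    """Insert `base` before every mark that lacks a base."""
    out = []
    for i, ch in enumerate(text):
        if lacks_base(text, i):
            out.append(base)
        out.append(ch)
    return "".join(out)


# A leading acute accent gets an implicit base; "x" + acute is left alone,
# because a base from another script or a symbol is still a base.
print(ascii(with_fallback_base("\u0301a")))
print(ascii(with_fallback_base("\n\u0308", base=DOTTED_CIRCLE)))
print(ascii(with_fallback_base("x\u0301")))
```

    Making the base a parameter mirrors the point that a higher-level
    protocol can choose a different fallback than the default.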
    > 3) This thread started talking about there actually being a base
    > present in the text just before the combining sequence, just that
    > the base was in another script (or some symbol/punctuation).
    > That is not an error case from a text rendering point of view.
    > There is no reason to start inserting dotted circle, NBSP,
    > or anything else. Ligation, kerning, positioning adjustments
    > are unlikely to work except for special cases, but some rough
    > approximation (assuming again that the individual characters at
    > all are supported by the rendering system and font used) should
    > be output.
    As I have pointed out, I regard the application of the policy to these
    cases as one of the 'issues', because it can lead to unintended (and
    limiting) results. But I can understand why layout engine creators don't
    want to support an anything-goes approach, because doing that at *high
    quality* is extremely expensive. That said, a better way to do the
    fallback would be appropriate. John's suggested list of generic bases is
    a good way to indicate a minimal level of support.
    >> Authors should not have
    >> an expectation of portably exchanging buggy text with perfect
    A buggy text is one that has missing base characters. That's how I meant
    this usage in my post. If you construed that differently based on some
    real or perceived deficiency in how I worded that, I'm sorry.
    > What is "buggy text" (from a rendering engine+font point of view)?
    > Can you give me an example in English? Writing moooose may be an
    > error (or maybe not), but the rendering engine+font should not care.
    > It may be a spelling error, but it is in no way "buggy text" at
    > the rendering level. Writing <DEVANAGARI LETTER MA, DEVANAGARI
    > may be an error (or maybe not), but the rendering engine+font
    > should not care. It may be a spelling error, but it is in no way
    > "buggy text" at the rendering level. There is a problem with
    > above/below combining characters in that proper stacking will
    > quickly go outside of the line or even page boundary. But that
    > is a problem of a different kind, and not buggy text per se.
    >> fidelity, so making them aware of the problem leads to more
    >> robust interchange.
    > Which problem? Indicating such things as spelling errors is not
    > the business of the low level text renderer.
    See comment above.

    But also, indicating that a renderer can't support something, *is*
    legitimately the business of the implementation. I think that software
    that uses fallbacks for diacritics and that can't raise stacked
    diacritics properly would be better off causing a visible clash or even
    spacing the combining marks than silently overstriking them. As another
    >> Now, there are several problems with this approach (depending
    >> on how it is implemented).
    >> If the policy leads to authors creating didactic texts that
    >> rely on the presence of the dotted circle, that is a problem.
    >> If the implementation prevents users from specifying some
    >> other reasonable base character, and insists on showing a dotted
    >> circle nevertheless, that prevents users from creating
    >> reasonable texts, limiting the functionality of the
    > A rendering system should have no opinion on what text is
    > "reasonable" or not.
    A text that it can't support is ipso facto unreasonable - the user needs
    to use a different layout engine for such texts or, where that is not
    feasible, get the supplier to augment their implementation to be less
    narrow. But none of that is a conformance issue.
    > I can understand that, for items that are 100% confusable
    > and that *should have been* canonically equivalent (but aren't),
    > one of the representations results in blurred text (in some
    > way, like some error glyphs being used).
    > But that needs to be defined in the standard so that everyone
    > makes the same choice of which *should-have-been* equivalent
    > to blur out, and which to show without blurring. But that is
    > not the case for say <underline, any combining visible character>,
    > nor for <latin letter small x, any combining visible character>.
    > Nor for <lao consonant, lao combining vowel, lao combining vowel,
    > lao combining vowel>.
    > However, <LAO VOWEL SIGN E, LAO VOWEL SIGN E> is 100% confusable
    > with LAO VOWEL SIGN EI, but they are not canonically equivalent
    > (as they should have been) so one needs to be blurred to avoid
    > spoofing possibilities. But neither of these characters is
    > combining!
    No further comment.
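
    For readers who want to check the Lao example, the two claims (no
    canonical equivalence, and neither character being a combining mark)
    can be verified against the Unicode Character Database shipped with
    Python. The helper name canonically_equivalent is an assumption made
    for this sketch.

```python
import unicodedata

E = "\u0EC0"   # LAO VOWEL SIGN E
EI = "\u0EC1"  # LAO VOWEL SIGN EI


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(unicodedata.name(E), "/", unicodedata.name(EI))

# The doubled E and the single EI look alike but normalize differently.
print(canonically_equivalent(E + E, EI))

# General category and canonical combining class of each character:
# a combining mark would have a category starting with "M".
for ch in (E, EI):
    print(unicodedata.category(ch), unicodedata.combining(ch))
```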
    >> implementation. Particularly egregious if an implementation
    >> prevents the user from providing a code point for the dotted
    >> circle explicitly.
    >>> There is no notion of "invalid"/"valid" base character for
    >> a combining
    >>> character in Unicode.
    >> But there is also no notion that an implementation has to
    >> support *all* sequences of characters. It is desirable to
    >> create implementations that don't get in the way of the
    >> users' needs, but in some cases, limiting the capabilities
    >> results in a more stable, more easily tested implementation
    >> that can deliver the *intended* support more correctly and at
    >> times also more cheaply.
    > I'm not sure exactly what this refers to.
    This is the crucial issue. As explained above.

    > But I understand that
    > not everything can be tested. Do spurious dotted circles mean
    > "we did not test this"?
    > /kent k
