Re: markup on combining characters

From: Philippe Verdy (
Date: Wed Sep 08 2004 - 02:49:16 CDT

    From: "Jony Rosenne" <>
    > Peter Kirk
    >> You mean, you would represent a black e with a red acute accent as
    >> something like "e", ZWJ, "<red>", IBC, acute, "</red>"? That
    >> looks like
    >> a nightmare for all kinds of processing and a nightmare for rendering.
    > No, it is more like <forecolor:black, combiningcolor:red> "e" "acute"
    > And there is no Unicode decision against it.

    And there is still no decision on whether this invisible base character
    will be added. It's only under public review for now, to address the first
    issue of rendering isolated non-spacing combining marks that currently
    don't have a spacing variant (I think it's a good idea, as it would avoid
    adding most of the missing ones, notably for the non-generic L/G/C
    combining marks).

    Note that your suggestion of:
       <forecolor:black, combiningcolor:red> "e" "acute"
    should also work with any normalized form of the same text, i.e. with:
       <forecolor:black, combiningcolor:red> "e with acute"
    where the combining mark is composed. The issue here is that this becomes
    tricky for renderers, which will need to decompose strings in normalized
    forms again before applying style.
    Basically I prefer Peter's solution with:
       "e", ZWJ?, "<red>", IBC, acute, "</red>"
    which is more independent of the normalization form. Then the question is
    whether the text within <red>...</red> markup should combine visually when
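
    To illustrate the normalization concern raised above, here is a minimal
    Python sketch of what a renderer would have to do for a style such as
    <forecolor:black, combiningcolor:red>. The helper name color_runs and its
    parameters are hypothetical, and treating every code point of general
    category M* as a "combining mark" is an assumption made for this sketch.

```python
import unicodedata

def color_runs(text, base_color="black", mark_color="red"):
    """Pair each code point of the NFD form of `text` with a colour.

    Hypothetical helper: code points whose general category starts with
    "M" (Mn, Mc, Me) get `mark_color`, everything else gets `base_color`.
    """
    # Decompose first, so a precomposed letter exposes its combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    runs = []
    for ch in decomposed:
        is_mark = unicodedata.category(ch).startswith("M")
        runs.append((ch, mark_color if is_mark else base_color))
    return runs

decomposed_input = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
composed_input = "\u00e9"      # precomposed LATIN SMALL LETTER E WITH ACUTE

# Both spellings yield the same styled runs only because of the NFD step.
print(color_runs(decomposed_input))
print(color_runs(composed_input))
```

    Without the NFD step, the composed input is a single code point and there
    is nothing separate to colour, which is the difficulty described above.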

    For now I see the proposed IBC (it has no name yet) only as a way to
    transform non-spacing combining marks into spacing non-combining variants
    when they don't exist separately in Unicode (so this would not be
    recommended for the non-spacing acute accent, which already has a spacing
    version that does not require a leading IBC).
    Technically, if an IBC character is added, a renderer will not necessarily
    render <IBC, non-spacing combining acute> the same way as <spacing
    non-combining acute accent>, even though it ideally should.
    In the preceding sentence, the "should" means that the existing spacing
    non-combining marks are left as the standard legacy way to encode them, and
    they normally don't combine when rendered after a base letter, even if
    there's markup around them (except if this markup explicitly says that they
    should combine).

    If I take the above example,
        "e", ZWJ?, "<red>", IBC, acute, "</red>"
    the same rich text should also be renderable without the markup in
    plain text, as if it were:
        "e", ZWJ?, IBC, acute
    i.e. (with the "should" above) as if it were also:
        "e", ZWJ?, spacing acute
    I have placed the "?" symbol after ZWJ to show that something would be
    necessary to allow this last text to remove the non-combining, spacing
    behavior of the spacing acute character. Without it, the text:
        "e", spacing acute
    or equivalently (with the "should" above):
        "e", IBC, combining acute
    would not be allowed to render a combined e with an acute, and two separate
    glyphs would be rendered, and two separate character entities interpreted
    (as they are today in legacy plain text).

    So the question remains of how to add markup on combining marks: the
    proposed IBC alone cannot solve such problems, unless there's an agreement
    that ZWJ immediately followed by IBC should be rendered as if they were not
    present (but in that case, a spacing acute becomes semantically and
    graphically distinct from <IBC, combining acute>; this is what will happen
    in any case with normalization forms due to the Unicode stability policy, as
    the existing spacing marks cannot be given new decompositions in NFD or
    NFKD forms).
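
    The normalization behaviour of the existing spacing acute can be checked
    with Python's standard unicodedata module. This is only an illustration
    using U+00B4 ACUTE ACCENT as the spacing mark; the constant names are
    mine. Note that U+00B4 has no canonical decomposition, so NFD leaves it
    alone, while its existing compatibility mapping turns it into SPACE plus
    the combining acute under NFKD. Either way the mapping is frozen, so it
    cannot be redefined in terms of a new IBC.

```python
import unicodedata

SPACING_ACUTE = "\u00b4"      # ACUTE ACCENT (spacing, non-combining)
COMBINING_ACUTE = "\u0301"    # COMBINING ACUTE ACCENT

# Canonical decomposition: the spacing acute is left unchanged.
nfd = unicodedata.normalize("NFD", SPACING_ACUTE)

# Compatibility decomposition: it maps to SPACE + combining acute.
nfkd = unicodedata.normalize("NFKD", SPACING_ACUTE)

print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfkd])
```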

    I also note that IBC is intended to replace the need to use a standard SPACE
    as the base character for building a spacing variant of combining marks when
    there's no standard spacing variant encoded in Unicode (this is a legacy
    hack, which causes various problems because of whitespace normalization in
    many plain-text formats or applications, or in XML and HTML, and the special
    word-breaking behavior of spaces). I don't see it as a way to deprecate the
    existing block of spacing marks.
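
    The whitespace problem with the SPACE hack is easy to reproduce. The
    sketch below uses ordinary Python string methods as a stand-in for the
    whitespace normalization that many text, XML and HTML pipelines perform;
    the variable names are only illustrative.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"
# Legacy hack: SPACE used as the base character of a "spacing" acute.
spacing_acute_hack = " " + COMBINING_ACUTE

# Trimming whitespace drops the base and leaves a bare combining mark.
trimmed = spacing_acute_hack.strip()

# Word splitting treats the SPACE base as a separator.
tokens = ("e" + spacing_acute_hack).split()

# Once the base is gone, the mark attaches to whatever precedes it.
rejoined = unicodedata.normalize("NFC", "".join(tokens))

print(repr(trimmed))
print(tokens)
print(repr(rejoined))
```

    After splitting and rejoining, the accent that was meant to stand alone
    ends up composed with the preceding "e".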
