Re: Combining across markup?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 12 2004 - 15:53:12 CDT


    From: "Doug Ewell" <dewell@adelphia.net>
    > The suggestion to add a "mark-color" capability to CSS might handle a
    > majority of the realistic situations where color is really understood to
    > be part of the textual content. Peter's two combining marks, a black
    > one in the actual manuscript and a red one added by the editor, sounds
    > less like a problem that Unicode or W3C need to worry about.

    Note: this message is quite long, sorry. It exposes several ideas to solve
    the problem of markup or styling of combining characters (below the Unicode
    character model).

    This is probably out of scope of Unicode itself, but the need to encode
    isolated diacritics in XML implies the need to be able to encode defective
    combining sequences in text elements or attribute values.

    Unfortunately, the only way to encode such a sequence in XML without
    risking the XML syntax when the document is normalized is to write the
    combining characters that start the defective combining sequence as
    (numeric or named) character references such as &#x300; or &acute;,
    whether in text elements or in attribute values. That way they cannot
    combine with the preceding quote mark (which opens an attribute value) or
    with the preceding closing angle bracket (which terminates the start tag
    that necessarily precedes any text, since text must be part of an
    element's content in a well-formed XML document).

    One drawback of this approach is that XML documents are subject to
    transformations (through DOM, SAX or similar APIs that can generate new
    documents or fragments), so not all valid plain-text sequences can be
    encoded safely in the same way.

    However:

    - Unicode also allows a defective sequence to be delimited by a preceding
    control character, such as an end-of-line control character.
    - well-formed XML documents have a limited set of control characters that
    can be inserted in the encoded XML syntax: CR, LF, TAB, NL... These control
    characters are considered "whitespace" in XML and are subject to an
    optional white-space normalization within text elements (but not in
    attribute values...)
    - the solution would then be to force the insertion of such a control
    character at the start of the plain-text-encoding of the XML attribute
    value, or at the start of the plain-text encoding of a text element (within
    the content of another element);
    - but then, technically, this control character becomes part of the text
    element content (unless an xml:space specifier indicates to the document
    parser that this character must be ignored as it is blank), or of the
    attribute value.

    So what can we do to allow encoding defective combining sequences in XML?
    - for attribute values, there's currently nothing we can do to avoid making
    this control character part of the actual value; this is a limitation of the
    XML syntax itself.
    - for text elements, we could have the container element specify that the
    leading control is not a whitespace but only necessary to make the document
    still well-formed after normalization. This could be a sort of
    xml:controldefective attribute added to the parent element, and that clearly
    indicates to the parser that the leading control must be removed from the
    effective text element content.
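
    To make the proposal concrete, here is a minimal Python sketch of what a
    consumer could do with such a marker. The xml:controldefective attribute
    is only the hypothetical name floated above (it is not part of any XML
    specification), and the "true" value and the effective_text helper are
    assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
# Hypothetical attribute from the proposal above; not defined by any standard.
GUARD_ATTR = f"{{{XML_NS}}}controldefective"


def effective_text(elem: ET.Element) -> str:
    """Return the element's text, dropping one leading guard character
    when the element carries the hypothetical marker attribute."""
    text = elem.text or ""
    if elem.get(GUARD_ATTR) == "true" and text[:1] in ("\n", "\r", "\t", " "):
        return text[1:]
    return text


# The line feed after '>' keeps the grave accent away from the markup.
marked = ET.fromstring('<text xml:controldefective="true">\n&#x300;</text>')
plain = ET.fromstring("<text>\n&#x300;</text>")

print(repr(effective_text(marked)))  # guard removed
print(repr(effective_text(plain)))   # guard kept: no marker attribute
```

    Without the marker the leading line feed stays part of the content, which
    is exactly the limitation described for attribute values.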

    All these seem to indicate that XML document generators can safely encode a
    document containing defective combining sequences, provided that they know
    that these sequences will be defective. This requires that XML document
    generators be able to detect them when they are leading text elements or
    attribute values.

    Another problem comes with elements whose content is marked to be
    normalizable (in the schema definition of the container element, or
    explicitly with container elements that specify a whitespace handling
    in their xml:space attribute): the XML whitespace normalization must
    not strip this leading control or whitespace, but must still normalize the
    whitespace in the rest of the encoded text string.
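
    A rough Python sketch of such a normalization follows. It assumes the
    guard is a single XML whitespace character immediately followed by a
    combining mark (general category M); the function names are made up for
    this example, and "collapse runs to one space and trim" is only one
    possible normalization.

```python
import re
import unicodedata

_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse(text: str) -> str:
    """Collapse runs of XML whitespace to one space and trim both ends."""
    return _WS_RUN.sub(" ", text).strip(" ")


def normalize_keep_guard(text: str) -> str:
    """Whitespace-normalize `text`, but keep a leading whitespace character
    that protects a following combining mark from the preceding markup."""
    guarded = (
        len(text) >= 2
        and text[0] in " \t\r\n"
        and unicodedata.category(text[1]).startswith("M")
    )
    if guarded:
        return text[0] + collapse(text[1:])
    return collapse(text)


print(repr(normalize_keep_guard("\n\u0300  red   mark ")))
print(repr(normalize_keep_guard("  plain   text ")))
```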

    Now comes the problem of creating documents with the markup necessary to
    give specific styles or colors to diacritics. A first natural approach is to
    wrap the encoded defective combining sequence in a styling XML element. If
    the XML document generator does not know that the combining sequence is
    defective, it may write the mark as plain text right after the tag's
    closing '>', with the consequences described above.

    The consequence is that XML document generators must be able to detect
    combining characters, and thus include at least a table of known combining
    characters. These must be encoded as character references, and not as plain
    text in the normal case (because they would behave badly when the document
    goes through Unicode normalization).
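
    As an illustration only, a generator could do something like the Python
    sketch below. It uses the Unicode general category (Mn, Mc, Me) from the
    standard unicodedata module as its "table" of combining characters; the
    helper names are invented for this example, and the rest of the text is
    assumed to have been XML-escaped already.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    """True for characters whose general category is Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def escape_leading_marks(text: str) -> str:
    """Write the run of combining marks at the start of `text` as numeric
    character references; the remainder is returned unchanged."""
    i = 0
    while i < len(text) and is_combining(text[i]):
        i += 1
    head = "".join(f"&#x{ord(ch):X};" for ch in text[:i])
    return head + text[i:]


# A defective sequence (lone grave accent) gets protected...
print(escape_leading_marks("\u0300 added by the editor"))
# ...while a mark that follows its base letter is left alone.
print(escape_leading_marks("e\u0300") == "e\u0300")
```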

    (Note that <![CDATA[...]]> sections will not help here, because the
    defective sequence will appear just after the second '[' with which it will
    combine in Unicode, possibly creating a combining sequence that breaks the
    XML syntax if a Unicode normalization is applied to the document!).

    If XML document generators (or editors...) are made aware of this problem,
    then they will safely encode things like:

        <?xml version="1.0"?>
        <document>
           normal text and a letter e with a colored grave accent:
           e<text style="color:red;">&#x300;</text>.
        </document>

    or (less natural, because the document content without markup now excludes
    the diacritic):

        <?xml version="1.0"?>
        <document>
           normal text and a letter e with a colored grave accent:
           e<diacritic style="color:red;" value="&#x300;"/>.
        </document>

    The key issue solved here is that &#x300; MUST NOT be replaced by its
    plain-text equivalent, or it will create a non-defective combining sequence
    that starts at the closing '>' (first example) or the leading double-quote
    (second example). That sequence could be transformed by Unicode
    normalization (or by reencoding to a charset other than a Unicode UTF...)
    applied to the whole document, which could then break the well-formedness
    of the document's XML syntax.
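
    The effect can be reproduced with Python's standard library. The sketch
    below swaps the grave accent for U+0338 COMBINING LONG SOLIDUS OVERLAY,
    because that mark has a canonical composition with '>' (giving U+226F),
    so NFC really does swallow the tag delimiter; the <text> element and the
    is_well_formed helper are just names chosen for the demonstration.

```python
import unicodedata
import xml.etree.ElementTree as ET

raw = "<text>\u0338</text>"      # mark written as plain text after '>'
safe = "<text>&#x338;</text>"    # mark written as a character reference


def is_well_formed(doc: str) -> bool:
    """Report whether the XML parser accepts `doc`."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Normalize each whole document, markup included, as a naive tool might.
raw_nfc = unicodedata.normalize("NFC", raw)
safe_nfc = unicodedata.normalize("NFC", safe)

print(is_well_formed(raw), is_well_formed(raw_nfc))
print(is_well_formed(safe), is_well_formed(safe_nfc))
print(ascii(raw_nfc))
```

    The plain-text version parses before normalization but not after it,
    while the version using a character reference is pure ASCII and passes
    through NFC untouched.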

    So, up to this point, we have solved the first problem: being able to
    represent XML data containing defective combining sequences. But the bad
    thing is that we have broken the text into separate text entities.

    This is where document authors are left with many ambiguities about how to
    "join" visually these rendered entities, as there is now no place to
    correctly position the separate diacritic with the last letter of the
    previous text element.

    Without knowledge or explicit specification of the precise font's {face,
    size, style, weight} used to render the document, styling the diacritic with
    additional positioning becomes impossible. This task should then be
    performed by the document's renderer once it has this information, but
    a classic XML renderer (including HTML renderers in browsers) will behave
    very poorly here, because it renders each substring with a separate
    invocation of the plain-text renderer. An HTML renderer normally computes a
    bounding box for each text fragment and positions these boxes side by side
    on rows, but the bounding box of the diacritic should not follow this
    convention: it breaks the box model of HTML (and of similar rich-text
    formats, including RTF or Word documents, or even PDF files).

    So let's suppose that we won't attempt to style diacritics separately, so
    that we will not need to encode defective combining sequences. The styling
    information for the diacritic must then be coded outside of the combining
    sequence, and not mixed within it:

        <?xml version="1.0"?>
        <document>
           normal text and a letter e with a colored grave accent:
           <span class="special">e&#x300;</span>.
        </document>

    Styling is then applied to the combining sequence as a whole, which
    preserves the Unicode character model. But this requires more capabilities
    from the style language, in order to give separate styles to substrings of
    the same text element. CSS cannot do this today; it would need something
    like:

        <style type="text/css">
           .special #text[1] { color: red; }
        </style>

    with a syntax that selects only the second character (at index 1) of a
    text element... This becomes tricky once the document itself is
    normalized, because the selector must pick items within combining sequences,
    something that CSS is not prepared to do naturally, notably because the main
    document to style could have been transformed by a Unicode normalizer or a
    charset reencoder:

        <?xml version="1.0"?>
        <document>
           normal text and a letter e with a colored grave accent:
           <span class="special">è</span>.
        </document>

    Note that Unicode normalization or charset reencoding generates an XML
    document that is not technically equivalent to the previous version of the
    document: XML considers a decomposed "e&#x300;" distinct from a precomposed
    "è"...
    In the example above, there's now a single precomposed character 'è', and
    the CSS styler would need to understand that it should apply styling to the
    accent coded within the same single character. The above sample syntax for
    CSS will not work reliably.
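
    The mismatch is easy to see with Python's unicodedata module. This short
    sketch only checks code point counts; it says nothing about how a real
    CSS engine indexes text.

```python
import unicodedata

decomposed = "e\u0300"                             # 'e' + combining grave
composed = unicodedata.normalize("NFC", decomposed)

# After NFC there is a single code point, so "index 1" no longer exists.
print(len(decomposed), len(composed))
print(composed == "\u00e8")

# NFD restores the two-code-point form, whichever form was stored.
print(unicodedata.normalize("NFD", composed) == decomposed)
```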

    So we need something even smarter in the CSS style language, to select text
    items that are coded below the Unicode character level and below the
    rich-text-format character model! Here we must give to CSS the capability of
    specifying the normalization form to apply before selecting text items, or
    give to CSS some additional meta-selectors that allow selecting items like
    "diacritics-only".

    The meta-selector route seems tricky, as there would be an unbounded number
    of candidates. Specifying the normalization form to apply seems much
    simpler:

        <style type="text/css">
           .special #text.NFD[1] { color: red; }
        </style>
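
    One possible reading of this hypothetical selector, sketched in Python:
    normalize the text node to NFD first, then attach the style to the code
    point at the given index. The function name and the (substring, style)
    run representation are assumptions for illustration; no CSS
    implementation offers such a selector.

```python
import unicodedata
from typing import List, Optional, Tuple

Run = Tuple[str, Optional[str]]


def style_nfd_index(text: str, index: int, style: str) -> List[Run]:
    """Normalize `text` to NFD, then split it into (substring, style) runs
    where only the code point at `index` carries `style`."""
    nfd = unicodedata.normalize("NFD", text)
    if not 0 <= index < len(nfd):
        # Index out of range: nothing is styled.
        return [(nfd, None)] if nfd else []
    runs: List[Run] = []
    if nfd[:index]:
        runs.append((nfd[:index], None))
    runs.append((nfd[index], style))
    if nfd[index + 1:]:
        runs.append((nfd[index + 1:], None))
    return runs


# Precomposed and decomposed input give the same styled runs.
print(style_nfd_index("\u00e8", 1, "color: red"))
print(style_nfd_index("e\u0300", 1, "color: red"))
```

    Because the index is applied after normalization, the result no longer
    depends on whether the stored document was composed or decomposed.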

    So we have something workable with an example XML/HTML document:

        <?xml version="1.0"?>
        <html>
            <head>
                <title>Example</title>
                <style type="text/css">
                    .special #text.NFD[1] { color: red; }
                </style>
            </head>
            <body>
                <p>normal text and a letter e with a colored grave accent:
                <span class="special">è</span>.</p>
            </body>
        </html>
    (I ignored the optional <!DOCTYPE> declaration in this example)

    A document like this respects the abstract character model of both Unicode
    and HTML, and it can safely be normalized with Unicode normalizers, and
    safely handled through XML parsers or generators.
    [This will work as long as the text does not need to be reencoded into a
    "defective" charset that lacks some diacritics or precomposed characters.
    When this conversion is lossy, the special style rule will not have the
    desired effect. However, one can argue that the author of such a styled
    document wants its accents or diacritics to be present and not lost in
    this transformation. A target charset that would not have an 'e-grave' in it
    would not be appropriate to represent the document in which the author
    wanted to emphasize the presence of the grave accent over a letter e. So
    this is probably not a limitation, and today Unicode offers several UTFs
    that will work as non-lossy charsets; who would want something else?]

    One thing to conclude: CSS is currently not able to specify such extended
    selectors. This is a place for extension of the CSS language, but this is
    not a problem of Unicode itself. And this requires work in other areas also
    out of Unicode itself:

    (1) The style language (CSS in our example) must be extended to allow such
    "smart" styles.

    (2) To have such selectors work effectively, the style renderer must be
    extended to work with plain-text renderers so that it will keep the
    contextual shaping and positioning information that will allow rendering
    separate substrings for the base character and the diacritics, each one with
    distinct styles or colors.

    (3) The plain-text renderers (that work with fonts) must extend their API to
    allow rendering separate fragments of texts with a "restartable" state
    between each text fragment.

    (4) The rich-text-format renderer (HTML in our example) must be made aware
    that its box model (used to place the various elements in a document's page)
    will now be under the control of the style renderer. There will be intimate
    collaboration here to allow computing the various bounding boxes, and with a
    "flow" layout, computing rows of items will become much more complex
    (imagine it when it must also apply text justification, or when it must
    autoadjust the column widths of tables...)

    (5) For general XML text-handling outside of any renderer consideration, the
    XML document generator or editor must have knowledge of the complete list of
    combining characters that may cause problems in ANY charset considered
    (including with charset reencoding).


