Re: Combining across markup?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 12 2004 - 15:53:12 CDT

Next message: Markus Scherer: "Re: Wide Characters in Windows and UTF16"

Previous message: Philippe Verdy: "Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)"
In reply to: Doug Ewell: "Re: Combining across markup?"
Next in thread: saqqara: "Re: Combining across markup?"

From: "Doug Ewell" <dewell@adelphia.net>
> The suggestion to add a "mark-color" capability to CSS might handle a
> majority of the realistic situations where color is really understood to
> be part of the textual content. Peter's two combining marks, a black
> one in the actual manuscript and a red one added by the editor, sounds
> less like a problem that Unicode or W3C need to worry about.

Note: this message is quite long, sorry. It lays out several ideas for
solving the problem of marking up or styling combining characters (below the
level of the Unicode character model).

This is probably out of scope of Unicode itself, but the need to encode
isolated diacritics in XML implies the need to be able to encode defective
combining sequences in text elements or attribute values.

Unfortunately, in XML, the only way to encode a defective combining sequence
without breaking the XML syntax when the document is normalized is to write
the combining characters that start it as (numeric or named) character
references such as &#x300; or &acute;, both within text elements and within
attribute values. That way they cannot combine with the preceding quote mark
(which opens an attribute value) or with the preceding closing angle bracket
(which terminates the element's start tag that necessarily comes before any
text, since text must be part of an element's content in a well-formed XML
document).
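
A minimal Python sketch of this escaping step, assuming that "combining
character" can be approximated by the Unicode general categories Mn, Mc and
Me, and that the helper name escape_leading_marks is purely illustrative:

```python
import unicodedata
from xml.sax.saxutils import escape


def escape_leading_marks(text: str) -> str:
    """Serialize `text` for XML, writing any combining marks that start it
    as numeric character references so they cannot attach to the
    preceding '>' or quote mark."""
    i = 0
    # Assumption: general category M* (Mn, Mc, Me) identifies combining marks.
    while i < len(text) and unicodedata.category(text[i]).startswith("M"):
        i += 1
    leading = "".join(f"&#x{ord(ch):X};" for ch in text[:i])
    # The rest of the string only needs the usual &, <, > escaping.
    return leading + escape(text[i:])


print(escape_leading_marks("\u0300"))  # &#x300;
```

Marks that follow a base character are left alone, since they already form
a complete (non-defective) combining sequence.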

One drawback of this approach is that XML documents are subject to
transformations (through DOM, SAX, or similar APIs that can generate new
documents or fragments), so not all valid plain-text sequences can be
encoded safely in the same way.

However:

- Unicode also allows defective sequences to be delimited after control
characters, such as end-of-line control characters.
- well-formed XML documents have a limited set of control characters that
can be inserted in the encoded XML syntax: CR, LF, TAB, NL... These control
characters are considered "whitespaces" in XML and subject to an optional
white-space normalization within text elements (but not in attribute
values...)
- the solution would then be to force the insertion of such a control
character at the start of the plain-text-encoding of the XML attribute
value, or at the start of the plain-text encoding of a text element (within
the content of another element);
- but then, technically, this control character becomes part of the text
element content (unless there's a xml:whitespace specifier that indicates to
the document parser that this character must be ignored as it is blank), or
of the attribute value.

So what can we do to allow encoding defective combining sequences in XML?
- for attribute values, there's currently nothing we can do to avoid making
this control character part of the actual value; this is a limitation of the
XML syntax itself.
- for text elements, we could have the container element specify that the
leading control is not a whitespace but only necessary to make the document
still well-formed after normalization. This could be a sort of
xml:controldefective attribute added to the parent element, and that clearly
indicates to the parser that the leading control must be removed from the
effective text element content.

All these seem to indicate that XML document generators can safely encode a
document containing defective combining sequences, provided that they know
that these sequences will be defective. This requires that XML document
generators be able to detect them when they are leading text elements or
attribute values.
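
As a rough sketch of such a detection pass, assuming Python's
xml.etree.ElementTree as the document model and, again, general category M*
as the test for a combining character (the function names are hypothetical):

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_defective(value):
    """True if `value` is a non-empty string starting with a combining mark."""
    return bool(value) and unicodedata.category(value[0]).startswith("M")


def find_defective(root):
    """List (tag, location) pairs whose text, tail or attribute value
    starts with a combining mark."""
    hits = []
    for elem in root.iter():
        if is_defective(elem.text):
            hits.append((elem.tag, "text"))
        if is_defective(elem.tail):
            hits.append((elem.tag, "tail"))
        for name, value in elem.attrib.items():
            if is_defective(value):
                hits.append((elem.tag, "@" + name))
    return hits


doc = ET.fromstring(
    '<document>e<text style="color:red;">&#x300;</text>.'
    '<diacritic value="&#x300;"/></document>'
)
print(find_defective(doc))
```

A generator would run a check like this before serialization and emit
character references at each reported location.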

Another problem comes with elements whose content is marked to be
normalizable (in the schema definition of the container element or
explicitly with container elements that specify a whitespace normalization
in their xml:whitespace attribute): the XML whitespace normalization must
not strip this leading control or whitespace, but must still normalize the
whitespaces in the rest of the encoded text string.

Now comes the problem of creating documents with the markup necessary to
give specific styles or colors to diacritics. A first natural approach is to
surround the encoded defective combining sequence as the content of a
styling XML element. If the XML document generator does not know that the
combining sequence is defective, many problems will occur.

The consequence is that XML document generators must be able to detect
combining characters, and thus include at least a vector of known combining
characters, that must be encoded with character entities, and not as plain
text in the normal case (because they would behave badly through Unicode
normalization of documents.)

(Note that <![CDATA[...]]> sections will not help here, because the
defective sequence will appear just after the second '[' with which it will
combine in Unicode, possibly creating a combining sequence that breaks the
XML syntax if a Unicode normalization is applied to the document!).

If XML document generators (or editors...) are made aware of this problem,
then they will safely encode things like:

    <?xml version="1.0"?>
    <document>
       normal text and a letter e with a colored grave accent:
       e<text style="color:red;">&#x300;</text>.
    </document>

or (less natural because the document content without markup excludes now
the diacritic):

    <?xml version="1.0"?>
    <document>
       normal text and a letter e with a colored grave accent:
       e<diacritic style="color:red;" value="&#x300;"/>.
    </document>

The key issue solved here is that &#x300; MUST NOT be replaced by its
plain-text equivalent, or it will create a non-defective combining sequence
that spans the closing '>' (first example) or the leading double-quote
(second example). Such a sequence could be transformed by Unicode
normalization (or by reencoding to a charset other than a Unicode UTF)
applied to the whole document, which could then break the well-formedness of
the document's XML syntax.
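
The breakage can be reproduced with a small Python sketch. It assumes
U+0338 (COMBINING LONG SOLIDUS OVERLAY) rather than U+0300, because U+0338
is a mark that does have a canonical composition with '>' (giving U+226F);
the element name and the helper survives_nfc are illustrative only:

```python
import unicodedata
import xml.etree.ElementTree as ET

# Same content twice: once with a character reference, once with the raw mark.
SAFE = "<m>&#x338;</m>"
UNSAFE = "<m>\u0338</m>"


def survives_nfc(xml_text: str) -> bool:
    """Normalize the serialized document to NFC and report whether it
    is still well-formed XML afterwards."""
    normalized = unicodedata.normalize("NFC", xml_text)
    try:
        ET.fromstring(normalized)
        return True
    except ET.ParseError:
        return False


print(survives_nfc(SAFE), survives_nfc(UNSAFE))
```

In the unsafe form, NFC composes the '>' that closes the start tag with the
following mark, so the tag is no longer terminated.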

So far, we have solved the first problem: being able to represent XML data
containing defective combining sequences. But the bad thing is that we have
broken the text into separate text entities.

This is where document authors are left with many ambiguities about how to
"join" visually these rendered entities, as there is now no place to
correctly position the separate diacritic with the last letter of the
previous text element.

Without knowledge or explicit specification of the precise font {face,
size, style, weight} used to render the document, styling the diacritic with
additional positioning becomes impossible. This task should then be
performed by the document's renderer once it has this information, but a
classic XML renderer (including HTML renderers in browsers) will behave very
poorly here, because it renders each substring with a separate invocation of
the plain-text renderer. An HTML renderer normally computes a bounding box
for each text fragment and positions these boxes side by side on rows, but
the bounding box of the diacritic should not follow this convention: this
breaks the box model of HTML (or of similar rich-text formats, including RTF
or Word documents, or even PDF files).

So let's suppose that we won't attempt to style diacritics separately, so
that we will not need to encode defective combining sequences. The styling
information for the diacritic must then be coded outside of the combining
sequence, and not mixed within it:

    <?xml version="1.0"?>
    <document>
       normal text and a letter e with a colored grave accent:
       <span class="special">e&#x300;</span>.
    </document>

Styling is then applied to the combining sequence as a whole, which
preserves the Unicode character model. But this requires more capabilities
from the style language, in order to give separate styles to substrings of
the same text element. CSS cannot do this today: it would need a syntax that
selects only the second character (at index 1) of a text element. This
becomes tricky once the document itself is normalized, because the selector
must pick items within combining sequences, something that CSS is not
prepared to do naturally, notably because the main document to style could
have been transformed by a Unicode normalizer or a charset reencoder:

    <?xml version="1.0"?>
    <document>
       normal text and a letter e with a colored grave accent:
       <span class="special">è</span>.
    </document>

Note that Unicode normalization or charset reencoding generates an XML
document that is not technically equivalent to the previous version of the
document: XML considers a decomposed "e&#x300;" distinct from a precomposed
"è".
In the example above, there's now a single precomposed character 'è', and
the CSS styler would need to understand that it should apply styling to the
accent coded within that single character. A selector based on a plain
character index will not work reliably.
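
The difference is easy to check in Python; this is only a sketch, and
second_code_point is a hypothetical stand-in for what an index-based
selector would see:

```python
import unicodedata

decomposed = "e\u0300"                                   # 'e' + combining grave
precomposed = unicodedata.normalize("NFC", decomposed)   # single U+00E8


def second_code_point(text):
    """Return the code point at index 1, or None if there is none."""
    return text[1] if len(text) > 1 else None


print(len(decomposed), len(precomposed))
print(second_code_point(decomposed) == "\u0300")
print(second_code_point(precomposed) is None)
```

The two strings render identically but are different sequences of code
points, so "index 1" exists in one and not in the other.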

So we need something even smarter in the CSS style language, to select text
items that are coded below the Unicode character level and below the
rich-text-format character model! Here we must give to CSS the capability of
specifying the normalization form to apply before selecting text items, or
give to CSS some additional meta-selectors that allow selecting items like
"diacritics-only".

The meta-selector approach seems tricky, as there would be an unbounded
number of candidates. Specifying the normalization form to apply seems much
simpler, and it gives something workable, as in this example XML/HTML
document:

    <?xml version="1.0"?>
    <html>
        <head>
            <title>Example</title>
            <style type="text/css">
                .special #text.NFD[1] { color: red; }
            </style>
        </head>
        <body>
            <p>normal text and a letter e with a colored grave accent:
            <span class="special">è</span>.</p>
        </body>
    </html>
(I ignored the optional <!DOCTYPE> declaration in this example)

The document like this respects the abstract character model of both Unicode
and HTML, and it can safely be normalized with Unicode normalizers, and
safely handled through XML parsers or generators.
[This will work as long as the text does not need to be reencoded into a
"defective" charset that lacks some diacritics or precomposed characters.
When this conversion is lossy, the special style rule will not have the
desired effect. However, one can argue that the author of such a styled
document wants its accents or diacritics to be present and not lost in this
transformation. A target charset that does not have an 'e-grave' in it would
not be appropriate to represent a document in which the author wanted to
emphasize the presence of the grave accent over a letter e. So this is
probably not a limitation, and today Unicode offers several UTFs that will
work as non-lossy charsets; who would want something else?]
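
What a rule like ".special #text.NFD[1]" would have to compute can be
sketched in Python. The selector syntax is only the proposal made in this
message, and select_nfd_index is a hypothetical helper, not an existing CSS
or library feature:

```python
import unicodedata


def select_nfd_index(text, index):
    """Normalize `text` to NFD, then split it around the code point at
    `index`. Returns (before, selected, after), or None if out of range."""
    nfd = unicodedata.normalize("NFD", text)
    if not 0 <= index < len(nfd):
        return None
    return nfd[:index], nfd[index], nfd[index + 1:]


# The same answer comes back whichever form the document was stored in.
print(select_nfd_index("\u00e8", 1) == select_nfd_index("e\u0300", 1))
```

The renderer would then apply the style to the selected part only, while
still shaping and positioning the three parts as one combining sequence.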

One thing to conclude: CSS is currently not able to specify such extended
selectors. This is a place for extension of the CSS language, but this is
not a problem of Unicode itself. And this requires work in other areas also
out of Unicode itself:

(1) The style language (CSS in our example) must be extended to allow such
"smart" styles.

(2) To have such selectors work effectively, the style renderer must be
extended to work with plain-text renderers so that it will keep the
contextual shaping and positioning information that will allow rendering
separate substrings for the base character and the diacritics, each one with
distinct styles or colors.

(3) The plain-text renderers (that work with fonts) must extend their API to
allow rendering separate fragments of texts with a "restartable" state
between each text fragment.

(4) The rich-text-format renderer (HTML in our example) must be made aware
that its box model (used to place the various elements in a document's page)
will be now under control of the style renderer. There will be intimate
collaboration here to allow computing the various bounding boxes, and with a
"flow" layout, computing rows of items will become much more complex
(imagine it when it must also apply text justification, or when it must
autoadjust the column widths of tables...)

(5) For general XML text-handling outside of any renderer consideration, the
XML document generator or editor must have knowledge of the complete list of
combining characters that may cause problems in ANY charset considered
(including with charset reencoding).


This archive was generated by hypermail 2.1.5 : Thu Aug 12 2004 - 15:55:26 CDT