Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

From: Philippe Verdy
Date: Wed Aug 06 2003 - 09:32:37 EDT

    On Wednesday, August 06, 2003 12:36 PM, Kent Karlsson wrote:

    > > The NFD decompositions of spacing marks are already defined as a SPACE
    > > plus a non-spacing combining character.
    > Philippe, please! Those are *compatibility* decompositions. The
    > normal form NFD only uses *canonical* decompositions. And there is no
    > such thing as "NFD decompositions".

    Sorry for the confusion. Still, even though these are compatibility (NFKD)
    decompositions, it is clear that they already define combining sequences
    with SPACE as the base character. The important point is that SPACE is
    already used as a holder for combining marks, and Unicode processing
    should never break in the middle of a combining sequence, even when the
    base character is a SPACE.
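
    The difference between the canonical and compatibility mappings can be
    checked directly. The following is only a minimal sketch, assuming
    Python's standard unicodedata module (my choice for illustration, not
    something discussed in this thread); the helper name describe and the
    sample list SPACING_MARKS are hypothetical.

```python
import unicodedata

# A few spacing diacritics from Latin-1: DIAERESIS, MACRON, ACUTE ACCENT, CEDILLA.
SPACING_MARKS = ["\u00A8", "\u00AF", "\u00B4", "\u00B8"]

def describe(ch):
    """Report how one character behaves under NFD and NFKD."""
    nfd = unicodedata.normalize("NFD", ch)
    nfkd = unicodedata.normalize("NFKD", ch)
    return {
        "name": unicodedata.name(ch),
        # NFD applies canonical mappings only, so a <compat> mapping is ignored.
        "nfd_unchanged": nfd == ch,
        # NFKD also applies compatibility mappings.
        "nfkd": [f"U+{ord(c):04X}" for c in nfkd],
        # Raw decomposition field from the UCD, including any <compat> tag.
        "mapping": unicodedata.decomposition(ch),
    }

for ch in SPACING_MARKS:
    print(describe(ch))
```

    For each of these characters the NFKD result starts with U+0020 SPACE
    followed by a combining mark, while NFD leaves the character alone.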

    It's true that not all (only most) non-spacing combining characters have a
    spacing, non-combining counterpart. But where one exists, the
    decomposition given in the UCD is already an indication that the SPACE
    character should be preserved, and not considered a break opportunity, if
    it is followed by a combining character. This is not extremely clear in
    the specification of the break properties, where sequences of spaces are
    often unified, but there are already some rules that make it clear: a
    SPACE is a word separator only if it is not used in a combining sequence,
    and break opportunities are computed between grapheme clusters, so they
    cannot split a combining sequence.
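
    The rule that a SPACE separates words only when it is not the base of a
    combining sequence can be sketched as follows. This is a deliberately
    simplified illustration in Python, not the UAX #29 segmentation
    algorithm; the function names is_mark and split_words are hypothetical.

```python
import unicodedata

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def split_words(text):
    """Split on SPACE, except when the SPACE is followed by a combining mark.

    In that case the SPACE is kept as the base of the combining sequence.
    """
    words, current = [], ""
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == " " and not (nxt and is_mark(nxt)):
            # Ordinary separating space: close the current word, if any.
            if current:
                words.append(current)
            current = ""
        else:
            # Either a non-space character, or a SPACE acting as a base.
            current += ch
    if current:
        words.append(current)
    return words

# First space separates, second space carries U+0301 COMBINING ACUTE ACCENT.
print(split_words("a  \u0301 b"))
```

    Here the isolated acute accent stays attached to its base SPACE as one
    unit between the two letters.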

    OK, there is a problem with HTML, where runs of whitespace are
    normalized to a single space. This effectively breaks the text when a
    combining character is used after two spaces, the first one being a
    word separator or indenting space and the second being the base of the
    combining sequence. For now, most text can be created using spacing
    diacritics instead of combining sequences starting with SPACE, and this
    will work in HTML.

    For those diacritics which do not have a spacing counterpart already
    defined, there remains a problem which can only be solved using a
    separating format control between the first (separating)
    space and the second (base) space. I think this could be a ZWSP
    like this:

        <SPACE><ZWSP><SPACE><combining mark>

    (provided that the whitespace normalization algorithm does not
    include <ZWSP> in the whitespace sequence but treats it
    separately. A conforming HTML or XML processor should not include
    it, as it should unify only sequences of <SPACE>, <TAB>, <CR>,
    <LF>, and only according to the whitespace properties of the
    containing element, which control the normalization of XML
    whitespace sequences: leading, trailing, line break preservation,
    tabulator...)
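
    The effect of such a normalization can be simulated. The sketch below
    assumes a processor that collapses only runs of SPACE, TAB, CR and LF,
    as described above; it uses a plain Python regular expression and is not
    the code of any real HTML or XML processor. The names XML_WS and
    collapse are hypothetical.

```python
import re

# Only SPACE, TAB, CR and LF count as collapsible whitespace in this sketch.
XML_WS = re.compile(r"[ \t\r\n]+")

ZWSP = "\u200B"   # ZERO WIDTH SPACE
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

def collapse(text):
    """Replace each run of collapsible whitespace with a single SPACE."""
    return XML_WS.sub(" ", text)

# Separating space directly followed by the base space: the run is merged,
# so the one remaining SPACE is both word separator and base character.
plain = "word" + "  " + ACUTE

# ZWSP between the two spaces: each space is a run of length one,
# so nothing is merged and the base SPACE survives.
guarded = "word" + " " + ZWSP + " " + ACUTE

print(len(plain), len(collapse(plain)))
print(len(guarded), len(collapse(guarded)))
```

    Under that assumption the guarded sequence passes through unchanged,
    while the plain one loses one of its two spaces.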

    I did not verify this completely for XSLT, but it should be true
    there too for this kind of processing (assuming that ZWSP is not
    treated as part of whitespace sequences).

