Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

From: Philippe Verdy
Date: Wed Aug 06 2003 - 06:54:35 EDT

    On Wednesday, August 06, 2003 1:59 AM, Curtis Clark wrote:

    > on 2003-08-05 15:31 Peter Kirk wrote:
    > > Thank you, Mark. This helps to clarify things, but still doesn't
    > > explicitly answer my question of how to encode "a sentence like "In
    > > this language the diacritic ^ may appear above the letters ...",
    > > but instead of ^ I want to use a combining character" and want to
    > > display exactly one space before the combining character - do I
    > > encode two spaces or one?
    > In this language the diacritic ̊ may appear above the letters...
    > Two spaces, at least in Thunderbird Mail.

    The compatibility (NFKD) decompositions of spacing marks are already
    defined as a SPACE plus a nonspacing combining character. This officially
    documents the use of SPACE as a base character in combining sequences.
    In the context of XML processing, where strings should (must?) be
    presented in NFC form, this extra SPACE will be invisible, hidden within the
    precomposed sequence, so this space does not have the line-breaking
    properties.
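The decomposition of spacing marks into SPACE plus a combining character can be inspected directly. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper name `compat_decomposition` and the sample list are illustrative assumptions, not part of any standard API. In the Unicode data bundled with Python these mappings are tagged as compatibility decompositions, so the sketch uses NFKD; plain NFD leaves these characters unchanged.

```python
import unicodedata

# A few spacing marks (illustrative sample, not an exhaustive list):
# DIAERESIS, MACRON, ACUTE ACCENT, CEDILLA.
SPACING_MARKS = ["\u00A8", "\u00AF", "\u00B4", "\u00B8"]


def compat_decomposition(ch):
    """Return the compatibility (NFKD) decomposition of one character."""
    return unicodedata.normalize("NFKD", ch)


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


for mark in SPACING_MARKS:
    print(f"{code_points(mark)} {unicodedata.name(mark)} -> "
          f"{code_points(compat_decomposition(mark))}")
```

Each line printed starts the decomposition with U+0020, which is the SPACE acting as the base character of the sequence.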

    Breaking properties apply only to combining sequences, not to isolated
    encoded characters. It's illegal to break in the middle of a combining
    sequence. So as soon as a SPACE is followed by a combining character,
    it loses its breaking properties, as those properties are only defined for
    the combining sequence containing only a SPACE. So I don't think there's
    any ambiguity: parsers and renderers must correctly identify combining
    sequences before applying any algorithm.
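The idea of identifying combining sequences first, and only then looking for break opportunities, might be sketched as follows. This is a deliberate simplification, not the full Unicode line-breaking or text-segmentation algorithm: it treats any character whose general category starts with "M" as a combining mark, and the function names `combining_sequences` and `breakable_spaces` are hypothetical.

```python
import unicodedata


def combining_sequences(text):
    """Split text into a list of base-plus-marks sequences (simplified)."""
    sequences = []
    for ch in text:
        if sequences and unicodedata.category(ch).startswith("M"):
            # A combining mark attaches to the preceding sequence.
            sequences[-1] += ch
        else:
            # Any other character starts a new sequence.
            sequences.append(ch)
    return sequences


def breakable_spaces(text):
    """Return indexes of sequences that consist of a lone SPACE.

    A SPACE carrying a combining mark is a different sequence and is
    not reported as a break opportunity.
    """
    return [i for i, seq in enumerate(combining_sequences(text)) if seq == " "]


sample = "x  \u030A y"  # x, SPACE, SPACE + COMBINING RING ABOVE, SPACE, y
print(combining_sequences(sample))
print(breakable_spaces(sample))
```

In the sample, the second SPACE is grouped with the combining ring, so only the first and the last SPACE remain candidates for a line break.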

    This means that an algorithm like normalization of whitespace sequences
    in XML or HTML should not include SPACEs that are used as base
    characters in a combining sequence, and so it should keep two spaces
    if the intent is to encode a logical space followed by a logical spacing
    diacritic. (This is not a problem for XML which processes strings in their
    NFC form).
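A whitespace normalizer that respects this rule could look roughly like the sketch below. It is an assumption-laden illustration rather than what any XML or HTML processor actually does: `collapse_whitespace` is a hypothetical name, whitespace is limited to space, tab, CR and LF, and a combining mark is approximated as any character whose general category starts with "M".

```python
import unicodedata

WHITESPACE = " \t\r\n"


def is_mark(ch):
    """Approximate test for a combining mark (general category M*)."""
    return unicodedata.category(ch).startswith("M")


def collapse_whitespace(text):
    """Collapse whitespace runs to one SPACE, keeping base-character spaces.

    A whitespace character immediately followed by a combining mark is
    treated as the base of a combining sequence and copied unchanged.
    """
    out = []
    prev_was_plain_space = False
    for i, ch in enumerate(text):
        is_ws = ch in WHITESPACE
        is_base = is_ws and i + 1 < len(text) and is_mark(text[i + 1])
        if is_ws and not is_base:
            # Ordinary whitespace: emit a single SPACE per run.
            if not prev_was_plain_space:
                out.append(" ")
            prev_was_plain_space = True
        else:
            # Non-whitespace, or a SPACE serving as a base character.
            out.append(ch)
            prev_was_plain_space = False
    return "".join(out)


sample = "the diacritic   \u030A may appear"  # three spaces before the ring
print(repr(collapse_whitespace(sample)))
```

With three spaces before the combining ring, the first two collapse into one logical space and the third survives as the base of the diacritic, leaving the two spaces the text above calls for.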

    Spam is not tolerated: any unsolicited message will be
    reported to your Internet service providers.

    This archive was generated by hypermail 2.1.5 : Wed Aug 06 2003 - 07:30:16 EDT