Re: Questions on ZWNBS - for line initial holam plus alef

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Aug 11 2003 - 20:06:32 EDT


    Philippe replied:

    > From: "Kenneth Whistler" <kenw@sybase.com>
    > > Of course a standard which mandates space folding is also
    > > within its rights to mandate, for example, the non-use of
    > > nonspacing marks applied to SPACE characters. It can simply
    > > rule out such sequences as valid for its context, in which
    > > case the problem goes away.
    >
    > Try to change now the XML or even the HTML or SGML
    > standards!

    I'm not trying to.

    > The use of space folding was standardized and widely
    > used long before Unicode published a workable standard.

    For HTML and XML this was clearly not the case, since the
    Unicode Standard was published before either of those
    standards, and was used as the document character set for
    both.

    For SGML, I grant you the practice is older.

    > So this
    > is a unsolved problem whose Unicode is the only source!

    It isn't an unsolved problem, as far as I can see. Look, HTML 4.01
    defines ASCII space (U+0020), along with TAB, FF, and ZWSP,
    as "white space characters". The sum total I can find that
    it has to say about collapsing white space is:

    "In particular, user agents should collapse input white
    space sequences when producing output inter-word space."

    *Even if* this were to be taken as a mandate to blindly
    convert each <U+0020, U+0020> sequence into <U+0020>,
    regardless of the presence of non-spacing marks in the
    data, which I doubt is the intent of the standard, the
    fix for that would be to simply apply the combining
    mark to U+00A0 NBSP instead. U+00A0 NBSP is *not* specified
    to be a "white space character" in HTML 4.01, and thus
    seems not to fall under the recommendation regarding
    collapsing white space sequences.
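
    To make the point above concrete, here is a minimal sketch of the
    "blind" reading of the collapsing rule. The function name
    collapse_white_space and the sample strings are illustrative
    assumptions, not anything defined by HTML; the character class
    holds only the four white space characters listed above. It shows
    that a run of U+0020 is folded to a single space, so the inter-word
    space and the base for the mark merge, while a mark carried on
    U+00A0 NBSP is left alone.

```python
import re

# The four characters named above as HTML "white space characters":
# SPACE, TAB, FORM FEED, ZERO WIDTH SPACE. NBSP (U+00A0) is deliberately absent.
WHITE_SPACE_RUN = re.compile("[\u0020\u0009\u000C\u200B]+")

HOLAM = "\u05B9"  # HEBREW POINT HOLAM, a nonspacing mark
ALEF = "\u05D0"   # HEBREW LETTER ALEF
NBSP = "\u00A0"   # NO-BREAK SPACE


def collapse_white_space(text):
    """Hypothetical naive collapse: replace each run of white space by one SPACE."""
    return WHITE_SPACE_RUN.sub(" ", text)


# Inter-word SPACE followed by <SPACE, HOLAM>: the two spaces form one run.
on_space = "word" + " " + " " + HOLAM + ALEF
# Inter-word SPACE followed by <NBSP, HOLAM>: NBSP is not in the run.
on_nbsp = "word" + " " + NBSP + HOLAM + ALEF

collapsed_space = collapse_white_space(on_space)
collapsed_nbsp = collapse_white_space(on_nbsp)

# ascii() makes the invisible characters readable as escapes.
print(ascii(collapsed_space))
print(ascii(collapsed_nbsp))
```

    Under this naive reading the first string loses one of its two
    spaces, leaving the holam on the only remaining space; the second
    string comes back unchanged.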

    > Now support of this space folding is a FULL MANDATORY part
    > of the XML standard, and it is by far more important and more
    > widely used than SPACE+diacritics sequences in plain-text Unicode.

    XML 1.0, section 2.10:

    "In editing XML documents, it is often convenient to use 'white
    space' (spaces, tabs, and blank lines) to set apart the
    markup for greater readability. Such white space is typically
    not intended for inclusion in the delivered version of the
    document. On the other hand, 'significant' white space that
    should be preserved in the delivered version is common, for
    example in poetry and source code.

    An XML processor must always pass all characters in a document
    that are not markup through to the application. ...

    A special attribute named xml:space may be attached to an
    element to signal an intention that in that element
    white space should be preserved by applications. ..."

    It is perfectly reasonable, as I see it, to consider the
    <SPACE> in a <SPACE, NSM> sequence to be:
      a. significant
      b. part of the characters in a document that are not markup
         (at least in the cases we are talking about, since the
         problem is not about defining Nmtokens for markup in
         Biblical Hebrew, but rather the representation of the
         Biblical Hebrew document content itself)
         
    So I *still* don't see the problem you are on about, and even
    if there were one, the xml:space attribute could be used to
    require preservation of a particular space.
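
    As a rough illustration of the two points above, the sketch below
    parses a small document with Python's standard
    xml.etree.ElementTree module. The element name "verse" and the
    sample content are made up for the example. It only shows what this
    one parser hands to the application; it says nothing about what a
    later application does with the text.

```python
import xml.etree.ElementTree as ET

HOLAM = "\u05B9"  # HEBREW POINT HOLAM, a nonspacing mark
ALEF = "\u05D0"   # HEBREW LETTER ALEF

# Namespace URI bound to the predeclared "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Character data containing an inter-word SPACE followed by <SPACE, HOLAM>.
# "verse" is a hypothetical element name used only for this sketch.
content = "word" + " " + " " + HOLAM + ALEF
document = '<verse xml:space="preserve">' + content + "</verse>"

root = ET.fromstring(document)

# The parser passes the non-markup characters through untouched,
# including both spaces and the nonspacing mark.
parsed_text = root.text

# The xml:space attribute is visible to the application, which can
# use it as the signal to preserve white space in this element.
space_mode = root.get("{" + XML_NS + "}space")

print(ascii(parsed_text))
print(space_mode)
```

    Here both spaces and the holam reach the application as written,
    and the xml:space value is available for the application to honor.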

    > If Unicode members can't fix it, will the W3C need to create a
    > formal request to you?

    What are you going on about? W3C architects have been familiar
    with this Unicode convention of applying NSMs to SPACE or NBSP as
    a means of representing an isolated spacing diacritic for
    years.

    --Ken


