Re: Questions on ZWNBS - for line initial holam plus alef

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Aug 11 2003 - 20:06:32 EDT

Next message: Mark Davis: "Re: Questions on ZWNBS - for line initial holam plus alef"

Previous message: Chris Jacobs: "Re: Newbie Question - what are all those duplicated characters FOR?"
Maybe in reply to: Peter Kirk: "Re: Questions on ZWNBS - for line initial holam plus alef"
Next in thread: Philippe Verdy: "Re: Questions on ZWNBS - for line initial holam plus alef"
Reply: Philippe Verdy: "Re: Questions on ZWNBS - for line initial holam plus alef"

Philippe replied:

> From: "Kenneth Whistler" <kenw@sybase.com>
> > Of course a standard which mandates space folding is also
> > within its rights to mandate, for example, the non-use of
> > nonspacing marks applied to SPACE characters. It can simply
> > rule out such sequences as valid for its context, in which
> > case the problem goes away.
>
> Try to change now the XML or even the HTML or SGML
> standards!

I'm not trying to.

> The use of space folding was standardized and widely
> used long before Unicode published a workable standard.

For HTML and XML this was clearly not the case, since the
Unicode Standard was published before either of those
standards, and was used as the document character set for
both.

For SGML, I grant you the practice is older.

> So this
> is an unsolved problem of which Unicode is the only source!

It isn't an unsolved problem, as far as I can see. Look, HTML 4.01
defines ASCII space (U+0020), along with TAB, FF, and ZWSP,
as "white space characters". The sum total I can find that
it has to say about collapsing white space is:

"In particular, user agents should collapse input white
space sequences when producing output inter-word space."

*Even if* this were to be taken as a mandate to blindly
convert each <U+0020, U+0020> sequence into <U+0020>,
regardless of the presence of non-spacing marks in the
data, which I doubt is the intent of the standard, the
fix for that would be to simply apply the combining
mark to U+00A0 NBSP instead. U+00A0 NBSP is *not* specified
to be a "white space character" in HTML 4.01, and thus
seems not to fall under the recommendation regarding
collapsing white space sequences.
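
To make that concrete, here is a minimal Python sketch of the
"blind" collapsing described above. The function name and the
sample strings are illustrative assumptions, not anything HTML
4.01 defines; the character set is the one the specification
lists as white space (space, tab, form feed, ZWSP, plus CR and
LF), and holam (U+05B9) stands in for the nonspacing mark.

```python
import re

# White space characters per HTML 4.01: space, tab, form feed, ZWSP,
# plus the line-break characters CR and LF. U+00A0 NBSP is not in the set.
_WHITE_SPACE_RUN = re.compile("[\u0020\u0009\u000C\u200B\r\n]+")

HOLAM = "\u05B9"  # HEBREW POINT HOLAM, a nonspacing mark
ALEF = "\u05D0"   # HEBREW LETTER ALEF


def collapse_white_space(text: str) -> str:
    """Blindly replace every run of HTML white space with one U+0020."""
    return _WHITE_SPACE_RUN.sub("\u0020", text)


# An inter-word space followed by <SPACE, holam>: the two spaces are merged,
# so the mark ends up on the only space that is left.
with_space = "x" + "\u0020" + "\u0020" + HOLAM + ALEF

# The same text with the mark applied to NBSP instead: nothing to collapse.
with_nbsp = "x" + "\u0020" + "\u00A0" + HOLAM + ALEF

print(len(with_space), len(collapse_white_space(with_space)))
print(collapse_white_space(with_nbsp) == with_nbsp)
```

A collapser that checked for a following nonspacing mark before
merging would avoid the issue as well; the sketch only shows
the worst-case reading of the recommendation.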

> Now support of this space folding is a FULL MANDATORY part
> of the XML standard, and it is by far more important and more
> widely used than SPACE+diacritics sequences in plain-text Unicode.

XML 1.0, section 2.10:

"In editing XML documents, it is often convenient to use 'white
space' (spaces, tabs, and blank lines) to set apart the
markup for greater readability. Such white space is typically
not intended for inclusion in the delivered version of the
document. On the other hand, 'significant' white space that
should be preserved in the delivered version is common, for
example in poetry and source code.

An XML processor must always pass all characters in a document
that are not markup through to the application. ...

A special attribute named xml:space may be attached to an
element to signal an intention that in that element
white space should be preserved by applications. ..."

It is perfectly reasonable, as I see it, to consider the
<SPACE> in a <SPACE, NSM> sequence to be:
  a. significant
  b. part of the characters in a document that are not markup
     (at least in the cases we are talking about, since the
     problem is not about defining Nmtokens for markup in
     Biblical Hebrew, but rather the representation of the
     Biblical Hebrew document content itself)

So I *still* don't see the problem you are on about, and even
if there were one, the xml:space attribute could be used to
signal that white space in the element concerned is to be
preserved.
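
As a rough illustration, the following Python sketch uses the
standard library's xml.etree.ElementTree to check that character
data containing a <SPACE, NSM> sequence reaches the application
unchanged. The element name "verse" and the sample content are
assumptions made up for the example; only xml:space and its
"preserve" value come from XML 1.0.

```python
import xml.etree.ElementTree as ET

# Namespace that the predefined "xml" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

HOLAM = "\u05B9"  # HEBREW POINT HOLAM, a nonspacing mark
ALEF = "\u05D0"   # HEBREW LETTER ALEF

# alef, inter-word space, then <SPACE, holam> followed by alef
content = ALEF + "\u0020" + "\u0020" + HOLAM + ALEF

# "verse" is a hypothetical element name used only for this sketch.
doc = '<verse xml:space="preserve">' + content + "</verse>"
verse = ET.fromstring(doc)

# The parser hands the character data through as written; it does not
# fold the two spaces or drop the mark.
print(verse.text == content)

# The attribute is visible to the application, which can honor it.
print(verse.get("{%s}space" % XML_NS))
```

Whether anything downstream of the parser folds the spaces is
then a matter for the application, which is where xml:space is
meant to be consulted.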

> If Unicode members can't fix it, will the W3C need to create a
> formal request to you?

What are you going on about? W3C architects have been familiar
with this Unicode convention of applying NSMs to SPACE or NBSP as
a means of representing an isolated spacing diacritic for
years.

--Ken

This archive was generated by hypermail 2.1.5 : Mon Aug 11 2003 - 21:41:07 EDT