Re: Backslash n [OT] was Line Separator and Paragraph Separator

From: John Cowan (cowan@mercury.ccil.org)
Date: Wed Oct 22 2003 - 07:08:48 CST


Kent Karlsson scripsit:
>
>
> John Cowan wrote:
> > XML 1.1 will treat CR, LF, NEL, <CR, LF>, <CR, NEL>, and LS as line
> > terminators and report them all as LF. PS is left alone, because of
> > the bare possibility that it is being used as quasi-markup.
>
> I'm not sure why <CR, NEL> should be seen as a single line end.

The IBM people, who are authoritative about their own mainframes, asked
for it. It primarily arises out of semi-broken conversion programs
that map LF to NEL but fail to remove a preceding CR. Since all line
terminators are inherently a matter of legacy (i.e. de facto) practice,
we accepted it.

> And I think PS should be seen as a line end for XML too.
> It, like LS, can be used to format the XML source, but should not
> be interpreted as other than line end when parsing the XML source.

We are not here concerned, as the UAX is, with when to stop reading
characters in a read-line routine. We are concerned with which
distinctions to hide in the name of simplicity. Our predecessors
considered that the differences between CR, <CR, LF>, and LF were
non-semantic, and somewhat arbitrarily chose LF as the character to be
passed to applications. We decided that <CR, NEL>, NEL, and LS had
this same semantic. But PS and FF and VT have their own semantics,
and we did not consider it justifiable to make it impossible for XML
applications to receive and process them.
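
As a rough illustration of which distinctions get hidden, here is a minimal
Python sketch of the mapping described above. It is not the normative text of
the XML 1.1 spec, and the name normalize_line_ends is hypothetical, chosen
only for this example. It reports CR, <CR, LF>, <CR, NEL>, NEL, and LS as LF,
and passes PS, FF, and VT through untouched.

```python
import re

# Terminators reported as LF, per the list given above.
# The two-character sequences are listed first so that <CR, LF> and
# <CR, NEL> each collapse to ONE LF instead of two.
_LINE_END = re.compile("\r\n|\r\u0085|\r|\u0085|\u2028")


def normalize_line_ends(text):
    """Replace every recognized line terminator with a single LF.

    PS (U+2029), FF (U+000C), and VT (U+000B) are deliberately not in
    the pattern, so the application still receives them as-is.
    """
    return _LINE_END.sub("\n", text)


if __name__ == "__main__":
    sample = "a\r\nb\r\u0085c\rd\u0085e\u2028f\u2029g"
    print(repr(normalize_line_ends(sample)))
```

The order of the alternatives matters: with a bare CR listed first, <CR, LF>
would come out as two line ends rather than one.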

> E.g., PS is not a begin-end markup, which all other XML markup is;
> nor do I know of a way of attaching "style" to a PS, as can be done
> for <p></p> etc.

PS is strictly analogous to an XML empty-element tag without attributes.
While it is traditional in SGML/XML to use container elements for
paragraphs, there is no necessity to do so.

> Following (ex-) UAX 14 fully, FF and VT should be seen as line
> separators too. Though they are unlikely in XML source files.
> FF shouldn't be interpreted as generating a page break in the
> "styled output" of an XML file, should it?

It should be interpreted however the application chooses to interpret it.
Arbitrarily turning it into a LF makes it impossible for the application
to interpret it at all.

> > I can't imagine why EOF should be called a line terminator, except
> > in the sense that a "read a line" operation should obviously
> > not attempt to read past EOF.
>
> There have been Unix programs that (mistakenly, I'd say) *discarded*
> the last (possibly partial) line of input, just because it had no LF at
> its end... And LS is a separator, not a terminator, so EOF has to be a
> line terminator.

It would be a corruption of the input to infer a LF at the end of a
document.
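
To make the point concrete, here is a small Python sketch (the function name
split_lines is hypothetical) of a line splitter that neither discards a final
unterminated line nor invents a LF for it. It assumes the input has already
been normalized so that LF is the only line terminator.

```python
def split_lines(text):
    """Split on LF, keeping each terminator attached to its line.

    A final line with no LF is returned as-is: it is not dropped,
    and no LF is inferred for it.
    """
    pieces = text.split("\n")
    # Every piece except the last was followed by an LF in the input.
    lines = [piece + "\n" for piece in pieces[:-1]]
    if pieces[-1]:
        lines.append(pieces[-1])  # partial last line, kept unchanged
    return lines


if __name__ == "__main__":
    doc = "first\nsecond\nlast, unterminated"
    print(split_lines(doc))
    # Joining the pieces gives back exactly what was read.
    print("".join(split_lines(doc)) == doc)
```

Because nothing is added or removed, concatenating the returned lines always
reproduces the input.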

-- 
First known example of political correctness:   John Cowan
"After Nurhachi had united all the other        http://www.reutershealth.com
Jurchen tribes under the leadership of the      http://www.ccil.org/~cowan
Manchus, his successor Abahai (1592-1643)       jcowan@reutershealth.com
issued an order that the name Jurchen should       --S. Robert Ramsey,
be banned, and from then on, they were all         _The Languages of China_
to be called Manchus."

