Re: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg)

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Wed May 28 1997 - 18:22:14 EDT


> It is precisely this fatal one-to-one markup/application
> relationship that SGML is targeted at. SGML is a very different
> beast and it is a mistake to throw it in with the rest. Claiming
> that SGML is just another transient markup language that doesn't
> address document portability ...
>
I don't think anybody did that. But this does not mean SGML can
be used for everything.

> Unlike other markup languages, SGML makes no assumptions about
> the processing application.
>
Except that it can parse SGML. I'm not arguing against SGML --
quite the opposite: I'm heavily in favor of (almost) anything that
has survived the international standards process AND sees use in
the real world, as opposed to schemes that companies make up and
unilaterally proclaim to be standards.

But SGML is meant for marking up text for later formatting to fit
the requirements of some output device or application that
understands this kind of markup. That is distinct from plain text
as we have known it since the 1960s, in which a repertoire of
graphic characters is mixed with a small number of control codes
(call them markup if you wish) for simple actions like line breaks
and so on, in order to achieve the *final* result, not
(necessarily) to be input for a higher-level reformatter.

> I would propose that "complete" be defined as a minimal set of
> markup codes necessary to make a document understandable by a
> human without resorting to anything outside the Unicode standard.
> Machine processing, beyond doing the Right Thing with whitespace,
> should not be a criterion. Except for directional control, most of
> the necessary markup should be covered by addressing
> compatibility with ASCII, although clarification would be
> helpful.
>
Right. Something like the following (ignoring BIDI for the moment):

 . LS is a hard line break. The next graphic character appears
   at the left margin of the following line. Equivalent to CR and
   LF on a Teletype.

 . Two LSs result in a blank line.

 . Three LSs result in two blank lines, and so on.

 . PS is a hard paragraph break (more about this below).

 . <FS> (form separator), whatever its instantiation (a new
   Unicode character, or ASCII Formfeed with a well-defined use in
   Unicode), starts a new page. The next graphic character
   appears on the top line, leftmost position of the new page.

 . Two FSs result in a blank page, and so on.

Plus whatever is needed for specifying writing direction,
including expanding on what is meant by "left", "top", etc, in the
preceding items.
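
A minimal sketch of the rules listed above, in Python, ignoring
BIDI. The function name render_pages is hypothetical, ASCII
Formfeed (U+000C) is assumed as the instantiation of <FS>, and PS
is treated the same as LS purely for simplicity:

```python
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR
FS = "\u000c"  # assumption: ASCII Formfeed stands in for <FS>


def render_pages(text):
    """Return a list of pages; each page is a list of display lines.

    Every LS starts a new line, so two in a row leave one blank
    line between them. Every FS starts a new page, so two in a row
    leave one blank page.
    """
    pages = []
    for page in text.split(FS):
        # Assumption: PS is rendered as a plain hard line break.
        pages.append(page.replace(PS, LS).split(LS))
    return pages
```

Under these assumptions, "a" + LS + LS + "b" yields one page with
the lines "a", "" and "b", and "a" + FS + FS + "b" yields three
pages, the middle one blank.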

That should do it. Personally, I find text to be most portable
when it is displayed in fixed-width font, and spaces are used to
line things up, rather than tabs (because tabs require external
agreement about the tab settings). I don't think Vertical Tab or
other obscure formatting controls (such as Line Feed taken
literally) are of any use; in my experience they have always been
treated as "synonyms" for the controls listed above.

Then what to do about ASCII controls in Unicode text? I'd say
that since ASCII (and Latin-x, etc.) must be converted to Unicode
anyway, it is the responsibility of the conversion agent to
understand the local conventions for line breaks (etc.) in the
source text, and to convert them to the well-defined Unicode
controls.
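
A rough sketch of such a conversion agent, in Python. The function
name to_unicode_text and its parameters are hypothetical; the
caller is assumed to know both the source encoding and the local
line-break convention, since neither can be reliably guessed from
the bytes alone:

```python
LS = "\u2028"  # LINE SEPARATOR

# Line-break conventions this sketch knows about.
KNOWN_NEWLINES = ("\r\n", "\r", "\n")


def to_unicode_text(raw, encoding, newline):
    """Decode raw bytes and replace the local line break with LS.

    raw      -- source bytes (e.g. ASCII or Latin-1)
    encoding -- name of the source encoding, supplied by the caller
    newline  -- the source's line-break convention, also supplied
    """
    if newline not in KNOWN_NEWLINES:
        raise ValueError("unknown line-break convention: %r" % newline)
    return raw.decode(encoding).replace(newline, LS)
```

For example, a Latin-1 file with CRLF line breaks would be
converted with to_unicode_text(data, "latin-1", "\r\n"). Other
controls (Formfeed and so on) are left untouched here.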

About Paragraph Separator... It seems to me that this one was
designed with the "export from word processor" type of file in
mind (those files we were discussing earlier in which each
paragraph is a long line, terminated by a "paragraph separator"
such as CR). I would not call this type of file plain text --
I would call it "input for a text formatter"; it needs further
processing to be readable. (For example, if I print such a file
on the local Laserwriter, the long lines are truncated -- thus
I only see the first 80 characters of each paragraph.)
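
The "further processing" such a file needs can be sketched in
Python with the standard textwrap module. The function name
reflow, the CR paragraph separator, the 72-column width, and the
use of LF for the output line breaks are all assumptions made for
illustration:

```python
import textwrap


def reflow(text, paragraph_sep="\r", width=72):
    """Turn one-long-line-per-paragraph text into wrapped lines.

    Each paragraph is filled to at most `width` columns, and
    paragraphs are separated by one blank line.
    """
    paragraphs = [p for p in text.split(paragraph_sep) if p.strip()]
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)
```

Printing the result of reflow() instead of the raw file avoids the
truncation: no output line is longer than the chosen width.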

Clearly we can become increasingly epistemological about what
constitutes plain text (yes, C source code is input for a C
compiler, but it is also text to be read, understood, and edited
by people, sent by email without being reformatted, etc).

And obviously some details still need working out: treatment of
soft hyphens and such. But I think we're on the right track.

- Frank


