Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg)

From: John Fieber
Date: Wed May 28 1997 - 17:14:10 EDT

On Wed, 28 May 1997, Frank da Cruz wrote:

> Nobody advocates stamping out higher level protocols, even if that were
> possible. We all use them all the time. I, for one, use them with my
> eyes open -- i.e. with full knowledge that all the work I put into
> creating a "rich" document will need to be done again at some point when
> the current "standard" for richness has been replaced by a new one if I
> want the document to survive. And again. And again.
> I remember the excitement when it first became possible to produce
> typeset-quality documents with Troff, R, DSR, Scribe, TeX, and their
> relatives.

The transient nature of these markup languages is not a trait of
markup languages, but a product of having a one-to-one
relationship between the markup language and a specific piece of
application software. TeX files go with TeX, troff files go with
troff, Scribe files go with Scribe, WordPerfect files go with
WordPerfect, MS-Word files go with MS-Word. If the application
falls out of favor, it takes its markup language and data with
it.

Exactly the same thing happens if you depend on software that
uses its own unique character encoding, or the glyph encoding of
some oddball font.

It is precisely this fatal one-to-one markup/application
relationship that SGML is targeted at. SGML is a very different
beast and it is a mistake to throw it in with the rest. Claiming
that SGML is just another transient markup language that doesn't
address document portability is similar to saying that Unicode is
just another transient character encoding scheme that doesn't
address multilingual computing. Absurd? Of course.

> But I also continued to produce plain-text "documents" on a
> daily basis: email; netnews; computer programs in assembly language,
> Sail, Simula, C, Fortran, Pascal, PL/I, etc;

I think we differ on the notion of "plain text" and "markup".
Let's see. In email, for example, what is the difference between
this markup:

  To: Whoever@somewhere
  Subject: la de da

  blah blah blah blah...

and this markup:

  <subject>la de da</subject>

  <body>blah blah blah blah...</body>

Semantically identical. Furthermore, the correct delivery of mail
depends critically on markup, as does netnews. However
you delimit it, it is still markup. Same for the computer
languages. What are braces, semicolons, parentheses, and comment
delimiters in C if not markup to guide the compiler in parsing
the program? Incidentally, most computer languages could be
expressed in SGML markup (although the utility would be dubious).
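
To make the equivalence concrete, here is a minimal sketch in Python
that reads the subject and body out of both forms and gets the same
result. The function names are made up for illustration, and the
regular expression is only a stand-in for a real SGML parser; it
handles nothing beyond this flat two-element example.

```python
import re
from email import message_from_string

HEADER_STYLE = "To: Whoever@somewhere\nSubject: la de da\n\nblah blah blah blah...\n"
TAG_STYLE = "<subject>la de da</subject>\n\n<body>blah blah blah blah...</body>\n"


def parse_header_style(text):
    """Read RFC 822 style markup: 'Name: value' lines, blank line, body."""
    msg = message_from_string(text)
    return {"subject": msg["Subject"], "body": msg.get_payload().strip()}


def parse_tag_style(text):
    """Read tag style markup. Toy matcher, not an SGML parser (assumption)."""
    fields = dict(re.findall(r"<(\w+)>(.*?)</\1>", text, flags=re.DOTALL))
    return {"subject": fields["subject"], "body": fields["body"].strip()}


if __name__ == "__main__":
    print(parse_header_style(HEADER_STYLE) == parse_tag_style(TAG_STYLE))
```

Only the delimiters differ; once either parser has run, the
application sees the same data.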

Unlike other markup languages, SGML makes no assumptions about
the processing application. SGML merely provides a standard way
for an application to distinguish markup from data. This allows
SGML to be used as a foundation for a much broader range of
applications and helps ensure a long life. On the other hand, as
you may guess, SGML is not a complete solution--if typesetting is
your domain, for example, you will still need some software to do
the layout of your data (TeX works quite well)--but SGML serves
to protect your data from dependencies on specific applications.

That protection facilitates exchange between applications. In
one case you feed your document to a typesetter, in another case
you feed it to a database, in a third case, an on-line document
viewer. Portability between applications extrapolates to
portability across time. HTML may be out of fashion in 20 years,
but any SGML compliant application can still process it even if
the designers never heard of HTML. (You might have to make up a
style sheet, but that is orders of magnitude easier than the
digital archaeology required to re-invent, say, troff, from a
couple of sample documents. SGML documents come with their own
rosetta stone--the DTD, or document type definition.)
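
As a rough sketch of what "feeding one document to several
applications" can look like, the Python below splits a document into
markup and data once and then hands the result to two unrelated
consumers. Everything here is an assumption for illustration: the
<memo> document, the function names, and the use of Python's
html.parser as a stand-in for a real SGML parser (it reads no DTD
and attaches no meaning to element names, which is the only property
the example relies on).

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Separates markup from data without knowing what any element means."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data.strip()))


def parse(text):
    collector = Collector()
    collector.feed(text)
    collector.close()
    return collector.events


def to_record(events):
    """'Database' application: map each element name to its text."""
    record, current = {}, None
    for kind, value in events:
        if kind == "start":
            current = value
        elif kind == "data" and current:
            record[current] = value
        else:
            current = None
    return record


def to_page(events, style):
    """'Typesetter' application: style maps element names to formatters."""
    record = to_record(events)
    return "\n".join(style[name](text) for name, text in record.items() if name in style)


# Hypothetical document and style sheet.
DOC = "<memo><subject>la de da</subject><body>blah blah blah blah...</body></memo>"
STYLE = {"subject": str.upper, "body": lambda s: "  " + s}

if __name__ == "__main__":
    print(to_record(parse(DOC)))
    print(to_page(parse(DOC), STYLE))
```

Note that supporting a new document type here means writing a new
STYLE table, not a new parser.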

In an SGML world, the data drives the application, not the other
way around, as is currently the status quo. That is the
fundamental shift that sets SGML apart from the other markup
languages cited here as examples of why markup languages are to
be avoided when document portability is a concern.

> What is needed is emphatic allowance and support for Unicode plain text
> in the Unicode standard, i.e. a precise and thorough definition of what
> constitutes a self-contained preformatted plain-text document. This is
> primarily a matter of adopting a small but complete set of control codes
> needed for line breaks, paragraph breaks, page breaks, and direction
> control (most of these are already there), and a clear statement of the
> role of the "traditional" control characters at U+0000 - U+001F, U+007F,
> and U+0100 - U+011F.

I think the notion of "plain text" is a little muddy as these
sorts of codes represent markup that is conceptually no different
than, say, SGML. I fully agree, however, that there is room and
a historical precedent for a small set of control (markup) codes
in Unicode, but getting people to agree on what constitutes
"complete" is another matter. :)

I would propose that "complete" be defined as a minimal set of
markup codes necessary to make a document understandable by a
human without resorting to anything outside the Unicode standard.
Machine processing, beyond doing the Right Thing with whitespace,
should not be a criterion. Except for directional control, most of
the necessary markup should be covered by addressing
compatibility with ASCII, although clarification would be

> Plain text files can be used by many applications, but how do we mark
> them as being written in Unicode? Or Latin-1? Or JIS X 0208, etc.

SGML offers some options here by hiding the file system (or any
storage mechanism) behind an entity manager which provides for
such tagging. The details are not currently covered by the
standard (which treats the entity manager pretty much as a black
box), but the entity manager in James Clark's SP system offers a
good example of how it might be done.
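
The general idea can be sketched in a few lines of Python. This is
not how SP does it; the class and method names are invented, and the
"storage" is just bytes held in memory. The point is only that the
encoding tag lives in the entity manager's catalog, outside the
bytes, so the parser always receives characters.

```python
from dataclasses import dataclass


@dataclass
class StorageSpec:
    data: bytes     # stand-in for a file, a URL, a database row, ...
    encoding: str   # the tag that the raw bytes cannot carry themselves


class EntityManager:
    """Hypothetical entity manager: maps entity names to tagged storage."""

    def __init__(self):
        self._catalog = {}

    def register(self, name, data, encoding):
        self._catalog[name] = StorageSpec(data, encoding)

    def read(self, name):
        """Return the entity as characters, whatever its stored encoding."""
        spec = self._catalog[name]
        return spec.data.decode(spec.encoding)


if __name__ == "__main__":
    text = "caf\u00e9"
    manager = EntityManager()
    manager.register("menu-latin1", text.encode("latin-1"), "latin-1")
    manager.register("menu-unicode", text.encode("utf-16"), "utf-16")
    # Different bytes on disk, identical characters handed to the parser.
    print(manager.read("menu-latin1") == manager.read("menu-unicode"))
```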

