Re: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg)

From: John Fieber (
Date: Thu May 29 1997 - 00:14:03 EDT

On Wed, 28 May 1997, Frank da Cruz wrote:

> > It is precisely this fatal one-to-one markup/application
> > relationship that SGML is targeted at. SGML is a very different
> > beast and it is a mistake to throw it in with the rest. Claiming
> > that SGML is just another transient markup language that doesn't
> > address document portability ...

> I don't think anybody did that. But this does not mean SGML can
> be used for everything.

No, but its useful range of applications is quite a bit wider
than any other markup scheme I know of. That helps a lot in
building a solid foundation that won't fade away.

> But SGML is to mark up text for later formatting to fit the
> requirements of some output device or application that understands
> this kind of markup.

SGML is explicitly *not* about text formatting. It is about
marking up documents describing what content *is*, not what to do
with it. If markup represents typesetting instructions, that
markup is good for little else. If your markup describes what
the content is, you have far more options.

For example, the introduction of a new term in a technical manual
may be rendered in italics. You could mark it up like: <it>new
term</it> which would be fine if the end target is a typesetter,
but if you mark it up with: <firstterm>new term</firstterm>, you
can still render it as italic, but you can also automatically add
it to the index as the defining location of the term, or, in an
on-line environment, if you encounter an unfamiliar term, the
search engine can seek out the defining occurrence if it exists.
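
To make that concrete, here is a rough sketch in Python of two
consumers reading the same <firstterm> markup: one renders the term
in italics, the other builds an index of defining occurrences. The
function names and sample paragraphs are made up for illustration,
and a regular expression stands in for a real SGML parser, which is
an assumption that only holds for simple, non-nested tags like these.

```python
import re

# Assumption: <firstterm> elements are simple and never nested, so a
# regular expression is enough here. A real system would use an SGML parser.
FIRSTTERM = re.compile(r"<firstterm>(.*?)</firstterm>", re.DOTALL)


def render_italic(text):
    """Presentation use: show each defining occurrence as *italic*."""
    return FIRSTTERM.sub(lambda m: "*" + m.group(1) + "*", text)


def build_index(paragraphs):
    """Index use: map each term to the paragraph number that defines it."""
    index = {}
    for number, paragraph in enumerate(paragraphs, start=1):
        for term in FIRSTTERM.findall(paragraph):
            # Collapse internal whitespace so line-wrapped terms still match.
            index.setdefault(" ".join(term.split()), number)
    return index


# Hypothetical sample content, not taken from any real manual.
paragraphs = [
    "A <firstterm>document type definition</firstterm> declares the elements.",
    "Each <firstterm>element</firstterm> is checked against that definition.",
]

print(render_italic(paragraphs[0]))
# prints: A *document type definition* declares the elements.
print(build_index(paragraphs))
# prints: {'document type definition': 1, 'element': 2}
```

With <it> markup only the first function would be possible, since
nothing would distinguish a defining occurrence from any other
italic text.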

But back to your point:

> As distinguished from plain text as we have
> so on, in order to achieve the *final* result, not (necessarily) to
> be input for a higher-level reformatter.

Yes, and though I would argue at length that SGML markup is well
worth the extra effort, I also agree that this minimalist approach
to document portability deserves support.

> Then what to do about ASCII controls in Unicode text? I'd say
> that since ASCII (and Latin-x, etc) must be converted to Unicode,
> then it is the responsibility of the conversion agent to
> understand the local conventions for line breaks (etc) in the
> source text, and to convert to the well-defined Unicode controls.

The only hitch for 7-bit ASCII is UTF-8, which can be seen as a
convenient way to avoid explicitly converting legacy data, since
7-bit ASCII is already valid UTF-8. If your external storage is
UTF-8, how can you reliably tell what has been converted and what
has not?
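
A small Python sketch of the problem, under some assumptions: the
"local convention" is taken to be CR LF, the conversion target is
taken to be U+2028 LINE SEPARATOR, and looks_converted() is a
hypothetical heuristic, not anything a standard defines.

```python
# Legacy 7-bit ASCII text using a local CR LF line-break convention.
legacy = b"first line\r\nsecond line\r\n"

# Every 7-bit ASCII byte sequence is already valid UTF-8, so decoding
# succeeds and the CR LF pairs pass through untouched.
text = legacy.decode("utf-8")
assert text.encode("utf-8") == legacy

# Assumption: an explicit conversion step would replace the local line
# breaks with a Unicode control such as U+2028 LINE SEPARATOR.
converted = text.replace("\r\n", "\u2028")


def looks_converted(data: bytes) -> bool:
    """Hypothetical heuristic: a byte >= 0x80 means the data is not plain ASCII."""
    return any(byte >= 0x80 for byte in data)


print(looks_converted(legacy))                     # False
print(looks_converted(converted.encode("utf-8")))  # True
print(looks_converted(b"no line breaks here"))     # False, converted or not
```

The last case is the hitch: text with nothing that needed changing
is byte-for-byte identical before and after conversion, so the
stored bytes alone cannot say whether the conversion ever ran.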

> Clearly we can become increasingly epistemological about what
> constitutes plain text (yes, C source code is input for a C
> compiler, but it is also text to be read, understood, and edited
> by people, sent by email without being reformatted, etc).

After pondering it for a while, I cut that section out of my last
post. :) One-sentence summary: some markup schemes cater to human
processing, others to machine processing, and yet others, most
notably programming languages, work hard to satisfy both needs.


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT