Re: Should furigana be considered part of "plain text"?

From: Edward Cherlin
Date: Tue Jul 04 2000 - 02:12:27 EDT

At 11:05 AM -0800 7/2/00, John Hudson wrote:
>At 09:16 AM 7/2/00 -0800, Doug Ewell wrote:
> >The problem with the phrase "plain text ceases to be plain if you decide
> >that layout information needs to be encoded" is the word "layout." In
> >the broadest sense, line and paragraph separation could be considered
> >"layout," and nobody would suggest doing away with the plain-text
> >characters needed to control those functions.

The problem with the phrase "plain text" is that it is a polite
fiction. ASCII characters, printing and non-printing, originated as
commands to printers. What we originally called plain text files are
those that would give reasonable results when printed on an ASCII
teleprinter used as a terminal. The mechanical functions of Teletypes
defined the original semantics of the control characters used in text
files, which have since carried over to screen and laser printer output:

CR  Carriage Return  Move printing point to beginning of current line.
LF  Line Feed        Move printing point down one line.
BS  Back Space       Move printing point one space left, unless at
                     left limit.
HT  Horizontal Tab   Move printing point right to next tab stop,
                     unless at right limit.
FF  Form Feed        Move printing point to top of next page.

These semantics are the reason why many of us call CR-LF either a line
or paragraph break today. Explicit line breaks were, of course,
essential on the original devices. Both CR and BS were routinely used
for overstriking.
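
To make the mechanical semantics above concrete, here is a minimal
sketch in Python of a printing terminal that keeps overstrikes. The
function name teletype_render, the 72-column width, and the 8-column
tab stops are assumptions for illustration only; FF and all other
control characters are simply ignored, since pages are not modelled.

```python
def teletype_render(data: str, width: int = 72, tab: int = 8) -> list[list[str]]:
    """Return one list per printed line; each cell holds every glyph struck there."""
    rows: list[list[str]] = [[]]
    row = col = 0
    for ch in data:
        if ch == "\r":            # CR: back to the left margin, same line
            col = 0
        elif ch == "\n":          # LF: down one line, column unchanged
            rows.append([])
            row += 1
        elif ch == "\b":          # BS: one space left, unless at left limit
            col = max(col - 1, 0)
        elif ch == "\t":          # HT: next tab stop, unless at right limit
            col = min((col // tab + 1) * tab, width - 1)
        elif ch < " " or ch == "\x7f":
            continue              # FF and other controls: not modelled here
        else:
            cells = rows[row]
            while len(cells) <= col:
                cells.append("")
            if ch != " ":         # a space advances without striking anything
                cells[col] += ch
            col = min(col + 1, width - 1)
    # Show untouched positions as blanks.
    return [[cell or " " for cell in cells] for cells in rows]
```

With this model, "a" BS "_" leaves both glyphs in one cell (an
underlined a), and a bare LF leaves the next character one column to
the right of where the previous line ended, which is why the printing
devices needed CR and LF together.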

The semantics of these and other ASCII control characters have been
changing with technology. *Some* computer system designers, noticing
that the demands of printing terminals were not requirements on
system file internals, chose to use either CR alone or LF alone for
line or paragraph ends, all without coordination. Line breaks in
files became optional on systems that provided word wrap on output or
display. Users were given options for setting tab stops, margins, and
page lengths. Character 7F, DEL, originally meant "not a character;
deleted" on punched paper tape, but began turning into destructive
backspace even before tape died. ESC has undoubtedly mutated the
most. The use of 1A SUB for end of file in several operating systems
including PCDOS is a violation of the ASCII standard, which provides
both 03 ETX (End of Text) and 04 EOT (End of Transmission), but who

There are now numerous incompatible formats bearing the name "plain
text". Some are distinguished by the choice of line end string. In
some cases, line ends are required, especially if there is a maximum
line length. Lines of unlimited length may represent paragraphs or
database records. Character sets other than ASCII may be used,
especially 8859-1 or Windows code page 1252. These days, people want
to be able to use any coded character set and still call it plain
text. In fact, people want to introduce all kinds of markup,
including furigana/rubi, language tags, ligature marking, and even
character set shift sequences (not just the poky SI and SO), and
still call the result plain text.
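
As a rough illustration of the line-end problem alone, the following
Python sketch counts which line-end strings a text uses and rewrites
them to a single convention. The names count_line_ends and
normalize_line_ends are hypothetical, and the sketch assumes the text
has already been decoded from whatever character set it arrived in.

```python
import re
from collections import Counter

# CR-LF must be tried first so it is counted as one line end, not two.
_LINE_END = re.compile(r"\r\n|\r|\n")

def count_line_ends(text: str) -> Counter:
    """Tally each line-end string (CR-LF, CR alone, LF alone) found in text."""
    return Counter(_LINE_END.findall(text))

def normalize_line_ends(text: str, newline: str = "\n") -> str:
    """Rewrite every line end in text to the chosen convention."""
    return _LINE_END.sub(newline, text)
```

Note that Python's default text-mode file reading already translates
line ends, so a file should be opened with newline="" (or read as
bytes and decoded) if the original line ends are to be counted.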

>I think this is a fair comment, if one assumes so broad a sense of
>'layout'. On the other hand, I wouldn't consider a paragraph break to be
>necessarily 'layout', since it is primarily a textual convention that can
>be represented in layout in a myriad of different ways: double spacing,
>indentation, pilcrows, etc. Now, we have interpreted a paragraph break in
>a particular way in plain text code -- a hard break and a move to a new
>line, i.e. the behaviour of a typewriter 'return' key --

by way of the Teletype

>and have further
>muddied things by using this code to force layout by, for instance,
>entering two paragraph breaks
>to achieve this particular layout.

The use of tabs, spaces, CR, and LF to lay out "plain text" is
necessary in mail and news, and a total pain in documents that will
need to be converted to anything else.

>Personally, I think a truly plain text paragraph break would have no
>particular layout behaviour associated with it; rather, it would indicate a
>textual break that would be interpreted by applications according to user
>defined layout preferences. In e-mail, it is handy to have paragraphs
>separated by a 'double return', especially when several correspondents are
>being quoted, but elsewhere I would prefer indented, single-spaced
>paragraphs. Since it is the same textual break that is being indicated, I
>don't think these two layout options should be differently encoded. I think
>equating a digital paragraph break with the return key on a manual
>typewriter is actually a failure to encode plain text.

It is too late for such simple solutions. If we want to have a
standard for plain text, we have to provide for each of the common
usages. We have tried to start such a project twice on this list, and
have failed utterly both times.

>That said, I realise that this might be an extremist view, and I certainly
>don't expect anybody to change anything now. Although I have to add, as
>someone who has typeset books, that having to remove all the double returns
>in a document before I can properly control the paragraph breaks is almost
>as annoying as replacing multiple tabs or word spaces when these have been
>used to force layout in 'plain text'. Thank goodness for macros.

Hear, hear. I have wasted a remarkable amount of time over the years
on reformatting Word documents into FrameMaker. The "pain text" [sic]
markup habits of engineers are responsible for most of the work in
those conversions. Thank goodness for book-wide search and replace in
FM 6.
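
For anyone without such a tool, the cleanup described above might be
sketched in Python as follows. This is only a sketch under stated
assumptions: that blank lines (the "double returns") are the only
paragraph markers, that single line ends inside a paragraph are mere
wrapping, and that runs of tabs or spaces carry no meaning. The name
unlayout is hypothetical.

```python
import re

def unlayout(text: str) -> list[str]:
    """Split text into paragraphs, discarding whitespace used to force layout."""
    # Treat CR-LF and bare CR as LF before looking for blank lines.
    text = re.sub(r"\r\n|\r", "\n", text).strip()
    # One or more blank lines (possibly holding stray spaces or tabs)
    # mark a single paragraph break.
    blocks = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse hard line ends, tabs, and runs of
    # spaces into single word spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]
```

The result is a list of paragraphs with no layout attached, so the
receiving application can apply indents or spacing as it sees fit.
Text that really does depend on tabs or spacing, such as tables, would
be damaged by this and needs separate handling.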

>John Hudson
>Tiro Typeworks
>Vancouver, BC

Edward Cherlin, Spamfighter
"It isn't what you don't know that hurts you, it's
what you know that ain't so."--Mark Twain, or else
some other prominent 19th century humorist and wit
