RE: Unicode plain text

From: Murray Sargent (
Date: Thu May 22 1997 - 23:50:06 EDT

But back in the '60s and early '70s we had line printers (with
fixed-width characters) and would ship "plain-text" documents to them
preformatted with the desired line and page breaks. Such breaks
consisted of hard CRLFs and FFs to control the line printer, and they
could appear in the middle of a paragraph or word. Similarly, these
codes create such breaks on most modern printers. So in this sense, an
FF can come in the middle of a paragraph or even a word. But this
should be something down at the printer device-driver level. It would
be a bad choice for file storage (unless it's a printer file).

To date, Unicode has avoided defining control characters except for the
TAB and NULL, precisely because there were multiple uses for these
characters. The Unicode Standard states that "the others may be
interpreted according to ISO/IEC 6429". Nevertheless, Frank's
recommendation that Unicode fill in some of the other control-character
semantics seems compelling, if only on a recommendation basis. We
could, for example, enumerate the most common usages of the control
characters CR, LF, VT, and FF in contemporary software.
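
As a rough illustration of what such an enumeration might look like in
code, here is a minimal Python sketch. The mapping is an assumption made
for the example (CR, LF, CRLF, and VT treated as line breaks, FF as a
page break that also ends the paragraph); it is not taken from the
Unicode Standard, and the names BREAK_KINDS and tokenize are
hypothetical.

```python
import re

# Assumed, illustrative meanings for the break controls. These are not
# definitions from any standard, only one plausible reading of common usage.
BREAK_KINDS = {
    "\r\n": "line",
    "\r": "line",
    "\n": "line",
    "\x0b": "line",  # VT: treated here as a line break
    "\x0c": "page",  # FF: page break, which also ends the paragraph
}

# Match CRLF as a unit before falling back to the single characters.
_BREAK_RE = re.compile(r"\r\n|[\r\n\x0b\x0c]")


def tokenize(text):
    """Yield ("text", run) and ("line" | "page", control) tuples in order."""
    pos = 0
    for match in _BREAK_RE.finditer(text):
        if match.start() > pos:
            yield ("text", text[pos:match.start()])
        yield (BREAK_KINDS[match.group()], match.group())
        pos = match.end()
    if pos < len(text):
        yield ("text", text[pos:])
```

A consumer could then decide, per break kind, whether to start a new
line, a new paragraph, or a new page, without caring which byte sequence
produced it.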


> -----Original Message-----
> From: Unicode Discussion []
> Sent: Thursday, May 22, 1997 6:27 PM
> To: Multiple Recipients of
> Subject: RE: Unicode plain text
> I think page breaks given by <FF> (0xC) belong in the block separator
> category and imply an end of paragraph. Page breaks that come in the
> middle of a paragraph or word should be called _soft_ page breaks much
> as we have soft line breaks. We could talk about adding an optional
> page-break analogous to the optional hyphen (0xAD), but computer
> folklore over the years clearly indicates that <FF> shouldn't be
> overloaded for this purpose. (Offhand, I don't think an optional
> page break would be a useful code to have, since you'd really like to
> have the semantic "eject if within n lines of the page bottom." Such
> a semantic requires the number n, which doesn't fit into a single
> code position.)
> Murray
> > -----Original Message-----
> > From: Unicode Discussion []
> > Sent: Thursday, May 22, 1997 4:00 PM
> > To: Multiple Recipients of
> > Subject: Re: Unicode plain text
> >
> > > How do record-oriented file systems fit into this discussion?
> > > (Remember those file systems that ruled the world before the UNIX
> > > idea of the byte stream came along...)
> > >
> > They are far from dead; IBM VM/CMS and Digital (Open)VMS, to name
> > two, are still widespread. But VM/CMS and other IBM mainframe
> > and midrange operating systems use EBCDIC text encoding and I am
> > not aware of any movement to support Unicode in this setting,
> > at least not internally.
> >
> > In VMS, most text files are record oriented -- usually variable
> > length records, with end of line *implied* for each record, but
> > not recorded in any particular format. This is actually quite a
> > sensible approach, given the wide variety of text-stream formats
> > that abound for no good reason.
> >
> > In principle, it should be just as possible to fill records with
> > Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208.
> >
> > The VMS file system also supports the notion of "carriage control",
> > of which there are many types (like the once-familiar Fortran
> > Hollerith style, in which the first character specified whether the
> > line was to overprint the previous line, appear on the next line,
> > appear 2 lines down, etc, or start on a new page). The carriage
> > control information, again, is separate from the file's data. So
> > again, in principle, there should be no clash with Unicode.
> >
> > In fact, I think a VMS implementation of Unicode text might be an
> > interesting exercise. But this too begs the question of how to
> > map Unicode plain text into this environment, which in turn calls
> > for a Unicode plain-text standard for such things as page breaks.
> >
> > And no, I don't think this brings us anywhere near any slippery
> > slopes. Page breaks have been an integral part of plain text since
> > the 1950s when we were programming IBM 409 Electric Accounting
> > Machines by sticking little wires into plugboards.
> >
> > - Frank
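
The Fortran-style carriage control Frank describes can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions: it uses the conventional control characters (space for
single spacing, "0" for double spacing, "1" for a new page, "+" for
overprinting), it assumes LF alone moves to the start of the next line,
and it treats any unrecognized control character as a space. The names
CONTROL_PREFIX and render_fortran_records are hypothetical.

```python
# Control character in column 1 -> what to send before the record's text.
# Assumes a device where LF alone moves to the start of the next line.
CONTROL_PREFIX = {
    " ": "\n",     # advance one line, then print
    "0": "\n\n",   # advance two lines, then print
    "1": "\x0c",   # start a new page, then print
    "+": "\r",     # return to column 0 without advancing: overprint
}


def render_fortran_records(records):
    """Turn records with a leading carriage-control character into a stream."""
    parts = []
    for record in records:
        control, text = record[:1], record[1:]
        # Unknown (or missing) control characters fall back to single spacing.
        parts.append(CONTROL_PREFIX.get(control, "\n") + text)
    return "".join(parts)
```

Note that the control character lives in the record here, whereas in the
VMS case described above the carriage-control information is kept
separate from the file's data.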

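The "eject if within n lines of the page bottom" semantic mentioned in
the quoted message can also be sketched, which shows why it needs a
parameter rather than a single code position. In this Python sketch the
("need", n) marker, the function name paginate, and the default of 66
lines per page are all assumptions made for the example, and "within n
lines" is read as "fewer than n lines remain on the page".

```python
def paginate(items, page_length=66):
    """Split items into pages.

    Each item is either a line of text (str) or a ("need", n) marker that
    requests a page eject if fewer than n lines remain on the current page.
    """
    pages, page = [], []
    for item in items:
        if isinstance(item, tuple):
            _, n = item
            # Conditional eject: only break if the page is non-empty and
            # too little room is left for the next n lines.
            if page and page_length - len(page) < n:
                pages.append(page)
                page = []
            continue
        if len(page) == page_length:
            # Ordinary overflow onto the next page.
            pages.append(page)
            page = []
        page.append(item)
    if page:
        pages.append(page)
    return pages
```

For example, with a four-line page, a ("need", 2) marker after three
lines forces a break, while ("need", 1) does not.
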