RE: Unicode plain text

From: Frank da Cruz
Date: Fri May 23 1997 - 10:50:06 EDT

Murray Sargent wrote:
> I think page breaks given by <FF> (0xC) belong in the block separator
> category and imply an end of paragraph. Page breaks that come in the
> middle of a paragraph or word should be called _soft_ page breaks much
> as we have soft line breaks. ...

This is GUI thinking. Think "plain text", no rendering engines. <FF> is a
hard, unconditional page break. Think of running off monthly paychecks on
your lineprinter, or addressing envelopes (and spelling people's names
correctly in hundreds of languages -- imagine that!).

Kenneth Whistler wrote:
> > In principle, it should be just as possible to fill records with
> > Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208.
> And in practice. The portable Unicode backend library I have
> written merrily reads and writes Unicode plain text into MVS and
> VMS filing systems through standard C file interfaces. No problem.
> I just don't depend on MVS or VMS to provide any specific interpretations
> of *anything* in those files, nor would I want to, to stay portable.

It's funny how the pendulum swings. Back in the old days we didn't even
have file systems, just boxes of cards. Then we developed complex file
systems based on punched-card ideas (look at your old OS/360 JCL manual).
Then we reacted against all of that complexity and said "a file is just a
stream of bytes" with imbedded control information. Now the simplicity of
the stream approach is coming back to bite us because of all the differing
interpretations of the imbedded controls, since no standard was ever set for
their use in files.
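
To make the "differing interpretations" concrete, here is a minimal Python
sketch, not a definitive implementation. It assumes the newline conventions
a receiving program is likely to meet are CRLF, bare CR, bare LF, and NEL
(U+0085), and that each one should be read as a line break. The function
name and the choice of LS as the default target are illustrative
assumptions; whether a given newline really meant "line" or "paragraph" is
exactly the ambiguity at issue.

```python
import re

LS = "\u2028"  # LINE SEPARATOR

# CRLF is listed first so it is matched as one break, not as CR plus LF.
_NEWLINE = re.compile("\r\n|\r|\n|\u0085")

def normalize_newlines(text, target=LS):
    """Replace every legacy newline convention with a single target control.

    Assumption: each legacy newline is a line break, not a paragraph break.
    """
    return _NEWLINE.sub(target, text)

mixed = "DOS line\r\nold Mac line\rUNIX line\nEBCDIC-style line\u0085end"
print(normalize_newlines(mixed).split(LS))
```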

Now we see that there is something to be said for keeping the control
information out of band -- it makes it really simple to change coding systems.
But anybody who has ever done VMS Record Management System programming knows
that the price is complexity and loss of portability. You can't just "copy"
a VMS file to DOS or UNIX, you have to "export" it from the file system and
convert its record information to the appropriate stream format. Nor can you
run an RMS program on a non-VMS system.
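
The "export" step can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not RMS's actual on-disk format: the
record layout (a 2-byte little-endian length followed by the record bytes)
and the names pack_records, read_records, and export_to_stream are all
hypothetical. It shows both sides of the trade: the record boundaries live
outside the text, so changing the coding system touches only the payload,
but a plain byte copy of the file is useless until the records are
converted to a stream.

```python
import struct

def pack_records(lines, encoding):
    """Build a hypothetical record file: 2-byte length prefix, then the data."""
    out = bytearray()
    for line in lines:
        data = line.encode(encoding)
        out += struct.pack("<H", len(data)) + data
    return bytes(out)

def read_records(blob):
    """Walk the length prefixes and return the raw bytes of each record."""
    records = []
    pos = 0
    while pos < len(blob):
        (length,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        records.append(blob[pos:pos + length])
        pos += length
    return records

def export_to_stream(blob, src_encoding, line_sep="\n"):
    """Convert out-of-band record boundaries into an in-band separator."""
    return "".join(r.decode(src_encoding) + line_sep for r in read_records(blob))

latin1_file = pack_records(["Grüße", "Señor"], "latin-1")
stream = export_to_stream(latin1_file, "latin-1")
# Changing the coding system leaves the record structure untouched.
utf16_file = pack_records(stream.splitlines(), "utf-16-le")
print(len(read_records(latin1_file)), len(read_records(utf16_file)))
```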

If we had it all to do over again -- and we do -- we could retain the
simplicity of the stream model without the confusion by precisely defining
a set of controls that may be imbedded, as we have done for LS and PS.
This will allow for both portable data AND portable software.
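
Here is a minimal Python sketch of what a precisely defined set of imbedded
controls buys a program. It assumes three controls with fixed meanings: FF
(U+000C) as a hard page break, PS (U+2029) as a paragraph separator, and LS
(U+2028) as a line separator. The LS and PS meanings are the ones Unicode
defines; treating FF as an unconditional page break is the position argued
here, not settled semantics, and the function name is made up for the
example.

```python
FF = "\u000C"  # FORM FEED, assumed here to be a hard page break
PS = "\u2029"  # PARAGRAPH SEPARATOR
LS = "\u2028"  # LINE SEPARATOR

def split_plain_text(text):
    """Return pages, each a list of paragraphs, each a list of lines."""
    return [
        [paragraph.split(LS) for paragraph in page.split(PS)]
        for page in text.split(FF)
    ]

# Two "paychecks", one per page, with no rendering engine involved.
sample = "Pay to: A. Jones" + LS + "Amount: 100" + PS + "Thank you" + FF + "Pay to: B. Smith"
for number, page in enumerate(split_plain_text(sample), start=1):
    print(number, page)
```

Because each control has exactly one meaning, the same three-way split works
on any platform without knowing which system wrote the file.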

> I agree with Tim that page breaks are on the slippery slope to pretty
> text. Pagination is not necessary for legibility of plain text in
> the same sense that line breaking (forced in some instances) or
> paragraph breaking (required among other things for bidi directional
> control) are. Furthermore, since pagination assumes much more
> about actual rendering devices, forced pagination is as often a
> source of illegibility. (Think of all those preformatted documents
> you've seen at one time or another that on your device display or print
> with one or two lines spilled over to the next page for each forced
> page.) I suspect that the device dependency of pagination is one
> of the reasons why HTML doesn't use a built-in concept of page-break
> on display or FF.

This is all true, but that does not mean there should be no such thing as
a forced page break. Paychecks. Envelopes. Like any tool, a hard page
break can be used for good or evil. It's not the tool's fault.

> Again, think device dependency here. FF used to literally be the
> electronic control for the "Form Feed" on a particular device. It
> moved a mechanical device that shoved paper out and new paper in.

Yes, we still do these things.

Murray Sargent said:
> But back in the '60s and early '70s we had line printers (with
> fixed-width characters) and would ship "plain-text" documents to them
> preformatted with the desired line and page breaks. Such breaks
> consisted of hard CRLFs and FFs to control the line printer, and they
> could appear in the middle of a paragraph or word. Similarly these
> codes create such breaks on most modern printers. So in this sense, an
> FF can come in the middle of a paragraph or even a word. But this
> should be something down at the printer device-driver level. It would
> be a bad choice for file storage (unless it's a printer file).

Again, printer files are common practice, and they are not sent only to
printers. They are also viewed on terminals, "straight no chaser" or in
a text editor, and they are shipped around among diverse platforms. There
is no reason to try to stamp out this practice. It has its legitimate uses.

> To date, Unicode has avoided defining control characters except for the
> TAB and NULL, precisely because there were multiple uses for these
> characters. The Unicode Standard states that "the others may be
> interpreted according to ISO/IEC 6429".

I agree that ASCII and ISO 6429 control characters are a mess, and that is
why it is important to precisely define a minimal set for use in Unicode
plain text. This might be done by defining semantics for the existing C0
and C1 control characters, or by adding new ones.

This will not only make Unicode able to stand on its own, but it will
allow export and import of fancy text between incompatible GUI
applications. And it will provide a Common Intermediate Representation
for plain text that can last for decades, while the corporations slug it
out in the marketplace over their three-letter acronyms du jour.

- Frank
