Re: Plain Text

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Sun Jul 04 1999 - 13:56:54 EDT


> I conclude that I disagree with Frank's attempts to make his own limited
> experience normative...
>
I'm not sure why my experience has become an issue in this discussion but
I can assure you I have a fair amount. My first programming experience
was with plugboards and little wires on IBM EAM equipment. My current
project, now in its 18th year, is precisely the interchange of text among
divergent platforms, with full conversion of both record format and
character set. I have written software to do this that has run, at one
time or another, on more than 700 different hardware-and-OS platforms,
many long dead, and at present on more than 150. This project, which I
manage, also produces (and/or collects and distributes, and supports)
similar software written by other people both here and abroad, and the
entire collection spans practically every computer and operating system
that has existed over the past 25-30 years with just a few exceptions.

Part of the project is the definition of a protocol for meaningful text
transfer. The protocol requires conversion of local formats and character
sets to standard ones when sending, and the reverse procedure when
receiving. Only international standard character sets are used on the
wire, and are tagged using standard ISO-registered identifiers. This
protocol has been in production for more than 10 years and is used in many
parts of the world, especially Eastern and Western Europe, Israel,
Greece, the former USSR, Japan, and the Americas.

One of the key questions in designing and implementing such a protocol is
"what is a text file?" What distinguishes it from a non-text, or "binary"
file? Constant day-to-day experience with a worldwide user base helps me
to form what I hope is an adequate grasp of the issues.

> >The point is, it doesn't matter. Each platform has its own format for
> >internal use, but a standardized interface to the outside world. To
> >further demonstrate this fact, if I then tell the computer on the far
> >end to "type" or "cat" the file, it will, invariably, send:
> >
> > This is a line<CR><LF>
>
> Your cultural ignorance/sheltered life-experience is showing. *You* may
> live in an environment where these changes are made automatically, but a
> lot of us don't.
>
Then please give counterexamples.

> >So who cares what the file format is -- except of course when we want to
> >transfer the file to another platform.
>
> And since I don't use a VT100 simulator anymore, I only encounter this
> issue when transfering files to another platform, and as a result I care
> all the time.
>
> >In that case, it is the
> >responsibility of each file-transfer agent
>
> When reading floppy disks?
>
Of course. One of the biggest problems facing any of us who wishes to live
in a world of computing diversity is the failure of file system designers to
develop a rational method for tagging files, and indeed, for developing
standard interchange formats. That's what we're trying to do here.

Consider a minimal platform like DOS. You can set up your DOS system to
load different code pages, such as CP850 for West European languages, CP866
for Cyrillic, and so on. Then you can use standard DOS utilities to create
and edit text files in many languages (but only one per file). However, no
record is kept of the encoding (character set) of each file. This presents
rather significant problems even when we stay on the PC, before we ever
think about interchanging files.

So at minimum, a text file should be tagged according to character set. To
my knowledge, this has never been done at the file-system level.
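The ambiguity is easy to demonstrate: the very same bytes read as different text under different code pages, and nothing in the file tells you which reading was intended. A minimal Python sketch (Python's names for these code pages are `cp850` and `cp866`):

```python
# The same byte sequence, decoded under two DOS code pages.
# Without a character-set tag on the file, neither reading is
# more "correct" than the other.
data = bytes([0x87, 0x84, 0x90, 0x80, 0x82, 0x91, 0x92, 0x82, 0x93, 0x89])

west = data.decode("cp850")   # West European reading: accented Latin letters
cyr  = data.decode("cp866")   # Cyrillic reading: a Russian greeting

print(west)   # gibberish under this interpretation
print(cyr)    # ЗДРАВСТВУЙ
```

A file created on a Cyrillic PC and opened on a West European one is silently misread, with no error of any kind.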

What about file type and record format? Data interchange can be done in
various ways. One way involves cooperating agents at each end -- e.g. FTP
client and server. They can use their own application-specific protocol
to control the process. For example, one can say "I'm DOS" and the other
"I'm UNIX" and then apply the appropriate conversions. Of course as
platforms multiply, we have an n x n problem. Therefore we settle upon
standard formats to be used on the wire. Each transfer partner converts to
and from these standard formats.
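The idea can be sketched in a few lines of Python; the function names are mine, and the wire format here is CRLF-delimited records, as in FTP's ASCII type:

```python
import os

WIRE_EOL = "\r\n"   # the single standard on-the-wire record terminator

def to_wire(text: str, local_eol: str = os.linesep) -> str:
    """Convert the local record format to the standard wire format."""
    if local_eol == WIRE_EOL:
        return text
    return text.replace(local_eol, WIRE_EOL)

def from_wire(text: str, local_eol: str = os.linesep) -> str:
    """Convert the standard wire format back to the local record format."""
    return text.replace(WIRE_EOL, local_eol)

# Each platform needs only one converter pair, to and from the
# wire format, so n platforms need n converters instead of n x n.
unix_text = "This is a line\nAnother line\n"
wire = to_wire(unix_text, "\n")       # "This is a line\r\nAnother line\r\n"
mac_text = from_wire(wire, "\r")      # classic Mac OS used bare CR
```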

Moving files by magnetic media presents numerous problems, but only because
we have forgotten how to do it. Back in the 1970s, ANSI developed standards
for data interchange by magnetic media (e.g. ANSI X3.26-1978) that worked
perfectly well until the personal computer revolution came along and
standards went out of style. A DOS (or Macintosh or IRIX or any other)
diskette is simply not intended for export to other platforms.

This is the kind of situation we would like to avoid in the future. Hence
this discussion.

> You are still claiming that text files as they occur in your computer
> subculture are for some reason normative for the rest of us.
>
Actually I am attempting to achieve agreement on a precise definition of
Unicode plain text that allows the text to be preformatted, one that
gives us the same capability that we have always had with ASCII (and Latin-x
etc) of encoding and presenting information without *requiring* the use of
any higher intelligence beyond what is needed to interpret Space, LS, PS,
HT, and FF characters, plus whatever else is needed to accommodate bidi,
etc.
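As an illustration of just how little intelligence that requires, here is a Python sketch of a renderer that interprets only those characters (in Unicode, LS is U+2028 and PS is U+2029); everything else passes through for display:

```python
LS, PS = "\u2028", "\u2029"   # Unicode Line Separator, Paragraph Separator

def render_plain(text: str) -> str:
    """Interpret only the plain-text formatting characters:
    LS ends a line, PS ends a paragraph; Space, HT, and FF
    pass through unchanged, like every other character."""
    out = []
    for ch in text:
        if ch == LS:
            out.append("\n")        # line break
        elif ch == PS:
            out.append("\n\n")      # paragraph break (blank line)
        else:
            out.append(ch)
    return "".join(out)

print(render_plain("First line\u2028Second line\u2029New paragraph"))
```

No markup parser, no layout engine: the text stands on its own, exactly as ASCII text always has.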

> >Prior to the advent of
> >word processors, the idea of "long line as paragraph" never came up.
>
> Word processing began in the 1960s. I gather you had a later date in mind.
> Did you mean specifically WYSIWYG word processors, invented at Xerox in the
> late 1970s?
>
And, before it, NLS, used at government research institutes in the 1960s.
But again, that's not plain text. It's "input for a text formatter". It
does not stand on its own.

> >No, a correct email client will leave it alone. Whether I want my email
> >reformatted by your client should be my choice, since only I know what my
> >intentions are in sending it. ^^^^^^^^^
>
> However, it actually is the recipient's choice, and you can't stop us.
>
This sounds like quibbling but it's an important point. If I have the
capability to compose and format a plain-text message exactly as I want you
to see it, the mail system should allow me to mark it as "preformatted plain
text" and then you would have to go out of your way to reformat it. Whereas
if my mail client sends long lines with no formatting, it should mark it as
"plain text to be flowed".
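A sketch of the recipient-side rule, with purely illustrative tag names (not any existing MIME parameter):

```python
import textwrap

def display(body: str, disposition: str, width: int = 72) -> str:
    """Reflow only text the sender marked as flowable; leave
    preformatted plain text exactly as composed."""
    if disposition == "preformatted":
        return body                      # the sender's layout is authoritative
    # "flowed": each long line is a paragraph, wrapped for local display
    return "\n".join(textwrap.fill(line, width)
                     for line in body.splitlines())
```

The recipient's client still makes the final rendering decision, but the sender's intention travels with the message instead of being guessed at.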

Email issues, especially MIME, are a whole new topic, and a controversial
one, best avoided here. But a clear statement from the Unicode Consortium
on plain text that addresses the issue of formatting might motivate the
"email community" to deal with these issues in a productive way.

> A growing number of standards specify the use of Unicode text files,
> without explicitly defining them. If we get anywhere with this, we will
> have to run our proposal past these other groups, including the IETF, the
> POSIX committee, programming language standards committees, etc.
>
Good. Let's try to keep making progress.

We all have an intuitive grasp of the meaning of preformatted plain text.
You'll find it in many places:

 . READ.ME files on your software disks.

 . Program source code.

 . Traditional (not "legacy") email and netnews.

 . Voluminous full-text information already online.

and so on. We should find a way to carry this notion forward for Unicode
in a way that:

 . Avoids the pitfalls of platform-dependent formatting conventions.

 . Allows straightforward and unambiguous conversion of 8-bit data to
   Unicode (and, to the extent possible, vice-versa).

 . Is independent of any higher-level protocol, markup language,
   product, or even standard. In other words, the Unicode definition
   should stand entirely on its own so that files encoded (or transmitted)
   in this format will be universally understood for years, decades,
   centuries to come, no matter what else might change, as long as Unicode
   itself lives on.

- Frank



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:48 EDT