Re: Plain Text

From: Edward Cherlin (edward.cherlin.sy.67@aya.yale.edu)
Date: Mon Jul 05 1999 - 04:44:26 EDT


At 10:51 -0700 7/4/1999, Frank da Cruz wrote:
[Ed Cherlin wrote:]
>> I conclude that I disagree with Frank's attempts to make his own limited
>> experience normative...
[snip]
I withdraw the remark, in view of other information received, and the
answers to my objections which Frank has provided, like the next.
[Frank]
>> >So who cares what the file format is -- except of course when we want to
>> >transfer the file to another platform.
>> >
>> >In that case, it is the
>> >responsibility of each file-transfer agent
>>
[Ed]
>> When reading floppy disks?
>>
[Frank]
>Of course. One of the biggest problems facing any of us who wishes to live
>in a world of computing diversity is the failure of file system designers to
>develop a rational method for tagging files, and indeed, for developing
>standard interchange formats. That's what we're trying to do here.
>
>Consider a minimal platform like DOS. You can set up your DOS system to
>load different code pages, such as CP850 for West European languages, CP866
>for Cyrillic, and so on. Then you can use standard DOS utilities to create
>and edit text files in many languages (but only one per file). However, no
>record is kept of the encoding (character set) of each file. This presents
>rather significant problems even when we stay on the PC, before we ever
>think about interchanging files.
>
>So at minimum, a text file should be tagged according to character set. To
>my knowledge, this has never been done at the file-system level.
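To make the untagged code page problem concrete, here is a minimal sketch in Python. The sample word and the variable names are illustrative assumptions, not anything from Frank's message; the codec names cp850 and cp866 are the standard Python names for the DOS code pages he mentions.

```python
# The same bytes on disk, interpreted under two different DOS code pages.
# Nothing in the bytes themselves records which code page wrote them.
raw = "Привет".encode("cp866")   # what a Cyrillic (CP866) DOS editor would save

as_cp866 = raw.decode("cp866")   # right guess: recovers the original text
as_cp850 = raw.decode("cp850")   # wrong guess: decodes without error, to other letters

print(as_cp866)
print(as_cp850)
```

Both decodes succeed, which is the trouble: without a tag, software cannot tell a wrong guess from a right one.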
>
>What about file type and record format? Data interchange can be done in
>various ways. One way involves cooperating agents at each end -- e.g. FTP
>client and server. They can use their own application-specific protocol
>to control the process. For example, one can say "I'm DOS" and the other
>"I'm UNIX" and then apply the appropriate conversions. Of course as
>platforms multiply, we have an n x n problem. Therefore we settle upon
>standard formats to be used on the wire. Each transfer partner converts to
>and from these standard formats.
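A minimal sketch of the wire-format idea, in Python, limited to line terminators. The platform names, the choice of CRLF as the wire convention, and the function names are assumptions made for illustration. The point is only that each platform needs one converter to and one from the wire format, so n platforms need 2n converters rather than one per ordered pair.

```python
# Hypothetical table of local line terminators, for illustration only.
LOCAL_EOL = {"dos": b"\r\n", "unix": b"\n", "classic_mac": b"\r"}
WIRE_EOL = b"\r\n"   # assumed canonical terminator on the wire

def to_wire(data: bytes, platform: str) -> bytes:
    """Convert a local text file to the wire format."""
    return WIRE_EOL.join(data.split(LOCAL_EOL[platform]))

def from_wire(data: bytes, platform: str) -> bytes:
    """Convert wire-format text to the receiver's local format."""
    return LOCAL_EOL[platform].join(data.split(WIRE_EOL))

def transfer(data: bytes, sender: str, receiver: str) -> bytes:
    """Neither side needs to know the other's conventions."""
    return from_wire(to_wire(data, sender), receiver)

print(transfer(b"one\ntwo\n", "unix", "classic_mac"))
```

This sketch assumes the input really uses its platform's terminator throughout; a stray CR in a UNIX file would need a policy of its own.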
>
>Moving files by magnetic media presents numerous problems, but only because
>we have forgotten how to do it. Back in the 1970s, ANSI developed standards
>for data interchange by magnetic media (e.g. ANSI X3.26-1978) that worked
>perfectly well until the personal computer revolution came along and
>standards went out of style. A DOS (or Macintosh or IRIX or any other)
>diskette is simply not intended for export to other platforms.
>
>This is the kind of situation we would like to avoid in the future. Hence
>this discussion.
>
>> You are still claiming that text files as they occur in your computer
>> subculture are for some reason normative for the rest of us.
>>
>Actually I am attempting to achieve an agreement on a precise definition of
>Unicode plain text that allows the text to be already formatted, one that
>gives us the same capability that we have always had with ASCII (and Latin-x
>etc) of encoding and presenting information without *requiring* the use of
>any higher intelligence beyond what is needed to interpret Space, LS, PS,
>HT, and FF characters, plus whatever else is needed to accommodate bidi,
>etc.
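As a rough illustration of how little intelligence such a reader needs, here is a sketch in Python that interprets only FF, PS, LS, and HT. The function name, the nested-list result, and the tab width of 8 are assumptions; bidi handling is deliberately left out.

```python
LS = "\u2028"   # LINE SEPARATOR
PS = "\u2029"   # PARAGRAPH SEPARATOR
FF = "\u000c"   # FORM FEED

def parse_plain_text(text: str, tab_width: int = 8) -> list:
    """Split text into pages, paragraphs, and lines; expand HT to spaces.

    Returns a list of pages, each a list of paragraphs, each a list of lines.
    Bidi is not handled here.
    """
    pages = []
    for page in text.split(FF):
        paragraphs = []
        for para in page.split(PS):
            lines = [line.expandtabs(tab_width) for line in para.split(LS)]
            paragraphs.append(lines)
        pages.append(paragraphs)
    return pages

sample = "a\tb" + LS + "c" + PS + "d" + FF + "e"
print(parse_plain_text(sample))
```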
[snip]
[Frank]
>> >Whether I want my email
>> >reformatted by your client should be my choice, since only I know what my
>> >intentions are in sending it.        ^^^^^^^^^
>>
>> However, it actually is the recipient's choice, and you can't stop us.
>>
>This sounds like quibbling but it's an important point. If I have the
>capability to compose and format a plain-text message exactly as I want you
>to see it, the mail system should allow me to mark it as "preformatted plain
>text" and then you would have to go out of your way to reformat it. Whereas
>if my mail client sends long lines with no formatting, it should mark it as
>"plain text to be flowed".
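The distinction Frank draws can be sketched in Python as follows. The flavor labels "preformatted" and "flowed" and the render function are hypothetical; no existing mail header or MIME parameter is implied. A blank line is assumed to separate paragraphs in the flowed case.

```python
import textwrap

def render(body: str, flavor: str, width: int = 72) -> str:
    """Display a message body according to the sender's marking."""
    if flavor == "preformatted":
        # The sender laid the text out deliberately; show it untouched.
        return body
    if flavor == "flowed":
        # Long unformatted lines: wrap each paragraph to the local width.
        paragraphs = body.split("\n\n")
        return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)
    raise ValueError("unknown flavor: " + flavor)

table = "Name    Qty\nApples    3\nPears    12"
print(render(table, "preformatted"))
print(render("word " * 30, "flowed", width=40))
```

Under this scheme a recipient who wants to reflow preformatted text has to override the marking deliberately, which is the behavior Frank asks for.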

This is the key point for me. You acknowledge the need for flavors of text
other than your preformatted plain text. I thought you were holding out for
one flavor only. Now we can discuss the flavors, such as delimited database
interchange files with lines of arbitrary length. Presumably we can define
them using some of the apparatus that is becoming available in XML or as
MIME data types. Would it make sense, then, to create a formal XML
definition of plain text files, with a leading BOM, no interpretations for
any tags, the minimum set of control characters, and the appropriate set of
transformation formats? That would get around my earlier objection, about
how to make an implementation available on all platforms. What about
corresponding MIME types?
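As one illustration of what a leading BOM buys us, here is a sketch in Python of a reader that picks the transformation format from the BOM alone. The function name and the decision to reject files without a BOM are assumptions; the BOM constants come from Python's standard codecs module.

```python
import codecs

# Longer BOMs are listed first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
# A UTF-16 LE file whose first character is U+0000 would be misread; a real
# definition would have to rule on that case.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(data: bytes) -> str:
    """Decode a file using only its leading BOM to select the encoding."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    raise ValueError("no leading BOM; transformation format unknown")

print(decode_with_bom(codecs.BOM_UTF16_BE + "plain text".encode("utf-16-be")))
```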

>Email issues, especially MIME, are a whole new topic, and a controversial
>one, best avoided here. But a clear statement from the Unicode Consortium
>on plain text that addresses the issue of formatting might motivate the
>"email community" to deal with these issues in a productive way.
>
>> A growing number of standards specify the use of Unicode text files,
>> without explicitly defining them. If we get anywhere with this, we will
>> have to run our proposal past these other groups, including the IETF, the
>> POSIX committee, programming language standards committees, etc.
>>
>Good. Let's try to keep making progress.
>
>We all have an intuitive grasp of the meaning of preformatted plain text.
>You'll find it in many places:
>
> . READ.ME files on your software disks.

Preformatted or reflowable.

> . Program source code.

Preformatted.

> . Traditional (not "legacy") email and netnews.

There is presently no way to specify preformatted or reflowable.

> . Voluminous full-text information already online.

Including Unicode tables and other database interchange formats.

>and so on. We should find a way to carry this notion forward for Unicode
>in a way that:
>
> . Avoids the pitfalls of platform-dependent formatting conventions.
>
> . Allows straightforward and unambiguous conversion of 8-bit data to
> Unicode (and, to the extent possible, vice-versa).
>
> . Is independent of any higher-level protocol, markup language,
> product, or even standard. In other words, the Unicode definition
> should stand entirely on its own so that files encoded (or transmitted)
> in this format will be universally understood for years, decades,
> centuries to come, no matter what else might change, as long as Unicode
> itself lives on.

Hear, hear.
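
The second requirement, conversion of 8-bit data to Unicode and back "to the extent possible", can be sketched in Python as below. The function names are assumptions, and returning None for unrepresentable text is just one possible policy. Every byte of an 8-bit set such as Latin-1 maps to a Unicode character, but the reverse direction can fail.

```python
from typing import Optional

def to_unicode(data: bytes, charset: str) -> str:
    """8-bit data to Unicode, given a known (tagged) character set."""
    return data.decode(charset)

def from_unicode(text: str, charset: str) -> Optional[bytes]:
    """Unicode back to 8-bit data, or None if the charset cannot hold it."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None

every_byte = bytes(range(256))
text = to_unicode(every_byte, "latin-1")
print(from_unicode(text, "latin-1") == every_byte)   # round trip is lossless
print(from_unicode("Привет", "latin-1"))             # Cyrillic does not fit
```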

>- Frank

To summarize your answer to my objections, we are defining a new format
independent of previous conventions, in which we can specify usage of the
minimal set of formatting characters regardless of usage in text files of
7-bit ASCII and 8-bit character sets of any kind, while allowing for a few
variant flavors of text, such as preformatted, reflowable, and database. To
which I add that we can specify a portable implementation, too, and need
not wait for computer and OS vendors to get on board.

Well, apparently there are no hard feelings from Frank over my earlier
harsh words, so perhaps nobody else need be offended on his behalf. In case
anybody missed it elsewhere, I apologize for misunderstanding Frank, and
for giving the impression that I was attacking him personally.

--
Edward Cherlin                        President
Coalition Against Unsolicited Commercial E-mail
Help outlaw Spam.       <http://www.cauce.org/>
Talk to us at             <news:comp.org.cauce>


