Re: Plain Text

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Mon Jul 05 1999 - 12:50:39 EDT


[Ed wrote...]
> It puzzles me even more, then, that Frank writes in his Unicode text
> file proposal as if Unix practice, or more particularly his own practice
> (including practice in file format conversions in cross-platform data
> transfers), is normative, not just for other software, but for file
> formats on other platforms, without saying how this norm is to be
> implemented so that file format conversion ceases to be a problem for
> all applications.
>
I'll try to be more explicit.

Whether we know it or not, text interchange methods are well-established in
the pre-Unicode world, at least at the record-format level (character sets
are another matter, but we know that).

When I sit at my { terminal, terminal emulator, xterm window } and tell the
host to "type" or "cat" a file, the internal text format is translated to
the de facto canonical one, which means primarily that the local convention
for line separation/termination is translated to CRLF. When I transfer a
text file with FTP or any other file transfer protocol I know about, the
same thing happens (see, e.g., RFC 959).
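
In other words, the sender canonicalizes on the way out. A toy sketch in
Python (not any particular implementation; the function name and the
single-string newline argument are my own):

    def to_network_text(local_text, local_newline="\n"):
        # Whatever the local line convention is, each line goes onto the
        # wire terminated by CR LF, as FTP text (ASCII) mode calls for
        # in RFC 959.
        return "\r\n".join(local_text.split(local_newline))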

However, many of us are confused by the fact that local conventions differ,
and perceive this as an obstacle to interchange because, for example, it is
difficult to read a PC diskette on a UNIX workstation or a Macintosh, or
because of the increasing amounts of email we get that uses some encoding or
format we don't understand.

These are problems that we have an opportunity to solve in the conversion of
8-bit text to Unicode.

> How do we get agreement on such a standard from, e.g., Microsoft?
>
Hopefully Microsoft's representatives to the Unicode Consortium will be
supportive, as some of the commentary already seems to indicate.

> How do we get users to stop using current methods?
>
We don't have to. If the Unicode Standard defines what plain text is,
then conversion of 8-bit text to Unicode will put all the divergent
platform-specific formats into the same Unicode format.

> How do we deal with delimited database transfer files with a fixed
> limit on line length?
>
I don't see how these files would be affected. You can put line separators
in them if you want, or leave them out.

> How do we deal with legacy data?
>
How do we convert existing 7-bit and 8-bit plain-text files to Unicode
plain text? The straightforward conversion is:

 . Source line -> Destination line terminated by LS.

This is according to whatever the local definition of "line" is (UNIX,
Macintosh, DOS, VMS, MVS, ...). And of course:

 . Source character set converted to Unicode.

This seems obvious. C0 control characters are kept, including Horizontal
Tab and Form Feed. C1 control characters are kept if the source character
set has them (e.g. a Latin Alphabet) and translated otherwise (e.g. CP850).
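
For concreteness, here is a minimal sketch of that conversion in Python;
the function name, the CP850 default, and the choice of UTF-8 as the
output encoding are mine, purely for illustration:

    import re

    LS = "\u2028"    # U+2028 LINE SEPARATOR

    def to_unicode_plain_text(path_in, path_out, source_charset="cp850"):
        # Source character set converted to Unicode.  Decoding CP850
        # maps its 0x80-0x9F graphics to the corresponding Unicode
        # characters; a Latin alphabet keeps its C1 controls.  HT, FF,
        # and the other C0 controls pass through untouched.
        with open(path_in, "rb") as f:
            text = f.read().decode(source_charset)
        # Each source line, whatever the local convention (CRLF, bare
        # CR, or bare LF), becomes a destination line terminated by LS.
        lines = re.split(r"\r\n|\r|\n", text)
        if lines and lines[-1] == "":
            lines.pop()    # the final newline was a terminator, not a line
        with open(path_out, "w", encoding="utf-8") as f:
            f.write("".join(line + LS for line in lines))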

Additional wrinkles (options), sketched briefly after this list, might include:

 . Tabs expanded to spaces based on the desired tab stops, which should
   be 1,9,17,25,... BY DEFAULT (meaning you can supply your own tab stops).

 . Heuristics might be used to identify paragraphs and to separate them
   by Paragraph Separator. For example, a blank line is replaced by PS.
   Obviously there are pitfalls.

 . Any conversion program would probably need an option to deal with
   files with "word processor" record format, in which a line is really
   a paragraph.
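
Here is a rough sketch of the first two options, again in Python; the
function names and the explicit-tab-stop argument are hypothetical, and
the paragraph heuristic is deliberately naive:

    LS = "\u2028"    # U+2028 LINE SEPARATOR
    PS = "\u2029"    # U+2029 PARAGRAPH SEPARATOR

    def expand_tabs(line, tab_stops=None):
        # Default tab stops every 8 columns: 1, 9, 17, 25, ...
        # A caller can supply an explicit list of stops instead.
        out, col = [], 1
        for ch in line:
            if ch == "\t":
                if tab_stops:
                    nxt = next((t for t in tab_stops if t > col), col + 1)
                else:
                    nxt = col + 8 - (col - 1) % 8
                out.append(" " * (nxt - col))
                col = nxt
            else:
                out.append(ch)
                col += 1
        return "".join(out)

    def join_with_paragraphs(lines):
        # A blank line is replaced by PS; every other line keeps its
        # LS terminator.  (The pitfalls mentioned above apply.)
        return "".join(PS if not line.strip() else line + LS
                       for line in lines)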

> I find myself dealing with Unicode text created by Windows and Windows
> applications quite frequently now, with line ends marked in
> little-endian fashion as
>
> 0D 00 0A 00
>
> What do we do about that?
>
I would say that this practice should be discouraged ("be conservative in
what you 'send'") in any application that creates or saves Unicode text
files. But it should be allowed for ("be liberal in what you 'receive'") in
any conversion/import program.
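
To illustrate the liberal side, an importer might fold every line-end
convention it meets in incoming Unicode text into LS. A sketch, with the
encoding argument only an assumption (a real importer would check for a
BOM first):

    import re

    LS = "\u2028"    # U+2028 LINE SEPARATOR

    def import_unicode_text(raw, encoding="utf-16-le"):
        # Accept CR LF (0D 00 0A 00 in little-endian UTF-16), bare CR,
        # bare LF, or NEL in what we "receive", and normalize them all
        # to the LINE SEPARATOR.
        text = raw.decode(encoding)
        return re.sub("\r\n|\r|\n|\x85", LS, text)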

> I entirely agree that cross-platform protocols should be defined so that
> we stop having conversion problems (such as translating text file formats
> upon transfer, as ftp does), but it can't be done within a character set
> standard, nor by defining a text file format without file format handling
> for applications on different platforms.
>
I don't think anybody can presume to offer a panacea for differing
application formats, other than to define a text-file format that can be
used for export/import/interchange, as we have now with most popular
applications. We simply need to extend this idea to Unicode.

> I have had to collect or in some cases write conversion routines for text
> file transfer, including text files in ASCII, 8-bit character sets, and
> Unicode. I would much rather have the operating systems do it.
>
The operating system doesn't know what format or encoding is used in a
file. It would be nice if this information were saved along with the file,
but it usually isn't. If, in the transition to an all-Unicode computing
environment, we specify not only the encoding but also a standard record
format for interchange of plain text -- including (but not requiring)
preformatted plain text -- we won't have to worry about operating systems,
file systems, or presentation-layer issues in text-file transfer ever
again.

Obviously we will always have to worry about format conversions between
applications that do NOT use plain text data files. But by defining a
low-level baseline format for plain text, there will always be a method for
recording and transmitting textual information that rises above ("sinks
below") those differences, and that can always be used across platforms,
distance, and time.

> ... You acknowledge the need for flavors of text
> other than your preformatted plain text. I thought you were holding out
> for one flavor only. Now we can discuss the flavors, such as delimited
> database interchange files with lines of arbitrary length. Presumably we
> can define them using some of the apparatus that is becoming available in
> XML or as MIME data types.
>
No, those are higher-level protocols that will go out of fashion some day,
probably sooner than you think. Of course you can define or use all the
higher level protocols you want, but you should bear in mind they are
ephemeral. If you want something that lasts forever, do it in Unicode
without reference to MIME, *ML, or anything else, and keep it extremely
simple.

> To summarize your answer to my objections, we are defining a new format
> independent of previous conventions, in which we can specify usage of the
> minimal set of formatting characters regardless of usage in text files of
> 7-bit ASCII and 8-bit character sets of any kind, while allowing for a few
> variant flavors of text, such as preformatted, reflowable, and
> database.
>
Yes.

> To which I add, that we can specify a portable implementation,
> too, and not have to wait for computer and OS vendors to get on board.
>
Double yes.

- Frank


