Re: Multi-Lingual Project Gutenberg (was: Unicode plain text)

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Wed May 28 1997 - 10:27:33 EDT


> From other replies I've received I guess I wasn't clear about my
> point. Within the domain of "plain text" Unicode is doing a lot
> to raise the common denominator. This is great, but a sentiment
> has been expressed in this thread that higher level protocols are
> a hopeless mess and if you want portability, stick with plain
> text. In the near term that may be a reality but Unicode was
> born out of frustration with the existing mess of character
> encoding standards and a determination to make things better.
>
> I was simply making the observation that swearing off high level
> protocols because they are messy now seems very out of character
> with the spirit of Unicode.
>
Nobody advocates stamping out higher level protocols, even if that were
possible. We all use them all the time. I, for one, use them with my
eyes open -- i.e. with full knowledge that all the work I put into
creating a "rich" document will need to be done again at some point when
the current "standard" for richness has been replaced by a new one if I
want the document to survive. And again. And again.

I remember the excitement when it first became possible to produce
typeset-quality documents with Troff, R, DSR, Scribe, TeX, and their
relatives. But I also continued to produce plain-text "documents" on a
daily basis: email; netnews; computer programs in assembly language,
Sail, Simula, C, Fortran, Pascal, PL/I, etc; online documentation that
had to be portable to hundreds of platforms; plain-text record-oriented
databases -- mailing lists for example. There is no reason for most of
this sort of information to be "rich", and no reason why this type of
work should not continue in Unicode.

What is needed is emphatic allowance and support for Unicode plain text
in the Unicode standard, i.e. a precise and thorough definition of what
constitutes a self-contained preformatted plain-text document. This is
primarily a matter of adopting a small but complete set of control codes
needed for line breaks, paragraph breaks, page breaks, and direction
control (most of these are already there), and a clear statement of the
role of the "traditional" control characters at U+0000 - U+001F, U+007F,
and U+0080 - U+009F.
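To make the idea concrete, here is a rough sketch (an illustration only,
not anything the standard defines) of how a program might report which
control and layout characters a plain-text document actually uses. The
function names classify and audit, and the choice of which characters to
list in LAYOUT_CONTROLS, are assumptions made for this example. The C0
range, DEL, the C1 range (U+0080 - U+009F), and the listed separator and
directional characters are the standard code points.

```python
# Illustrative sketch only: classify() and audit() are hypothetical names.

# Layout and direction characters that Unicode already defines.
# Which ones a "plain text" profile should allow is an assumption here.
LAYOUT_CONTROLS = {
    0x2028: "LINE SEPARATOR",
    0x2029: "PARAGRAPH SEPARATOR",
    0x200E: "LEFT-TO-RIGHT MARK",
    0x200F: "RIGHT-TO-LEFT MARK",
    0x202A: "LEFT-TO-RIGHT EMBEDDING",
    0x202B: "RIGHT-TO-LEFT EMBEDDING",
    0x202C: "POP DIRECTIONAL FORMATTING",
    0x202D: "LEFT-TO-RIGHT OVERRIDE",
    0x202E: "RIGHT-TO-LEFT OVERRIDE",
}


def classify(ch):
    """Return a label for a control or layout character, else None."""
    cp = ord(ch)
    if cp <= 0x1F:
        return "C0"   # traditional ASCII controls (LF, CR, FF, TAB, ...)
    if cp == 0x7F:
        return "DEL"
    if 0x80 <= cp <= 0x9F:
        return "C1"   # the ISO 8-bit control range
    return LAYOUT_CONTROLS.get(cp)


def audit(text):
    """List (offset, code point, label) for each control or layout character."""
    found = []
    for offset, ch in enumerate(text):
        label = classify(ch)
        if label is not None:
            found.append((offset, "U+%04X" % ord(ch), label))
    return found
```

For the string "a\nb\u2028c\x85", audit reports a C0 control at offset 1,
a LINE SEPARATOR at offset 3, and a C1 control at offset 5. A precise
definition of plain text would say which of these are legal and what
each one means.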

And outside the scope of the Unicode standard is the problem of properly
tagging files in the file system. This has never been done right, on
any operating system. The use of the "extension" (the part of the name
after the dot, e.g. "DOC") is just plain silly, especially now that
GUI-based operating systems are using this to associate applications
with files -- click on a data file, launch the associated application on
that file. What's silly about it is that anybody can name a file any
way they please and there is no registration authority for extensions;
conflicts inevitably arise -- sometimes with disastrous consequences.
Even sillier is the idea that each file must belong to one and only one
application.

Plain text files can be used by many applications, but how do we mark
them as being written in Unicode? Or Latin-1? Or JIS X 0208, etc.?
Ideally there should be information in the directory entry to specify
the file type and encoding. That's an issue for each OS maker, but one
whose resolution is long overdue.
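Lacking such a tag in the directory entry, about the best an application
can do is look inside the file. The sketch below shows one such in-band
workaround, checking for a leading byte order mark (U+FEFF). This is an
illustration of the guessing that results, not the out-of-band tagging
called for above. The function name sniff_bom is hypothetical; the
codecs.BOM_* constants are from the Python standard library. UTF-32 is
left out of this sketch.

```python
import codecs

# Byte order marks this sketch recognizes, with the encoding each implies.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(data):
    """Guess an encoding from a leading byte order mark, else return None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No mark: could be ASCII, Latin-1, JIS X 0208, or unmarked Unicode.
    # Nothing in the bytes themselves settles the question.
    return None
```

Note that a Latin-1 file and a JIS X 0208 file both come back as None,
which is exactly the ambiguity that a file-system-level tag would remove.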

- Frank


