Re: Unicode plain text

From: Pierre Lewis (lew@nortel.ca)
Date: Sun May 25 1997 - 10:50:00 EDT


In message "Re: Unicode plain text", 'timpart@perdix.demon.co.uk'
writes:

> We seem to have two different requirements for plain text here.
> ...
>
> First the type I had assumed as the default.
> I would call this logical formatting.
> ...

This first type (usually the result of "save as text" from some word
processor) always causes me trouble, and I usually have to reformat it
before I can do anything with it (such as printing it).

> The second type I would call physical formatting.
> The text has already been formatted by the author into lines and
> paragraphs...

I think the second type is by far the most common and is what I
consider to be plain text:

o It's the format of all RFCs, perhaps the most widely-read plain-text
   files around,

o It's the format of the vast majority of email and Usenet posts I read
   (but I do see some type 1 stuff),

o It's the format of much of the e-documentation that comes with a lot
   of S/W (e.g. Linux, TeX (at least installation), X11, ...),

o It's the natural format of all a2ps (ascii-to-postscript) converters
   I've come across, and (last but not least)

o It's the format chosen by Project Gutenberg, the wonderful collection
   of English texts. I have a dream here, of a multi-lingual Project
   Gutenberg with classics in various languages, and, of course, in
   plain-text Unicode....

   (URL: ftp://uiarchive.cso.uiuc.edu/pub/etext/ )

I'd be really curious to see how one would express RFC2070, on
"Internationalization of the Hypertext Markup Language", as a type 1
plain-text file (for those looking for a challenge: the type 2
plain-text file of this RFC is at:
http://ds.internic.net/rfc/rfc2070.txt).

Of course, type 2 does rest on some assumptions.

> * The author knows exactly how many characters fit on a line. (Often
> there is also the assumption that each character is fixed width.)

True enough, and that may break down somewhat with ideograms (surely
one can't fit 80 of those on a line). But, in general, staying under 80
chars will give a plain-text file that most people can print. I rarely
have trouble printing a plain-text file of this second type. And I
think this will work with a lot of scripts, e.g. Russian, Greek,
Hebrew, Arabic.
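
To make the "under 80 chars" rule a bit more concrete, here is a
minimal sketch in Python of a width check for a type 2 file. It assumes
that wide and fullwidth East Asian characters occupy two columns on a
fixed-width device and that combining marks occupy none; both are
simplifications, and the names display_columns and overlong_lines are
made up for this illustration.

```python
import unicodedata


def display_columns(line):
    """Estimate how many fixed-width columns a line occupies."""
    cols = 0
    for ch in line:
        if unicodedata.combining(ch):
            # Assumption: combining marks sit on the previous character.
            continue
        # Assumption: wide (W) and fullwidth (F) characters take 2 columns.
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols


def overlong_lines(text, limit=79):
    """Return (line_number, columns) for every line wider than limit."""
    found = []
    # Split on "\n" only, so that a form feed does not count as a line end.
    for number, line in enumerate(text.split("\n"), 1):
        cols = display_columns(line)
        if cols > limit:
            found.append((number, cols))
    return found
```

With this estimate, a line of 40 ideograms is already as wide as a line
of 80 Latin letters, which is the breakdown mentioned above.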

> * The author knows exactly how many lines fit on a page.

Most plain-text files have no FFs, but when they do (as RFCs do), it's
not too difficult to be conservative so that again most folks can print
them with no problem. I don't see FFs as being on the slippery slope to
pretty text. Besides their use in RFCs (so the TOC can be paginated),
they're also often used to separate "chapters". For example, I'll save
all the posts on the current threads, and I'll probably put an FF
between each one so that, if/when I print the whole thing, I'll get
each post to start on a new page.
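
As a rough sketch of that habit in Python: the first function joins
saved posts with an FF between each, and the second reports pages
(stretches of text between FFs) that run past a conservative line
count. The function names and the 58-line default are assumptions made
for this illustration; pick whatever limit suits the printers you care
about.

```python
FF = "\f"  # form feed, U+000C


def join_posts(posts):
    """Concatenate posts so that each one starts on a new page."""
    # Normalize every post to end in exactly one newline, then put
    # a single FF between consecutive posts.
    return FF.join(post.rstrip("\n") + "\n" for post in posts)


def pages_too_long(text, max_lines=58):
    """Return (page_number, line_count) for pages over max_lines."""
    too_long = []
    for number, page in enumerate(text.split(FF), 1):
        lines = page.split("\n")
        if lines and lines[-1] == "":
            lines.pop()  # ignore the empty piece after a final newline
        if len(lines) > max_lines:
            too_long.append((number, len(lines)))
    return too_long
```

For example, pages_too_long(join_posts(saved_posts)) lists the posts
that would spill onto a second sheet, where saved_posts stands for
whatever list of strings you have collected.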

> * The author knows in which sequence the characters in a line will
> be printed. (Usually assumes left to right without any reordering.)

That's where it gets interesting (and why I had a few questions a few
days ago). The only ordering possible within the plain-text Unicode
file is of course logical. So that means a bit more intelligence in the
a2ps conversion or in the display engines. Or, in despair, such a file
could be put through a filter that would reorder it into visual
ordering for local consumption.
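
A deliberately naive sketch of such a filter in Python, to show the
idea only: it assumes a left-to-right base direction and simply
reverses each stretch of right-to-left characters (bidirectional
classes R and AL), together with any spaces or punctuation sandwiched
inside the stretch. It is not the Unicode bidirectional algorithm:
digits, mirrored brackets, combining marks and explicit directional
controls are all mishandled. The name naive_visual_line is made up for
this illustration.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # Hebrew-like and Arabic-like letters


def is_rtl(ch):
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def naive_visual_line(line):
    """Reverse right-to-left runs of one logical-order line."""
    out = []
    i = 0
    n = len(line)
    while i < n:
        if not is_rtl(line[i]):
            out.append(line[i])
            i += 1
            continue
        # Extend the run up to the next strong left-to-right character,
        # remembering where the last right-to-left character was.
        j = i
        last_rtl = i
        while j < n and unicodedata.bidirectional(line[j]) != "L":
            if is_rtl(line[j]):
                last_rtl = j
            j += 1
        # Reverse only up to the last right-to-left character, so that
        # trailing spaces and punctuation keep their position.
        out.append(line[i:last_rtl + 1][::-1])
        i = last_rtl + 1
    return "".join(out)
```

Applied line by line to a type 2 file, this would let a dumb
left-to-right printer show simple Hebrew or Arabic words the right way
around; anything more demanding needs the real algorithm in the
display engine.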

In summary, notwithstanding some difficulties, I still think a
plain-text Unicode file of the second type above makes perfect sense
and would be very useful. I'm still not too sure exactly how I would
encode it (with respect to control characters), but this thread has
been quite helpful.

Btw, this type 1 vs type 2 is a very useful distinction, and I think
therein lies the source of much confusion in the current threads.

Pierre
lew@nortel.ca

P.S. It's probable that my view of things is somewhat colored by my
Unix bigotry. But still...


