Re: Unicode plain text

From: Timothy Partridge (timpart@perdix.demon.co.uk)
Date: Mon May 26 1997 - 13:57:51 EDT


Pierre Lewis recently said:

> This first type (usually the result of "save as text" from some WP)
> always causes me trouble and I usually have to reformat it before I can
> do anything with it (such as printing it).

In my opinion a Unicode renderer should cope with this automatically
and divide paragraphs up into lines for you. This is mostly because
of the intelligence of the BiDi algorithm. What you won't get is
page headers and footers and page numbers since there is no way to
specify them in Unicode plain text. Is there general agreement that
text that is only split into paragraphs should be rendered properly
by a Unicode engine, i.e. that it is acceptable as plain text?
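
To make the "divide paragraphs up into lines" step concrete, here is a
minimal sketch in Python. It assumes paragraphs are delimited by U+2029
PARAGRAPH SEPARATOR, that every character occupies one fixed-width column,
and that the text is left-to-right only; it does no BiDi processing at all.
The function name and the 72-column default are illustrative choices, not
anything taken from the Unicode standard.

```python
import textwrap

PARAGRAPH_SEPARATOR = "\u2029"  # Unicode PARAGRAPH SEPARATOR


def wrap_paragraphs(text, width=72):
    """Break paragraph-only ("type 1") text into lines of at most `width` characters.

    Returns one list of lines per paragraph. Assumption: one character is
    one column, which does not hold for ideographs or combining marks.
    """
    wrapped = []
    for para in text.split(PARAGRAPH_SEPARATOR):
        # textwrap.wrap returns [] for an empty paragraph; keep it as one empty line.
        lines = textwrap.wrap(para, width=width) or [""]
        wrapped.append(lines)
    return wrapped


sample = (
    "A renderer should divide paragraphs up into lines for you."
    + PARAGRAPH_SEPARATOR
    + "This is the second paragraph."
)
for lines in wrap_paragraphs(sample, width=40):
    print("\n".join(lines))
    print()
```

A real renderer would measure glyph widths and apply the BiDi algorithm to
each paragraph before or while breaking lines; this only shows the greedy
line-filling part.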
 
> I think the second type is by far the most common and is what I
> consider to be plain text:
>
> o It's the format of all RFCs, perhaps the most widely-read plain-text
> files around,
[snip]
> o It's the format chosen by project Gutenberg, the wonderful collection
> of English texts. I have a dream here, of a multi-lingual project
> Gutenberg with classics in various languages, and, of course, in
> plain-text Unicode....
>
> (URL: ftp://uiarchive.cso.uiuc.edu/pub/etext/ )
>
> I'd be really curious to see how one would express RFC2070, on
> "Internationalization of the Hypertext Markup Language", as a type 1
> plain-text file (for those looking for a challenge: type 2 plain-text
> file of this RFC is at: http://ds.internic.net/rfc/rfc2070.txt).

Can I have the original source please! I suspect that documents like this
have been prepared in some markup language and sent through something like
troff.

> Of course, type 2 means some assumptions.
>
> > * The author knows exactly how many characters fit on a line. (Often
> > there is also the assumption that each character is fixed width.)
>
> True enough, and that may break down somewhat with ideograms (surely
> one can't fit 80 of those on a line). But, in general, staying under 80
> chars will give a plain-text file that most can print. I rarely have
> trouble printing a plain-text file of this second type. And I think this
> will work with a lot of scripts, eg. Russian, Greek, Hebrew, Arabic.

I'm not so sure that fixed-width Arabic will look good, but the general point
holds. But should I need to fiddle with point sizes if Unicode renderers will
accept type 1 text?

Type 2 text is very common. And it is the published form. In some cases the
original marked-up text will have been lost. Where it hasn't, a Unicode type 1
style plain text file could be produced from the original.

I dug out some troff documentation and it says that the plain text output is
a representation that is an approximation to the printed page. I suggest
that much of the type 2 text is in this form, i.e. formatting *including* BiDi
has already been carried out. Does anyone have examples of mixed direction
text in RFC style format that could confirm this?

I think that for type 2 physical format files Unicode rendering is *too*
intelligent and would scramble the preformatted lines if they contained BiDi
text. (As well as getting horribly confused by the NLs which presumably have
been converted to Line Separator.)

I would propose a new control code, Disable BiDirectional Processing, which
would switch off BiDi altogether. It could be used with physical format files
so that they come out as intended. (There needs to be an Enable code as well.)
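
As a rough illustration of how a renderer might honour such a pair of
codes, the Python sketch below splits a text into runs and records for each
run whether BiDi processing is switched on. The two code points are purely
hypothetical stand-ins taken from the Private Use Area; no such controls
exist in Unicode, and the function name is likewise made up for this
example. BiDi is assumed to be enabled at the start of the text.

```python
# Hypothetical stand-ins: these controls are a proposal, not part of Unicode.
# Private Use Area code points are used here only for illustration.
DISABLE_BIDI = "\uE000"
ENABLE_BIDI = "\uE001"


def split_bidi_runs(text):
    """Return a list of (bidi_enabled, run) pairs.

    The control codes themselves are consumed and do not appear in the runs.
    """
    runs = []
    enabled = True  # assumption: BiDi processing is on by default
    current = []
    for ch in text:
        if ch in (DISABLE_BIDI, ENABLE_BIDI):
            if current:
                runs.append((enabled, "".join(current)))
                current = []
            enabled = ch == ENABLE_BIDI
        else:
            current.append(ch)
    if current:
        runs.append((enabled, "".join(current)))
    return runs


text = "reflowable " + DISABLE_BIDI + "preformatted line" + ENABLE_BIDI + " reflowable"
for enabled, run in split_bidi_runs(text):
    print(enabled, repr(run))
```

A renderer would hand the runs marked True to the BiDi algorithm and emit
the runs marked False in stored order, which is what a physical format file
needs.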

I'll also allow you a Page Separator. This would be treated as a block
separator by BiDi and would cause a new page to be started.
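
A small Python sketch of that behaviour, again using a hypothetical Private
Use Area code point as the Page Separator (nothing of the kind is defined
in Unicode; the name and value are assumptions for this example). Each page
is further split on U+2029 PARAGRAPH SEPARATOR, so a page boundary also
ends the current block.

```python
PAGE_SEPARATOR = "\uE002"       # hypothetical: proposed here, not in Unicode
PARAGRAPH_SEPARATOR = "\u2029"  # Unicode PARAGRAPH SEPARATOR


def split_pages(text):
    """Return a list of pages, each page being a list of paragraphs.

    The page separator acts as a block separator: no paragraph spans
    two pages.
    """
    return [page.split(PARAGRAPH_SEPARATOR) for page in text.split(PAGE_SEPARATOR)]


doc = "First para" + PARAGRAPH_SEPARATOR + "Second para" + PAGE_SEPARATOR + "Next page"
for number, page in enumerate(split_pages(doc), start=1):
    print("page", number, page)
```

Each paragraph in the result could then be passed to BiDi processing and
line division independently.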

The introduction of a new control code would mean that existing text that uses
the current standard would work in the same way, but additional control could
be given to text that needs it.

   Tim

-- 
Tim Partridge. Any opinions expressed are mine only and not those of my employer


