Re: Unicode plain text

From: Timothy Partridge (timpart@perdix.demon.co.uk)
Date: Sat May 24 1997 - 11:46:02 EDT


We seem to have two different requirements for plain text here.
Now my assumption was that we would mostly want to use one type, whereas
there seems to be a strong demand for another. At the risk of teaching
you all to suck eggs I will contrast and compare them at some length.
I hope you will find a useful point or two.

First the type I had assumed as the default.
I would call this logical formatting.

Paragraph Separator is most commonly used. Text usually runs on without
any control characters until a new paragraph is needed. Since this
is logical formatting the author does not know or care whether a
paragraph is indicated by a completely blank line, by a new line
started with an indent, or by some other convention.

Occasionally the author wants a line break (for example in a bulleted list)
without having a paragraph break indicated, so Line Separator (LS) is used.
Since LS effectively delimits a new section of the text there is no problem
with BiDi treating it as a new block, indeed this behaviour is desirable.

Rarely the author wants a Page Separator. This would typically be used in
a long work when a new chapter was started. BiDi should treat this as ending
the current paragraph and therefore it separates blocks.

The assumptions behind this GUI style approach include:
* The renderer is capable of splitting paragraphs up into printed lines and
  starting new pages as needed.
* No particular font or paper size is assumed.
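
As an illustration of the renderer's job under these assumptions, here is a
minimal Python sketch. It assumes the text uses U+2029 PARAGRAPH SEPARATOR and
U+2028 LINE SEPARATOR, and it picks a blank line as its paragraph convention.
The function name `render` and the `width` parameter are made up for this
example, and no BiDi processing is attempted.

```python
import textwrap

PS = "\u2029"  # PARAGRAPH SEPARATOR
LS = "\u2028"  # LINE SEPARATOR

def render(text, width):
    """Lay out logically formatted text for whatever line width the renderer has."""
    out = []
    for para in text.split(PS):
        if not para.strip():
            continue  # ignore empty paragraphs
        # Each LS-delimited segment starts on a fresh line but stays in the
        # same paragraph (for example, the items of a bulleted list).
        for segment in para.split(LS):
            out.extend(textwrap.wrap(segment, width) or [""])
        # This renderer's own convention: a blank line between paragraphs.
        out.append("")
    return "\n".join(out).rstrip("\n")
```

The same stored text can be laid out for any width, which is the point of
leaving the line breaks to the renderer.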

     The second type I would call physical formatting.
     The text has already been formatted by the author into lines and
paragraphs. (Just as I have done with this e-mail. I've also changed my
paragraph convention.)
     This typically uses characters like Carriage Return (CR) and Line Feed (LF)
to represent a New Line (NL). NL is typically represented by CR LF, LF CR,
CR or LF. Form Feed (FF) is also used. This moves the paper to the start of
a new page. It may also move the printing position to the start of the line.
(If it doesn't, sequences like CR FF or NL FF are used.) I have deliberately
used the names NL and FF rather than LS and PS to make a distinction.
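     A reader of physical text usually has to fold the four representations
of NL listed above into one before doing anything else. A minimal Python
sketch follows; the name `normalize_newlines` is invented for this example,
and the choice of LF as the single internal form is an assumption.

```python
import re

# Two-character forms come first so that CR LF and LF CR each count as one NL.
_NL = re.compile(r"\r\n|\n\r|\r|\n")

def normalize_newlines(text):
    """Replace CR LF, LF CR, lone CR and lone LF with a single LF."""
    return _NL.sub("\n", text)
```

     This is only a heuristic. Text that mixes conventions can be ambiguous,
since a run such as CR CR LF might have been meant as one NL or as two.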
     NL (represented in some way) is the most common control character since
the position of line breaks has already been determined. There are many NLs
within each paragraph. Since NL usually does not denote any logical division
in the text it is extremely annoying if the BiDi algorithm treats it as a new
block. An author may sometimes force a new line for the same reasons noted
for LS in the first example. In this case it is useful for BiDi to treat it
as a new block, *but* since NL is also used for ordinary line breaks it is
very hard to distinguish this use.
     Paragraphs often don't have any control code to denote them, but can be
detected by looking for sequences like NL NL or NL Space Space Space Space.
If no explicit paragraph code is present, the BiDi algorithm has a very
hard job. If paragraph codes are present, then the text is an unusual hybrid
between physical and logical formatting and may be post-processed physical
text.
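     The two detection rules mentioned above (NL NL, or NL followed by four
spaces) can be sketched in Python as below. The name `guess_paragraphs` is
invented for this example, and the sketch assumes that every NL has already
been normalised to a single LF. It is a guess at the author's intent, not a
reliable parser: an indented list item or quotation would be mistaken for a
new paragraph.

```python
import re

# A paragraph boundary is either a run of two or more NLs, or a single NL
# that is followed by an indent of four spaces.
_PARA_BREAK = re.compile(r"\n{2,}|\n(?= {4})")

def guess_paragraphs(text):
    """Split physically formatted text into a list of unwrapped paragraphs."""
    parts = _PARA_BREAK.split(text)
    # Inside a paragraph the remaining NLs are only line wrapping, so they
    # are folded into single spaces along with the leading indent.
    return [" ".join(p.split()) for p in parts if p.strip()]
```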
     Page breaks are preformatted using FF and occur quite frequently. Like
NL, page breaks usually do not denote any logical boundary in the text. Again
the BiDi algorithm should not take any notice. Rarely a FF is inserted by the
author for the same reasons as PS - in this case BiDi should take note but
will have a hard time distinguishing the usage.
     The assumptions behind this explicit approach include:
     * The text will go straight to a printer that is not very bright.
     * The author knows exactly how many characters fit on a line. (Often
       there is also the assumption that each character is fixed width.)
     * The author knows exactly how many lines fit on a page.
     * The author knows in which sequence the characters in a line will
       be printed. (Usually assumes left to right without any reordering.)

My personal preference is for the first approach. Yes, there is text around
which has been preformatted, but most of this is not in Unicode. When it is
converted to Unicode shouldn't the conversion process try to change physical
conventions to logical conventions at the same time?

If I were writing a new document in Unicode I would not try preformatting it.
I don't know how large your paper is for a start. (I've had a painful
experience trying to print on 6 and two thirds inch paper a document that had
been formatted for seven inch paper, with form feeds at the bottom of each page.)

Some media, such as e-mail which we are using at the moment, require preformatted
text because I don't know whether the recipient's reader can cope with
continuous text. If I'm sending Unicode text can't I make a reasonable assumption
that the recipients all have renderers capable of handling continuous text?
Are line breaks just needed so that the transmission medium doesn't break?

   Tim

-- 
Tim Partridge. Any opinions expressed are mine only and not those of my employer


