Re: Plain Text

From: John Cowan (cowan@locke.ccil.org)
Date: Tue Jul 06 1999 - 11:34:15 EDT


Frank da Cruz wrote:

> We don't have to. If the Unicode Standard defines what plain text is,
> then conversion of 8-bit text to Unicode will put all the divergent
> platform-specific formats into the same Unicode format.

Or some other widely accepted source of standardization, such as
Oasis or ECMA or ISO or even W3C (though the first three, IMHO,
have a better "fit" to the subject matter).

> C1 control characters are kept if the source character
> set has them (e.g. a Latin Alphabet) and translated otherwise
> (e.g. CP850).

I take this to mean "Characters 0x80 to 0x9F are zero-extended
(to U+0080 through U+009F) if the source character set has C1
characters; if it does not (like CP850, CP1252, or VISCII), they
are translated to their proper Unicode graphic equivalents."
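
A rough Python sketch of that reading, using only the standard
codecs. The helper name and the sample bytes are illustrative
assumptions, and VISCII is omitted here:

```python
def code_points(data, charset):
    """Decode bytes with the named charset and list the resulting code points."""
    return [f"U+{ord(ch):04X}" for ch in data.decode(charset)]

# Latin-1 has C1 controls at 0x80-0x9F, so the byte is zero-extended.
print(code_points(b"\x85", "latin-1"))  # ['U+0085']

# CP850 and CP1252 put graphic characters there, so the byte is translated.
print(code_points(b"\x80", "cp850"))    # ['U+00C7']
print(code_points(b"\x80", "cp1252"))   # ['U+20AC']
```

Note that a few CP1252 positions in this range are unassigned, so
a strict decoder may reject them rather than translate them.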

> . Heuristics might be used to identify paragraphs and to separate them
> by Paragraph Separator. For example, a blank line is replaced by PS.
> Obviously there are pitfalls.

Indeed. Blank lines in source code, for example, are not
necessarily paragraph marks. How well a converter copes with such
cases might reasonably be treated as a quality-of-implementation
(QOI) issue.
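
As a sketch of the blank-line heuristic (Python; the function name
is made up for illustration, and it makes no attempt to recognize
source code or other preformatted text, which is exactly the
pitfall):

```python
import re

PS = "\u2029"  # PARAGRAPH SEPARATOR

def mark_paragraphs(text):
    """Replace each run of blank lines with a single Paragraph Separator."""
    # Fold CR LF and bare CR to LF first so one pattern covers all three.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A line end followed by one or more empty (or whitespace-only) lines.
    return re.sub(r"\n(?:[ \t]*\n)+", PS, text)

print(repr(mark_paragraphs("first line\nsecond line\n\nnext paragraph")))
```

Single line breaks inside a paragraph are left alone here; whether
to keep or fold them is a separate decision.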
 
> . Any conversion program would probably need an option to deal with
> files with "word processor" record format, in which a line is really
> a paragraph.

Note that arbitrary-length lines do not meet the MIME definition
of "text" (nor does UTF-16 text); such things should really
have a media type of "application/character-stream" or the like,
analogous to "application/octet-stream" but with a charset
parameter.

> > 0D 00 0A 00
> >
> > What do we do about that?
> >
> I would say that this practice should be discouraged ("be conservative in
> what you 'send'") in any application that creates or saves Unicode text
> files. But it should be allowed for ("be liberal in what you 'receive'") in
> any conversion/import program.

Does this Windows-Unicode text always have a proper little-endian BOM,
as I believe it does?

If so, then the only problem is the precise value of the line
terminator. In practice, much of the Unicode text (perhaps
all of it) in the world today uses old line terminators, and
I think they must be explicitly allowed in a flexible definition of
preformatted Unicode plain text, even if tagged with SHOULD NOT.
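
A liberal reader along those lines might look like this in Python.
This is only a sketch: the function name is hypothetical, and
refusing BOM-less input is my assumption, not a requirement anyone
has stated.

```python
import codecs

def read_utf16_lines(data):
    """Decode BOM-tagged UTF-16 bytes and split on old or new line terminators."""
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        text = data[2:].decode("utf-16-be")
    else:
        raise ValueError("no UTF-16 byte order mark")
    # str.splitlines() accepts CR, LF, CR LF, and also LS and PS.
    return text.splitlines()

# "a", CR LF as 0D 00 0A 00, then "b", all little-endian after the BOM.
sample = b"\xff\xfe" + b"a\x00\x0d\x00\x0a\x00b\x00"
print(read_utf16_lines(sample))  # ['a', 'b']
```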
 
> No, these are higher-level protocols that will go out of fashion some day,
> probably sooner than you think. Of course you can define or use all the
> higher level protocols you want, but you should bear in mind they are
> ephemeral.

SGML is almost as old, as computer things go, as plain text. Though it
was not standardized until 1986, it was devised in 1974; ASCII itself
only dates to 1963 or so.

Moreover, unlike most file formats, SGML is character-based,
not octet-based, and does not depend on any specific processing
application, so whatever process refreshes Unicode data will
refresh SGML data too. (XML is merely a special case of SGML.)
I agree that preformatted plain text should not depend on SGML,
though; that is putting Cart before Horse.

[snip]

> Yes.

[snip]

> Double yes.

Sounds like a case of violent agreement.

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@ccil.org
   Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau,
   Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
			-- Coleridge / Politzer


