Unicode plain text (Was: Line Separator Character)

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon May 19 1997 - 13:47:55 EDT

Crank, er... Frank,

>> HT and FF were very commonly used...
>(And still are...) Now there's an interesting point. Unicode has addressed
>the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it
>sometimes just as necessary to specify a hard page break as it is to specify a
>hard line or paragraph break?

You can still use U+000C FORM FEED in Unicode plain text, and a renderer that
knows about page breaks can do the "right thing", namely whatever it did with
^L for an ASCII text. FORM FEED, like HORIZONTAL TAB, was not considered to
be ambiguous enough in usage (unlike CR/LF) to require any separate encoding
in Unicode.
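
As a minimal sketch of that "right thing", assuming a renderer that treats
every U+000C as a hard page break (the function name split_pages is
hypothetical, not taken from any standard or library):

```python
FORM_FEED = "\u000C"  # same code point as ASCII ^L


def split_pages(text):
    """Split plain text into pages at each U+000C FORM FEED."""
    return text.split(FORM_FEED)


pages = split_pages("Chapter 1\u000CChapter 2")
for number, page in enumerate(pages, start=1):
    print(number, page)
```

Nothing here is specific to Unicode; the same split works unchanged on
ASCII text, which is the point.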

> In any case, the strong Use-A-GUI thrust of Unicode will make it increasingly
> difficult for certain kinds of people to operate in the ways to which they
> have become accustomed over the past decades in which plain text was "good
> enough" save that one could not put lots of languages into it.

The goal of Unicode plain text is to recapture that portability in the
encoding, but also allow you to put lots of languages into it. The "Use-A-GUI
thrust" of Unicode acknowledges the fact that rendering of complex scripts
(including the Latin script with generative use of combining marks) requires
logic that is much more amenable to implementation in a GUI framework than in
a terminal model. However, appropriate (and very large and useful) subsets of
Unicode *can* be implemented with simple rendering models. (Cf. Windows NT
until very recently. :-) )

> I can move this letter to practically any
> other platform and it will still be perfectly legible and printable -- no
> export or import or conversion or version skew to worry about. I think a lot
> of people would be perfectly happy to do the same in a plain-text Unicode
> world using plain-text Unicode terminals and printers, if there were such
> things.

That is exactly what Unicode plain text is all about. And, by the way,
Notepad on Windows NT was pretty close to being a "plain-text Unicode terminal".

> The idea that one must embed Unicode in a higher level wrapper (e.g. a
> Microsoft Word document, or even HTML) to make it useful has a certain
> frightening consequence: the loss of any expectancy of longevity for our new
> breed of documents.

There is absolutely nothing new about this. I was warning my linguistic
colleagues about the longevity of their documents when they started using
WordStar back around 1982/83. 7-bit ASCII is the only encoding that stayed
stable enough and was widely enough implemented to retain easy transmissibility
across the computer generations without the intervention of information
archaeologists. Well, 16-bit Unicode plain text is aimed at no less a
goal than being the universal wide-ASCII plain text of the 21st century.

Grumpy aside: This goal is not helped by people who treat Unicode as
a standards dumping ground for assigning numbers to everybody's favorite
collection of junk vaguely related to text, or who try to infiltrate
mechanisms (such as language tags) that do not belong in plain text.

> So let's do our part and make some effort to accommodate traditional
> plain-text applications in Unicode, rather than discourage them :-)

I agree completely. An excellent example of the appropriate place for
a Unicode plain-text editor would be a Java IDE. If someone writes
a good Unicode plain-text editor for such an application, it would
have wider applicability. (I know I often use the editors of C++
IDEs to create (ASCII) plain text when I don't want it all gummed up
as a Word or Frame document.)

Ed Cherlin commented:

> We want to have a uniform, portable definition of the meaning of a file of
> 16-bit character codes interpreted as Unicode, or "Unicode text file" for
> short. At the same time, we have several uses for such files, where
> different interpretations may be desired. If we want to do this right, I
> think we have to find the appropriate organization for defining such file
> formats and uses, and get down to some serious and at times difficult
> standard making. The Unicode character code standard does not seem to be
> the right place to do this.

I disagree about the last point. A Unicode plain text file consists of
a stream of Unicode characters (and nothing else), interpreted according
to the Unicode standard. It should be marked with an initial U+FEFF (though
technically that is optional). This much is already clear from the standard,
as is the usage of LINE SEPARATOR and PARAGRAPH SEPARATOR for minimal,
unambiguous, plain text formatting consistent with the bidi algorithm.
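
The LINE SEPARATOR and PARAGRAPH SEPARATOR usage can be sketched roughly as
follows. This assumes U+2029 strictly separates paragraphs and U+2028
separates lines within a paragraph; the function name split_paragraphs is
hypothetical, and real text may also contain CR, LF, or CRLF, which this
sketch deliberately ignores:

```python
LINE_SEPARATOR = "\u2028"       # U+2028 LINE SEPARATOR
PARAGRAPH_SEPARATOR = "\u2029"  # U+2029 PARAGRAPH SEPARATOR


def split_paragraphs(text):
    """Return a list of paragraphs, each given as a list of lines."""
    return [
        paragraph.split(LINE_SEPARATOR)
        for paragraph in text.split(PARAGRAPH_SEPARATOR)
    ]


sample = "first line\u2028second line\u2029next paragraph"
for paragraph in split_paragraphs(sample):
    print(paragraph)
```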

The situation is complicated by the two possible byte orders (which is one
reason for the U+FEFF) and by the fact that the most widely implemented
variant, namely that in Windows NT, chose LSB-first (little-endian) order
instead of MSB-first (big-endian) order.
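
A reader of such a file can use the initial U+FEFF to tell the two byte
orders apart. The sketch below is only illustrative: the function name
decode_unicode16 is hypothetical, Python's utf-16-be and utf-16-le codecs
stand in for 16-bit Unicode, and falling back to MSB-first order when no
signature is present is an assumption of this example:

```python
def decode_unicode16(data):
    """Decode 16-bit Unicode bytes, using a leading U+FEFF to pick byte order."""
    if data[:2] == b"\xFE\xFF":
        # U+FEFF stored MSB first: big-endian
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xFF\xFE":
        # U+FEFF stored LSB first: little-endian (the Windows NT choice)
        return data[2:].decode("utf-16-le")
    # No signature: assume MSB-first order (an assumption of this sketch)
    return data.decode("utf-16-be")


print(decode_unicode16(b"\xFF\xFEH\x00i\x00"))
print(decode_unicode16(b"\xFE\xFF\x00H\x00i"))
```

Both calls decode the same two characters; only the byte order of the
stored file differs.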

But other than that, there is not much more to be said about a Unicode
plain text file. The usefulness of the concept lies in its simplicity.

--Ken Whistler
