Re: Unicode plain text (Was: Line Separator Character)

From: Edward Cherlin (cherlin@cauce.org)
Date: Tue May 20 1997 - 02:26:33 EDT


kenw@sybase.com (Kenneth Whistler) wrote:

[snip]
>You can still use U+000C FORM FEED in Unicode plain text, and a renderer that
>knows about page breaks can do the "right thing", namely whatever it did with
>^L for an ASCII text. FORM FEED, like HORIZONTAL TAB, was not considered to
>be ambiguous enough in usage (unlike CR/LF) to require any separate encoding
>in Unicode.
>
>> In any case, the strong Use-A-GUI thrust of Unicode will make it
>> increasingly difficult for certain kinds of people to operate in the ways
>> to which they have become accustomed over the past decades in which plain
>> text was "good enough" save that one could not put lots of languages into
>> it.
>
>The goal of Unicode plain text is to recapture that portability in the
>encoding, but also allow you to put lots of languages into it. The "Use-A-GUI
>thrust" of Unicode acknowledges the fact that rendering of complex scripts
>(including the Latin script with generative use of combining marks) requires
>logic that is much more amenable to implementation in a GUI framework than in
>a terminal model. However, appropriate (and very large and useful) subsets of
>Unicode *can* be implemented with simple rendering models. (Cf. Windows NT
>until very recently. :-) )
>
>> I can move this letter to practically any
>> other platform and it will still be perfectly legible and printable -- no
>> export or import or conversion or version skew to worry about. I think a
>> lot of people would be perfectly happy to do the same in a plain-text
>> Unicode world using plain-text Unicode terminals and printers, if there
>> were such things.

The Everson Mono fonts would suit such a product admirably, up to a point.

>That is exactly what Unicode plain text is all about. And, by the way,
>Notepad on Windows NT was pretty close to being a "plain-text Unicode
>terminal".
>
>> The idea that one must embed Unicode in a higher level wrapper (e.g. a
>> Microsoft Word document, or even HTML) to make it useful has a certain
>> frightening consequence: the loss of any expectancy of longevity for our new
>> breed of documents.
>
>There is absolutely nothing new about this. I was warning my linguistic
>colleagues about the longevity of their documents when they started using
>WordStar back around 82/83. 7-bit ASCII is the only encoding that stayed
>stable enough and was widely enough implemented to retain easy
>transmissibility across the computer generations without the intervention
>of information archaeologists. Well, 16-bit Unicode plain text is aimed at
>no less a goal than being the universal wide-ASCII plain text of the 21st
>century.
>
[snip]
>
>> So let's do our part and make some effort to accommodate traditional
>> plain-text applications in Unicode, rather than discourage them :-)
>
>I agree completely. An excellent example of the appropriate place for
>a Unicode plain-text editor would be a Java IDE. If someone writes
>a good Unicode plain-text editor for such an application, it would
>have wider applicability. (I know I often use the editors of C++
>IDEs to create (ASCII) plain text when I don't want it all gummed up
>as a Word or Frame document.)
>
>Ed Cherlin commented:
>
>> We want to have a uniform, portable definition of the meaning of a file of
>> 16-bit character codes interpreted as Unicode, or "Unicode text file" for
>> short. At the same time, we have several uses for such files, where
>> different interpretations may be desired. If we want to do this right, I
>> think we have to find the appropriate organization for defining such file
>> formats and uses, and get down to some serious and at times difficult
>> standard making. The Unicode character code standard does not seem to be
>> the right place to do this.
>
>I disagree about the last point. A Unicode plain text file consists of
>a stream of Unicode characters (and nothing else), interpreted according
>to the Unicode standard. It should be marked with an initial U+FEFF (though
>technically that is optional). This much is already clear from the standard,
>as is the usage of LINE SEPARATOR and PARAGRAPH SEPARATOR for minimal,
>unambiguous, plain text formatting consistent with the bidi algorithm.

I'm not concerned about where. If the Unicode standard is an acceptable
place to do this, I'm in.

>The situation is complicated by the two possible byte orders (which is one
>reason for the U+FEFF) and by the fact that the most widely implemented
>variant, namely that in Windows NT, chose LSB order instead of MSB order.
>
>But other than that, there is not much more to be said about a Unicode
>plain text file. The usefulness of the concept lies in its simplicity.
>
>--Ken Whistler

I disagree about the simplicity of the problem. Some of the leading issues are:

byte order in storage and transmission
line, paragraph, and page breaks
BIDI (Hebrew, Arabic, etc.)
non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.)
multiply accented characters (IPA, math, several human languages)
math
compatibility characters
private use characters
control codes
other deprecated characters
surrogates, especially unpaired surrogate codes
non-character values
text processing algorithms (sorting, upper and lower case, pattern matching)
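
To make one item on the list concrete, here is a minimal sketch of what
checking for unpaired surrogates and non-character values might look like.
It assumes the text has already been read into a sequence of 16-bit code
units; the function names are hypothetical, and treating only U+FFFE and
U+FFFF as non-characters is a simplification made for this illustration.

```python
# Hypothetical helpers; the input is a sequence of 16-bit code units (ints).
NONCHARACTERS = {0xFFFE, 0xFFFF}  # simplification: only these two values


def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that have no partner."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high (leading) surrogate
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair: skip both units
                continue
            bad.append(i)  # high surrogate with no low surrogate after it
        elif 0xDC00 <= u <= 0xDFFF:
            bad.append(i)  # low surrogate not preceded by a high surrogate
        i += 1
    return bad


def find_noncharacters(units):
    """Return the indexes of code units that are non-character values."""
    return [i for i, u in enumerate(units) if u in NONCHARACTERS]
```

What a conforming reader should then do with the offending positions
(reject the file, drop the unit, substitute a replacement) is exactly the
kind of choice that needs a rule.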

Full portability of data requires some rules. If there is no standard,
users of "Unicode text files" will make every possible choice about each of
these issues. CRLF will be nothing in comparison. We have begun to see
programs that can handle CRLF, CR alone, and LF alone, either line-by-line
or in paragraph format, reading and writing in any option. The range of
choices for Unicode is far greater, and I don't want to think about how
long it would take to achieve unity if we don't do it now.
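
As an illustration of the CRLF case only, the following is a minimal sketch
of a reader that accepts CRLF, CR alone, and LF alone and writes a single
convention. The function name is hypothetical, and mapping every legacy
break to U+2028 LINE SEPARATOR is an assumption of this sketch, not a
requirement stated anywhere above.

```python
import re

# CRLF must be tried first so that it counts as one break, not two.
_LEGACY_BREAK = re.compile("\r\n|\r|\n")

LINE_SEPARATOR = "\u2028"


def normalize_line_breaks(text, newline=LINE_SEPARATOR):
    """Replace CRLF, lone CR, and lone LF with a single chosen line break."""
    return _LEGACY_BREAK.sub(newline, text)
```

Calling `normalize_line_breaks(text, newline="\n")` instead gives LF-only
output. Note that the sketch cannot tell a line break from a paragraph
break; that distinction is lost in the legacy conventions themselves.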

The process for dealing with byte order is fairly simple in itself, and the
standard gives clear conformance requirements. Most of the other issues I
listed have thorns, few in some cases, and many in others.
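
The byte-order step can be sketched roughly as follows: look for an initial
U+FEFF in either byte order and decode accordingly. This is an illustration
under stated assumptions, not a conformance statement. The function name is
hypothetical, Python's "utf-16-be" and "utf-16-le" codecs stand in for
16-bit Unicode, and falling back to MSB-first when no U+FEFF is present is
a default chosen for this sketch.

```python
import codecs


def decode_unicode_text(data, default="big"):
    """Decode 16-bit Unicode bytes, using an initial U+FEFF if present."""
    if data[:2] == codecs.BOM_UTF16_BE:  # bytes FE FF: MSB first
        return data[2:].decode("utf-16-be")
    if data[:2] == codecs.BOM_UTF16_LE:  # bytes FF FE: LSB first
        return data[2:].decode("utf-16-le")
    # No byte order mark: fall back to the caller's assumed order.
    codec = "utf-16-be" if default == "big" else "utf-16-le"
    return data.decode(codec)
```

For example, `decode_unicode_text(b"\xff\xfeA\x00")` reads the LSB-first
form that Windows NT writes, and the initial U+FEFF is dropped from the
result.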

When I was in Korea in the 1960s, telegrams were printed linearly, so
Koreans can read this form of their script if they have to. Indic scripts,
Ethiopic, and a few others, would require special training to read as
separate elements in a straight line. Do we wish to say that users of these
scripts can't have text files? Do we say we have to come up with a suitable
rendering method for Unicode text files including full BIDI and full
character-->glyph composition? Do we say that there should be
implementation levels? None of these alternatives is quite satisfactory at
present.

--
Edward Cherlin       Help outlaw Spam     Everything should be made
Vice President     http://www.cauce.org      as simple as possible,
NewbieNet, Inc.  1000 members and counting      __but no simpler__.
http://www.newbie.net/    17 May 97   Attributed to Albert Einstein


