Re: Unicode plain text standard? (was Re: Line Separator Character)

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Tue May 20 1997 - 17:47:49 EDT


> >I'm not sure what you're after. I'm mainly concerned about the continued
> >viability of files containing only graphic characters, spaces, line breaks,
> >paragraph breaks, and formfeeds. Plain, literal text that can contain
> >poetry, tables, source code, you name it, and stays like it is.
>
> I can tell you don't know what table building in Sanskrit is like, and you
> don't understand BIDI direction marking.
>
Not Sanskrit, certainly, but I know a little about Hebrew by virtue of having
devoted some time to issues of Hebrew terminal emulation in the plain-text
world, and our Kermit terminal emulators (the software we make here) are quite
popular in Israel. But yes, one must go through more than a few contortions
on one end or the other (or both) to handle BIDI issues in the terminal/host
setting, so much so that Hebrew is (according to my sources) hardly used at
all in email. The contortions involve generation and interpretation of
terminal-specific escape sequences for cursor positioning, reversal of writing
direction, character insertion, etc., and of course character-set invocation
and designation, all of which obviously add up to something more than plain
text.

So sure, of course I agree that plain streams of text are not adequate for
writing systems that are intrinsically bidirectional (like Hebrew) or for
which correct rendering is variable and context-dependent (Indic scripts,
etc.).

(So where, you might ask, is Hebrew terminal emulation used? As far as I
know, the major application by far is in library information systems like
ALEPH; there are some others, like a Hebrew version of the "vi" editor and,
more recently, Mule (MULtilingual Enhancement to GNU Emacs). At one point
some years ago I thought (naively) that the very same mechanisms could be
used for Arabic
(after all, PCs have an Arabic code page), but in practice, as far as I can
tell, no speaker of Arabic would be satisfied with a character-cell
representation of Arabic text, because of the way characters must change
shape depending on their context (as you point out), which is evidently not
an issue in Hebrew (although it might be in Yiddish).)
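
The context-dependent shaping mentioned above can be glimpsed in Unicode's
own character data. The sketch below is only an illustration, not a rendering
algorithm: it uses Python's standard unicodedata module to show that a single
Arabic letter (BEH, chosen arbitrarily) has four compatibility presentation
forms, each of which decomposes back to the same base letter. The helper name
describe_forms is hypothetical.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter as stored in plain text

# Compatibility "presentation forms" of BEH: one code point per contextual shape.
PRESENTATION_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

def describe_forms(forms):
    """Return a (shape_tag, base_letter) pair for each presentation form."""
    result = []
    for ch in forms:
        # decomposition() yields a string such as "<isolated> 0628"
        tag, base_hex = unicodedata.decomposition(ch).split()
        result.append((tag, chr(int(base_hex, 16))))
    return result

forms = describe_forms(PRESENTATION_FORMS)
for tag, base in forms:
    print(tag, unicodedata.name(base))
```

A character-cell display has to pick one of these shapes for every cell,
which is part of why a fixed grid sits awkwardly with Arabic text.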

> Having lived in Korea and Japan, and been a mathematician and APL
> programmer, I lost all faith in ASCII long ago.
>
Right -- I wasn't suggesting we all revert to ASCII -- the ability to write
text in as many languages as possible is why we're here! I am looking for the
option to extend the simplicity (and success) of ASCII to Unicode -- or at
least to the large subset of it (as Ken said) that can be used "like ASCII".
To me this means the ability to compose a plain-text message containing a
certain amount of formatting controls like line breaks, paragraph breaks, and
page breaks, that are part of the same code, and without application-specific
metacodes (SGML tags, Microsoft Word codes, etc). Let Unicode be able to
stand on its own! (Of course, also let it be used in other applications --
but that's not the issue.)
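
As a minimal sketch of what "part of the same code" could mean in practice,
the Python below recovers pages, paragraphs, and lines from a string using
nothing but in-band characters. The choice of delimiters is an assumption made
for illustration: FORM FEED (U+000C) for page breaks, PARAGRAPH SEPARATOR
(U+2029) for paragraph breaks, and LINE SEPARATOR (U+2028) for line breaks.
The function name parse_plain_text is hypothetical.

```python
FORM_FEED = "\u000c"            # page break (assumed convention)
PARAGRAPH_SEPARATOR = "\u2029"  # paragraph break
LINE_SEPARATOR = "\u2028"       # line break

def parse_plain_text(text):
    """Split text into pages -> paragraphs -> lines, with no external markup."""
    return [
        [para.split(LINE_SEPARATOR) for para in page.split(PARAGRAPH_SEPARATOR)]
        for page in text.split(FORM_FEED)
    ]

message = (
    "Roses are red," + LINE_SEPARATOR + "violets are blue."
    + PARAGRAPH_SEPARATOR + "Second paragraph."
    + FORM_FEED + "Page two."
)
pages = parse_plain_text(message)
```

No tags, styles, or application-specific codes are involved: the structure
travels with the characters themselves.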

If the world's more complex scripts need additional rules before plain text
can have a standard universal representation, to whatever extent the Unicode
2.0 standard does not already suffice, I'm all for adding them. Let's not
repeat the confusing aspects of ASCII -- particularly CRLF/CR/LF semantics --
and, as Ed suggests, let's not leave room for this kind of confusion in areas
that are new to Unicode.
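
To illustrate the CRLF/CR/LF problem, here is a small Python sketch that
folds all three legacy conventions into one unambiguous break character.
Mapping them to LINE SEPARATOR (U+2028) is an assumption made for the example,
not a recommendation taken from the standard, and normalize_line_breaks is a
hypothetical name.

```python
import re

LINE_SEPARATOR = "\u2028"

# CRLF is listed first so that it counts as one break rather than two.
_LEGACY_BREAK = re.compile(r"\r\n|\r|\n")

def normalize_line_breaks(text, separator=LINE_SEPARATOR):
    """Replace every CRLF, lone CR, or lone LF with a single separator."""
    return _LEGACY_BREAK.sub(separator, text)

mixed = "DOS line\r\nold Mac line\rUnix line\nend"
unified = normalize_line_breaks(mixed)
```

The ambiguity does not disappear, though: a receiver still cannot tell whether
a sender meant LF followed by CR as one break or two, which is exactly the
kind of confusion worth designing out in advance.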

- Frank


