Unicode plain text standard? (was Re: Line Separator Character)

From: Edward Cherlin (cherlin@cauce.org)
Date: Tue May 20 1997 - 02:57:56 EDT


>Oops, never mind -- it was this:
>
>> We want to have a uniform, portable definition of the meaning of a file of
>> 16-bit character codes interpreted as Unicode, or "Unicode text file" for
>> short. At the same time, we have several uses for such files, where
>> different interpretations may be desired. If we want to do this right, I
>> think we have to find the appropriate organization for defining such file
>> formats and uses, and get down to some serious and at times difficult
>> standard making. The Unicode character code standard does not seem to be
>> the right place to do this.
>>
>I'm not sure what you're after. I'm mainly concerned about the continued
>viability of files containing only graphic characters, spaces, line breaks,
>paragraph breaks, and formfeeds. Plain, literal text that can contain
>poetry, tables, source code, you name it, and stays like it is.

I can tell you don't know what table building in Sanskrit is like, and you
don't understand BIDI direction marking.

>Pretty much what we have today with 7- and 8-bit plain text, except without
>the confusion over CRLF/CR/LF, etc.

and the utter incompatibility of the extra 128 characters in the 8-bit sets
between PC DOS, PC Windows, Mac, various Unix definitions, and all the
other extended ASCII code sets such as PC code pages and the ISO 8859
series. Files of 8-bit characters are extremely non-portable.
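
To make the non-portability concrete, here is a minimal sketch in Python, assuming the standard codecs named below are fair stand-ins for the Latin-1, Windows, DOS, and Mac character sets mentioned above. The byte value 0xE9 is an arbitrary pick from the upper 128; nothing in the thread singles it out.

```python
# One byte from the upper half of an 8-bit set, decoded under four
# different "extended ASCII" conventions. The codec names are Python's;
# the choice of byte 0xE9 is an illustrative assumption.
raw = bytes([0xE9])

codecs_to_try = ["latin-1", "cp1252", "cp437", "mac_roman"]
decoded = {name: raw.decode(name) for name in codecs_to_try}

for name, ch in decoded.items():
    # Print the resulting code point so the difference is visible
    # even on a terminal that cannot display the character itself.
    print(f"{name:10} -> U+{ord(ch):04X}")
```

The same byte comes out as three different characters across the four code sets, which is the problem a receiver faces when the file does not say which set it was written in.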

Having lived in Korea and Japan, and been a mathematician and APL
programmer, I lost all faith in ASCII long ago. It is horribly inadequate
for English, and more so for almost any other language, except for various
computer programming languages and constructed languages like Lojban, which
were deliberately built within the limits of ASCII, or in the old days
EBCDIC.

>I think that what's really valuable about
>these files is their self-contained and independent expressiveness -- they
>don't need a rendering engine, they don't need any special transport protocol
>-- they contain the text and the minimal control information to be transported
>and understood universally.

>- Frank

I agree on the transport protocol in principle, although today we need
UTF-7, UTF-8, and other encodings to get Unicode through existing channels.
But the idea of full Unicode text without a rendering engine won't fly.
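
As a rough illustration of the transport side, the sketch below encodes one short mixed English/Hebrew string three ways using Python's built-in codecs. The sample string and the variable names are assumptions made for the example, not anything from the discussion.

```python
# The same text carried in three transport encodings.
# The sample string (Latin letters plus four Hebrew letters) is hypothetical.
text = "Unicode \u05e9\u05dc\u05d5\u05dd"

encodings = ["utf-7", "utf-8", "utf-16-be"]
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    # Every encoding must decode back to exactly the same characters.
    assert data.decode(name) == text
    print(f"{name:10} {len(data):3d} bytes  {data!r}")

# UTF-7 stays within 7-bit ASCII, which is why it suits old mail gateways.
utf7_is_ascii = all(b < 0x80 for b in encoded["utf-7"])
```

All three carry identical text; they differ only in byte count and in which channels will pass them unharmed.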

Plain text with no rendering engine is fine for simple alphabetic scripts, and even for Chinese and
Japanese. It doesn't work right for RTL scripts (Arabic and Hebrew),
especially for mixtures of RTL and LTR, and for scripts that combine
characters into larger groups, usually syllables. This includes Korean, all
of the Indic scripts, Tibetan, and Ethiopic. Arabic script has a very large
dependence on ligatures, some of them quite complex. There are also
problems for rendering math expressions in plain text. Then there are
various deprecated characters, the private use areas, and the surrogate
character mechanism.
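
A small sketch of three of these points, using Python's standard unicodedata module. It assumes a modern Unicode character database, which is much newer than the one under discussion here, and the sample characters are my own choice. It shows only that the stored codes carry properties a display process must act on; it is not a rendering engine.

```python
import unicodedata

# 1. Directionality: each character has a bidirectional class that a
#    display process must consult to order mixed LTR/RTL text.
samples = {"latin A": "A", "hebrew alef": "\u05d0", "arabic beh": "\u0628"}
bidi = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

# 2. Syllable building: three Korean jamo (hieut, a, nieun) stored as
#    separate codes are meant to be shown as one syllable block.
jamo = "\u1112\u1161\u11ab"
syllable = unicodedata.normalize("NFC", jamo)

# 3. Surrogates: one character beyond the 16-bit range occupies two
#    16-bit code units in a UTF-16 file.
beyond_16_bits = "\U00010000"
utf16_bytes = beyond_16_bits.encode("utf-16-be")

print(bidi)
print(len(jamo), "codes ->", len(syllable), "code after composition")
print(len(beyond_16_bits), "character ->", len(utf16_bytes) // 2, "code units")
```

In each case a program that treats the file as a flat run of 16-bit codes, one code per displayed cell, will get the result wrong.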

Anyone who thought the CRLF business was bad should consider how many
incompatible choices can be made in Unicode. Yes, it is true that the Unix
file model of a sequence of uninterpreted bytes is very general, and so is
a file of uninterpreted 16-bit codes, but files have to be interpreted to
be useful. We gloss over the amount of interpretation we do on ASCII text
files, but we cannot do that with Unicode.
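
Even the first step of interpretation, pairing bytes into 16-bit codes, involves a choice. The sketch below, again in Python with a made-up four-byte sample, reads the same bytes under both byte orders.

```python
# Four bytes from a hypothetical "file of 16-bit codes" with no
# byte-order mark and no outside information about its origin.
raw = b"\x00\x41\x00\x42"

as_big_endian = raw.decode("utf-16-be")
as_little_endian = raw.decode("utf-16-le")

print([f"U+{ord(c):04X}" for c in as_big_endian])
print([f"U+{ord(c):04X}" for c in as_little_endian])
```

Both readings are valid sequences of codes, so nothing in the bytes themselves says which one the writer meant.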

--
Edward Cherlin       Help outlaw Spam     Everything should be made
Vice President     http://www.cauce.org      as simple as possible,
NewbieNet, Inc.  1000 members and counting      __but no simpler__.
http://www.newbie.net/    17 May 97   Attributed to Albert Einstein


