RE: A basic question on encoding Latin characters

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Tue Sep 28 1999 - 16:15:29 EDT


Ken wrote:
> Sounds like a purty flimsy strawman to me.
>
It might well be.

> > But if no more characters are coming
> > (e.g. until there is some kind of response) then it would be [a match],
> > but how can the script know?
>
> By the EOL or other end-of-content marking built into the protocol.
>
But there is no protocol. Most prompts do not end with an EOL.

A script is by nature an attempt to codify human behavior in a
stimulus-response situation. The stimuli are designed for people, not
protocols, and in any case are usually not changeable (maybe you can change
them, but as soon as you do be prepared for screams of agony to go up from
the masses who, unbeknownst to you, depend for their livelihood on the prompts
not changing). Thus the script must adapt to whatever is on the other end
of the connection.

If the prompt is "login:" with no EOL, we can't force an EOL to come; ditto
for other dialog situations in which the prompt is more likely to end with
some character that might reasonably be followed by a combining character
(or not).
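
To make the dilemma concrete, here is a minimal sketch in Python of how a
script might hedge its match. The function name match_prompt and the
line_idle flag are hypothetical; line_idle is assumed to be set by the caller
once no further characters have arrived within some timeout of its choosing.
It is an illustration of the problem, not a description of any existing
scripting language.

```python
def match_prompt(received: str, prompt: str, line_idle: bool) -> str:
    """Classify the text received so far as 'no', 'pending', or 'match'.

    received  -- everything read from the connection so far
    prompt    -- the prompt text the script is waiting for, e.g. "login:"
    line_idle -- assumption: True once the caller's timeout has expired
                 with no new characters arriving
    """
    if not received.endswith(prompt):
        # Either the prompt has not arrived yet, or it arrived and was
        # followed by something else (possibly a combining character that
        # changes the final character of the would-be prompt).
        return "no"
    # The text ends with the prompt, but with no EOL we cannot know that a
    # combining character is not still on its way. Only silence settles it.
    return "match" if line_idle else "pending"
```

The three-way result is the point of the sketch: without an end-of-content
marker, the only evidence that the prompt is complete is that nothing more
has come, so the script has to wait before it can answer.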

> ... ordinarily the communication
> protocol should specify a normalized form, so it doesn't have to deal
> with alternative forms as equivalent for these purposes.
>
I believe this is what telecommunications-oriented platforms and/or
applications are doing when they avoid the issue of combining forms by saying
they don't support them.
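
For illustration, a small Python sketch of what "specifying a normalized form"
buys you, using the standard unicodedata module. The helper name same_text is
hypothetical, and the choice of NFC is an assumption made for the example; it
is not something any particular protocol is known to require.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A plain comparison of precomposed and decomposed fails, while same_text treats
them as equal. If both ends agree on one form in advance, the receiver never
has to make this comparison at all.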

> ... as the Unicoders have continually pointed out, Implementation Level 1
> is a crutch for brain-damaged implementations that cannot handle anything
> complex. It rules out support for all of the complex scripts of the world.
>
Meaning Indic, Arabic, etc... Of course this is true, and yet Level 1 exists
and developers will use it. We have in UTF-8 a vigorous attempt to embrace
the "legacy" terminal/host world and existing applications to promote easy
migration from ASCII to Unicode (and somewhat less easy from 8-bit character
sets). But these very platforms are accessed in a simple and open manner
which does not mesh well with complex scripts.
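
The difference between the two migration paths can be seen in a few lines of
Python. The sample strings are made up for the example; the encodings are the
standard ones built into the language.

```python
ascii_text = "login:"
accented_text = "caf\u00e9"   # contains LATIN SMALL LETTER E WITH ACUTE

# ASCII text is byte-for-byte identical in UTF-8.
ascii_bytes = ascii_text.encode("ascii")
ascii_as_utf8 = ascii_text.encode("utf-8")

# The same accented text has different bytes in Latin-1 and in UTF-8.
latin1_bytes = accented_text.encode("latin-1")
utf8_bytes = accented_text.encode("utf-8")

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

An ASCII-only host needs no conversion to be read as UTF-8, whereas the
Latin-1 bytes are not even well-formed UTF-8 and must be converted.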

We might wish to wipe away the legacy of fifty years of computing and start
over (in more ways than one!) but I fear there will never be a replacement
for the simple and open terminal/host access method that will support
complex scripts and still be as open and vendor-neutral as the terminal/host
model. We are suffering already from the lack of open (e.g. Telnet) access
to Macintosh and Windows platforms.

I'm not saying I know what to do, only that "throw away your medieval tools
and enter the modern age" is as likely to result in a new Tower of Babel as
it is to promote universal communication. But this time the Babel is not in
character sets but in the profusion of ever-changing and incompatible
vendor- and application-specific protocols and data formats.

Perhaps it's all a tempest in a teapot. For some time to come we will have
all possible combinations of "legacy" and Unicode-aware hosts and clients,
and we have to allow for each combination. Different problems will come up
in each configuration, and we'll see how to deal with them. My hope is that
it will not be by inventing a neverending stream of Three-Letter Acronyms to
"comply" with, on top of Unicode itself, just to get text from point A to
point B. If you thought you hated ISO 2022, just think of the standards
nightmare that will grow out of that!

- Frank


