RE: A basic question on encoding Latin characters

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Tue Sep 28 1999 - 14:18:49 EDT


Marco.Cimarosti@icl.com wrote:
> I am not sure if I understood very well, but it seems to me that you are
> basing your observation on the very peculiar behavior of your application.
>
Not peculiar -- this is how open and shared access to computers has worked
since the 1960s: the interactive dialog model, prompt and command.

> I understand that your hypothetical terminal software is trying to render
> Unicode text as soon as it arrives, CHARACTER BY CHARACTER.
>
That's how terminals work. If the host sends a character, the user should
see it on the screen immediately. As any maker of terminal emulation
software can tell you, users are surprisingly intolerant of delays, even
very small ones. The acid test is echoing in the full-duplex environment.
I press the 'A' key, the code for 'A' goes to the host and then comes back
to be displayed on the screen as an 'A'. This must be instantaneous.
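
To make the timing constraint concrete, here is a minimal sketch in Python
of the full-duplex character-at-a-time loop (the host name is a placeholder
and error handling is omitted; this is not taken from any real emulator):

    # Each keystroke goes to the host at once, and each byte from the
    # host appears on the screen the instant it arrives.
    import os, select, socket, sys, termios, tty

    sock = socket.create_connection(("host.example.com", 23))
    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    tty.setraw(fd)                    # raw mode: no local echo, no buffering
    try:
        while True:
            ready, _, _ = select.select([fd, sock], [], [])
            if fd in ready:
                sock.sendall(os.read(fd, 1))         # keystroke -> host
            if sock in ready:
                byte = sock.recv(1)                  # host echo/output
                if not byte:
                    break
                os.write(sys.stdout.fileno(), byte)  # -> screen, no delay
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)

Note that there is no buffering stage at which the program could pause to
see whether a combining character follows.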

Or, to put it another way, a terminal is not a Web browser.

> But there is no need of exotic alphabets or combining accents to screw up
> your design: sticking to good old ASCII, what would your modem script do
> if the prompt "login:" were translated into the Italian "codice
> d'accesso:" ("access code:")? It would wait, I think, until the Italian
> government changes the constitution to drop Italian and adopt English
> as the official language.
>
True, but the fact remains that a very large number of scripting applications
exist and are used every day in the real world, including in "mission-critical"
applications. It is "a way of doing business" in a world where platforms such
as UNIX, VMS, VOS, VM/CMS, MVS/TSO, and OS/400 still exist and may be accessed
openly. Modems themselves are controlled almost exclusively by scripts (how do
you think your PPP dialer works?).
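
For instance, a dialer script is little more than a list of expect/send
pairs. A sketch in Python (using the pyserial package; the device name,
phone number, and prompt strings are all made up):

    # Expect/send modem script of the kind a PPP dialer runs.
    import time
    import serial                     # pyserial, assumed installed

    def expect(port, target, timeout=30):
        """Read byte by byte until `target` has been seen, or give up."""
        deadline = time.time() + timeout
        seen = b""
        while time.time() < deadline:
            seen += port.read(1)      # returns b"" after a 1-second timeout
            if seen.endswith(target):
                return True
        return False

    port = serial.Serial("/dev/ttyS0", 57600, timeout=1)
    port.write(b"ATZ\r");         expect(port, b"OK")       # reset modem
    port.write(b"ATDT5551234\r"); expect(port, b"CONNECT")  # dial out
    expect(port, b"login:");      port.write(b"myname\r")
    expect(port, b"ssword:");     port.write(b"mypass\r")

The "ssword:" trick (matching both "Password:" and "password:") is typical
of such scripts, and shows how fragile the matching already is even before
combining characters enter the picture.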

The business of Unicode is not to promote certain styles of computing and
obliterate others; it is to provide a universal character set that can be
used in any application.

> If such a medieval design cannot be avoided because of technical
> constraints, it would be wiser, in my mind, to do one of the following:
>
> - support Unicode only after login;
>
Login is just one example. A terminal session with a UNIX (or VMS, VOS, etc.)
host is an arbitrary series of prompts and commands.

> - impose that the prompt and the answer be on separate lines: in this
> case, the line terminator character(s) would act as the "higher level
> protocol" to signal "ok, I'm finished transmitting, now it's your turn"
> that you suggested;
>
A proposal to change all of the world's hosts is not practical. Even if this
were done, it would break all the world's scripts :-)

> - re-engineer the login and terminal software entirely, using more
> up-to-date techniques.
>
Of course many people believe the answer is to modernize everything. But
today this means replacement of simple, proven, and open means of access
with proprietary and unstable ones.

François Yergeau <yergeau@alis.com> wrote:
> There is no good reason for the terminal not to print the final character
> when received. If a combining character comes later, the terminal simply
> has to redisplay the combination over the previous glyph. This is what our
> Arabic terminals and emulators have been doing for years (e.g. receive an
> Arabic letter and display it in final form; receive another letter,
> redisplay the previous one in middle form and the new one in final form).
>
Yes, we have discussed this here before; there are complications with line
wrapping, scrolling regions, etc., but overcoming them is a "mere matter of
programming".
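
In the simplest case the technique amounts to backing up one cell and
overprinting. A sketch in Python (assuming the font has the combined
glyph, and ignoring the line-wrap and scrolling complications just
mentioned):

    # When a combining character arrives, back up one cell and redraw
    # the combination over the previous glyph.
    import sys, unicodedata

    shown = ""                            # what occupies the current cell

    def put(ch):
        global shown
        if unicodedata.combining(ch):     # nonzero class: a combining mark
            shown = unicodedata.normalize("NFC", shown + ch)
            sys.stdout.write("\b" + shown)        # overwrite the cell
        else:
            shown = ch
            sys.stdout.write(ch)
        sys.stdout.flush()

    for c in "A\u0308h!":                 # host sends A, U+0308, h, !
        put(c)                            # screen ends up showing "Äh!"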

> >There is no escape from this situation other than introduction of a "higher
> >level protocol" to signal "ok, I'm finished transmitting, now it's your
> >turn", just like in the old half-duplex days.
>
> Well, it seems to me that the login protocol *is* a higher-level protocol
> with respect to Unicode.
>
Again, the login process is only one element of a session consisting of an
arbitrary sequence of prompts and responses.

> If the protocol says that "login:" is to be acted upon, I don't see why
> the terminal-side script couldn't act on it without waiting for possible
> combining characters that won't be coming. There's no use in waiting for
> the next base character; the triggering string has been received.
>
But then is the application "Unicode compliant"? More to the point (bearing
in mind that we are speaking not just of logging in, but of any prompt and
response), if we ignore the possibility that combining characters might
follow the trigger string, then we can have "false positives" -- or, for
that matter, false negatives.
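
A concrete illustration (the trigger string is made up): a script looking
for "ABC" fires the instant the C arrives, even though the host was in the
middle of sending a decomposed Ç:

    # The trigger fires one character too early: the host meant "ABÇ",
    # sent as "ABC" followed by U+0327 (combining cedilla).
    import unicodedata

    trigger = "ABC"
    received = ""
    for ch in "ABC\u0327":                # arrives one character at a time
        received += ch
        if received.endswith(trigger):
            print("match fired after:", ascii(received))
    print("host meant:", unicodedata.normalize("NFC", received))
    # match fired after: 'ABC'
    # host meant: ABÇ

The converse case, a script waiting for a precomposed "Ç" while the host
sends the decomposed form, is the false negative.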

"Mark E. Davis" <markdavis@ispchannel.com> wrote:
> We should make it very clear that Normalization Form C does *not*
> eliminate combining characters. It does precompose them where possible,
> but for many scripts and characters it is not possible, or desirable.
>
Yes, this is spelled out very clearly in the technical report. In this way
Unicode Normalization Form C differs from ISO 10646 Implementation Level 1,
in which "a CC element shall not contain coded representations of combining
characters". I think the latter more accurately represents the position
taken by the authors of Plan 9 and (correct me if I'm wrong) by those
working on the Linux console and UTF-8 xterm.
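
Mark's point is easy to demonstrate with, say, Python's unicodedata module
standing in for the normalizer:

    # NFC composes where a precomposed character exists, and leaves the
    # combining mark in place where none does.
    import unicodedata

    print(ascii(unicodedata.normalize("NFC", "e\u0301")))  # '\xe9' (é)
    print(ascii(unicodedata.normalize("NFC", "q\u0323")))  # 'q\u0323'
                                  # no precomposed q-with-dot-below exists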

> Exactly the same problem that you discuss occurs with any script that
> requires shaping. When I type an Arabic character, the previous character
> needs to change shape. What the terminal needs to do is replace the glyph
> on the screen with a different form. As I recall from my terminal days,
> the controls for doing this are available. The same technique can be used
> for accents. Type an A, see an A. Then type an umlaut, and the host picks
> it up, decides that it needs a composed presentation form, and replaces
> the A by Ä on the screen. Of course, the display on the terminal still
> depends on the "font" that it has, which may or may not allow dynamic
> composition, but fundamentally I don't see the problem.
>
The real problem comes in scripting. Scripts are a method of forcing
intrinsically noncooperating processes to cooperate. Suppose a script is
looking for "ABC", and ABC comes. If the next character turns out to be a
combining cedilla, this is not a match. But if no more characters are coming
(at least not until the script itself responds), then it is a match -- and
how can the script know? The best we can do is set a timeout period long
enough to allow for the longest possible intercharacter gap on the busiest
day of the Internet, and hope we haven't guessed wrong. And even if we
haven't, this technique makes every match consume the entire timeout
interval.
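
In outline, the best available workaround looks like this sketch (the
guard interval is, by construction, a guess):

    # After the trigger matches we must wait out a guard interval in
    # case a combining character is still in flight -- so every
    # successful match costs the full wait.
    import select

    GUARD = 5.0                           # seconds; necessarily a guess

    def wait_for(sock, trigger):
        received = b""
        while True:
            byte = sock.recv(1)
            if not byte:
                return None               # connection closed, no match
            received += byte
            if received.endswith(trigger):
                # Trigger seen -- but a combining mark may still follow.
                ready, _, _ = select.select([sock], [], [], GUARD)
                if not ready:
                    return received       # silence for GUARD secs: a match
                # More data is in flight; it may invalidate the match,
                # so keep reading and re-checking.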

- Frank


