RE: A basic question on encoding Latin characters

From: Marco.Cimarosti@icl.com
Date: Wed Sep 29 1999 - 05:26:13 EDT


I know how Unix-like shells and terminal applications work and what they are
used for.

As you correctly guessed, I used to have a PPP connection script that worked
exactly like the one you described (and, by the way, it screwed up when one of
the employees at my Internet provider bought an English grammar and decided
that "Log in:" was better English than "login:").

But precisely because these things *already* work fine, I don't see the point
of adding Unicode support to such applications.
It is like adding legs to the drawing of a snake (as the Chinese proverb goes).

But if Unicode support HAS to be added, I don't understand how you hope to do
it without changing something.

If you want or need Unicode, and you don't want to give up even a penny of the
old behavior of showing characters as soon as they arrive, your only choice is
to do what someone has already suggested: redraw part or all of the current
line of text each time a new character arrives.
That is what a Unicode editor or word processor does when you are at the end of
the line (or anywhere else) and type one more character.
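
A minimal sketch of that redisplay approach, in Python, assuming a VT100-ish
terminal where backspace moves the cursor left without erasing; the sample
input is mine, purely for illustration:

    import sys
    import unicodedata

    line = []   # cells already drawn on the current line

    def receive(ch):
        # Show every character the moment it arrives.  If a combining mark
        # turns up, step back over the cell just drawn and redraw it with
        # the composed form, instead of holding the base character back.
        if line and unicodedata.combining(ch):
            base = line.pop()
            cell = unicodedata.normalize("NFC", base + ch)
            sys.stdout.write("\b" + cell)   # backspace, overwrite in place
            line.append(cell)
        else:
            sys.stdout.write(ch)
            line.append(ch)
        sys.stdout.flush()

    # 'A' appears immediately; the combining diaeresis then turns it into 'Ä'.
    for c in "A\u0308B":
        receive(c)

If no precomposed form exists, NFC leaves the base and the mark as they are,
and a Unicode-aware display should still stack them in the same cell.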

And you should not "wait" for an accent any more than you wait for any other
character. Consider that the "h" in English digraphs like "sh", "ch" or "th"
is pretty much the same thing as a diacritical sign. So what should your
terminal do? Stop each time it sees an "s" just because it could be the
first part of a "sh"!?
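
The same goes for a script that watches for a prompt: it can match on what has
actually arrived, normalized, and act the moment the trigger text is there. A
rough sketch in Python; the prompt text and the helper name are only
illustrative:

    import unicodedata

    def prompt_seen(received, trigger="login:"):
        # Compare in one normalization form and fire as soon as the trigger
        # has arrived; don't wait for characters that may never come.
        return unicodedata.normalize("NFC", received).endswith(
            unicodedata.normalize("NFC", trigger))

    buffer = ""
    for ch in "Welcome to the host\r\nlogin:":   # pretend this trickles in
        buffer += ch
        if prompt_seen(buffer):
            print("send the user name now")
            break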

Regards.
        Marco

> -----Original Message-----
> From: Frank da Cruz [SMTP:fdc@watsun.cc.columbia.edu]
> Sent: 1999 September 28, Tuesday 20.18
> To: Unicode List
> Subject: RE: A basic question on encoding Latin characters
>
> Marco.Cimarosti@icl.com wrote:
> > I am not sure if I understood very well, but it seems to me that you are
> > basing your observation on the very peculiar behavior of your
> > application.
> >
> Not peculiar -- this is how open and shared access to computers has worked
> since the 1960s: the interactive dialog model, prompt and command.
>
> > I understand that your hypothetical terminal software is trying to render
> > Unicode text as soon as it arrives, CHARACTER BY CHARACTER.
> >
> That's how terminals work. If the host sends a character, the user should
> see it on the screen immediately. As any maker of terminal emulation
> software can tell you, users are surprisingly intolerant of delays, even
> very small ones. The acid test is echoing in the full-duplex environment.
> I press the 'A' key, the code for 'A' goes to the host and then comes back
> to be displayed on the screen as an 'A'. This must be instantaneous.
>
> Or, to put it another way, a terminal is not a Web browser.
>
> > But there is no need of exotic alphabets or combining accents to screw up
> > your design: sticking to good old ASCII, what would your modem script do
> > if the prompt "login:" were translated into the Italian "codice
> > d'accesso:"?
> > It would wait, I think, until the Italian government changes the
> > constitution to drop Italian and adopt English as the official language.
> >
> True, but the fact remains that a very large number of scripting
> applications exist and are used every day in the real world, and they are
> used in "mission-critical" applications too. It is "a way of doing
> business" in a world where platforms such as UNIX, VMS, VOS, VM/CMS,
> MVS/TSO, and OS/400 still exist and may be accessed openly. Modems
> themselves are controlled almost exclusively by scripts (how do you think
> your PPP dialer works?).
>
> The business of Unicode is not to promote certain styles of computing and
> obliterate others; it is to provide a universal character set that can be
> used in any application.
>
> > If such a medieval design cannot be avoided because of technical
> > constraints, it would be wiser, to my mind, to do one of the following:
> >
> > - support Unicode only after login;
> >
> Login is just one example. A terminal session with a UNIX (VMS, VOS, etc)
> host is an arbitrary series of prompts and commands.
>
> > - impose that the prompt and the answer be on separate lines: in this
> > case, the line terminator character(s) would act as the "higher level
> > protocol" to signal "ok, I'm finished transmitting, now it's your turn"
> > that you suggested;
> >
> A proposal to change all of the world's hosts is not practical. Even if
> this were done, it would break all the world's scripts :-)
>
> > - re-engineer the login and terminal software entirely, using more
> > up-to-date techniques.
> >
> Of course many people believe the answer is to modernize everything. But
> today this means replacement of simple, proven, and open means of access
> with proprietary and unstable ones.
>
> François Yergeau <yergeau@alis.com> wrote:
> > There is no good reason for the terminal not to print the final character
> > when received. If a combining character comes later, the terminal simply
> > has to redisplay the combination over the previous glyph. This is what
> > our Arabic terminals and emulators have been doing for years (e.g.
> > receive an Arabic letter and display it in final form; receive another
> > letter, redisplay the previous one in middle form and the new one in
> > final form).
> >
> Yes, we discussed this here before; there are complications with line
> wrapping, scrolling regions, etc., but to overcome them is a "mere matter
> of programming".
>
> > > There is no escape from this situation other than introduction of a
> > > "higher level protocol" to signal "ok, I'm finished transmitting, now
> > > it's your turn", just like in the old half-duplex days.
> >
> > Well, it seems to me that the login protocol *is* a higher level protocol
> > with respect to Unicode.
> >
> Again, the login process is only one element of a session consisting of an
> arbitrary sequence of prompts and responses.
>
> > If the protocol says that "login:" is to be acted upon, I don't see why
> > the terminal-side script couldn't act on it without waiting for possible
> > combining characters that won't be coming. There's no use in waiting for
> > the next base character; the triggering string has been received.
> >
> But then is the application "Unicode compliant"? But more to the point
> (bearing in mind that we are speaking not just of logging in, but of any
> prompt and response), if we ignore the possibility that combining
> characters might follow the trigger string, then we can have "false
> positives", or for that matter also false negatives.
>
> "Mark E. Davis" <markdavis@ispchannel.com> wrote:
> > We should make it very clear that Normalization Form C does *not*
> > eliminate combining characters. It does precompose them where possible,
> > but for many scripts and characters it is not possible, or desirable.
> >
> Yes, this is spelled out very clearly in the technical report. In this way
> Unicode Normalization Form C differs from ISO 10646 Implementation Level 1,
> in which "a CC element shall not contain coded representations of combining
> characters". I think this more accurately represents the position taken by
> the authors of Plan 9 and (correct me if I'm wrong) those working on the
> Linux console and UTF-8 xterm.
>
> > Exactly the same problem that you discuss occurs with any script that
> > requires shaping. When I type an Arabic character, the previous character
> > needs to change shape. What the terminal needs to do is replace the glyph
> > on the screen with a different form. As I recall from my terminal days,
> > the controls for doing this are available. The same technique can be used
> > for accents. Type an A, see an A. Then type an umlaut, and the host picks
> > it up, decides that it needs a composed presentation form, and replaces
> > the A by Ä on the screen. Of course, the display on the terminal still
> > depends on the "font" that it has, which may or may not allow dynamic
> > composition, but fundamentally I don't see the problem.
> >
> The real problem comes in scripting. Scripts are a method of forcing
> intrinsically noncooperating processes to cooperate. Suppose a script is
> looking for "ABC", and ABC comes. If the next character will be a combining
> cedilla, this would not be a match. But if no more characters are coming
> (e.g. until there is some kind of response) then it would be, but how can
> the script know? The best we can do is set a timeout period that is long
> enough to allow for the longest possible intercharacter spacing on the
> busiest day of the Internet and hope we haven't guessed wrong. And even if
> we haven't, this technique would cause every match to consume the entire
> timeout interval.
>
> - Frank
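
For what it is worth, the timeout compromise Frank describes in his last
paragraph might look roughly like this. It is only a sketch; it assumes the
decoded characters arrive on a queue.Queue fed by a reader thread, and the
half-second grace period is an arbitrary choice:

    import queue
    import unicodedata

    def wait_for(trigger, chars, grace=0.5):
        # chars is a queue.Queue of already-decoded characters.  Once the
        # trigger text has arrived, wait up to `grace` seconds to be sure no
        # combining character follows it before declaring a match; this is
        # why every successful match costs the whole grace interval.
        trigger = unicodedata.normalize("NFC", trigger)
        seen = ""
        while True:
            matched = unicodedata.normalize("NFC", seen).endswith(trigger)
            try:
                ch = chars.get(timeout=grace if matched else None)
            except queue.Empty:
                return seen      # the grace period stayed quiet: it's a match
            seen += ch           # more arrived; a combining mark undoes the match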


