RE: A basic question on encoding Latin characters

From: Marco.Cimarosti@icl.com
Date: Tue Sep 28 1999 - 11:16:44 EDT


I am not sure I understood correctly, but it seems to me that you are basing
your observation on the very peculiar behavior of your application.

I understand that your hypothetical terminal software is trying to render
Unicode text as soon as it arrives, CHARACTER BY CHARACTER.

The font used by your terminal, I understand, has no combining characters.
So, each time it receives an e (say), it has to wait for the next character
to see whether it is a combining ^ (say) because, in that case, the
two-character sequence would be converted to ê.
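
To make sure we are talking about the same thing, here is a minimal Python
sketch of the behavior as I understand it. The class and method names are my
own invention, not anything from your software, and I am assuming the
terminal composes each finished sequence to a precomposed character (NFC)
before drawing it.

```python
import unicodedata


class CharByCharRenderer:
    """Hypothetical terminal that draws text one character at a time."""

    def __init__(self):
        self.pending = ""   # base character (plus accents) not yet drawn
        self.screen = []    # glyphs already drawn

    def feed(self, ch):
        # A combining mark attaches to the pending base character.
        if self.pending and unicodedata.combining(ch):
            self.pending += ch
            return
        # Anything else proves the pending sequence is complete.
        self.flush()
        self.pending = ch

    def flush(self):
        # Compose the finished sequence (e + combining ^ becomes ê) and draw it.
        if self.pending:
            self.screen.append(unicodedata.normalize("NFC", self.pending))
            self.pending = ""

    def displayed(self):
        return "".join(self.screen)


term = CharByCharRenderer()
for ch in "login:":
    term.feed(ch)
print(repr(term.displayed()), repr(term.pending))
```

After the whole prompt has arrived, the colon is still sitting in `pending`,
which is exactly the stall you describe.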

From a global perspective, this is not the major problem that I see in your
design: trying to render Unicode text on a per-character basis would NEVER
work with many other features of Unicode.

For instance, imagine that you receive text in Arabic or in an Indic script.
Because of the way these scripts are specified in Unicode, you need to have
a whole "block" (i.e. line or paragraph) of text before you have all the
information you need for the complex processes of bidirectional reordering,
Indic reordering, and contextual shaping required for these writing systems.
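
As a rough illustration only, a renderer could at least detect when it has
left the "easy" territory by looking at the bidirectional class of each
character. The function name below is hypothetical, and the check covers
only right-to-left text; it says nothing about Indic reordering or shaping.

```python
import unicodedata

# Bidi classes for right-to-left letters and Arabic numbers.
RTL_CLASSES = {"R", "AL", "AN"}


def needs_line_context(text):
    """Return True if the text contains characters that cannot be placed
    correctly until the whole line is known (right-to-left content)."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


print(needs_line_context("login:"))
print(needs_line_context("\u0645\u0631\u062d\u0628\u0627"))  # Arabic letters
```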

But there is no need for exotic alphabets or combining accents to screw up
your design: sticking to good old ASCII, what would your modem script do if
the prompt "login:" were translated into the Italian "codice d'accesso:"?
It would wait, I think, until the Italian government changes the
constitution to drop Italian and adopt English as the official language.

If such a medieval design cannot be avoided because of technical
constraints, it would be wiser, in my opinion, to do one of the following:

- support Unicode only after login;

- limit Unicode support in the login phase to the ASCII range (U+0000 to
U+007F) or, at best, to Latin-1 (U+0000 to U+00FF), and not even try to
implement relatively complex things such as combining accents;

- require that the prompt and the answer be on separate lines: in this case,
the line terminator character(s) would act as the "higher level protocol" to
signal "ok, I'm finished transmitting, now it's your turn" that you
suggested;

- re-engineer the login and terminal software entirely, using more
up-to-date techniques.
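
To show what I mean by the line terminator option, here is a small Python
sketch. The names `complete_lines` and `wait_for` are made up for this
example, and I am assuming the host ends every prompt with LF or CR LF and
that the script compares text in composed form (NFC). A prompt counts as
received only once its line terminator has arrived, so trailing combining
characters are no longer a problem.

```python
import unicodedata


def complete_lines(chunks):
    """Yield each finished line from a stream of arbitrary text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # The line terminator is the "your turn" signal: nothing is
        # reported until it has been seen.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            yield unicodedata.normalize("NFC", line.rstrip("\r"))


def wait_for(chunks, prompt):
    """Return True once a complete line equal to the prompt has arrived."""
    target = unicodedata.normalize("NFC", prompt)
    return any(line == target for line in complete_lines(chunks))


print(wait_for(["log", "in:\r\n"], "login:"))   # terminated prompt
print(wait_for(["login:"], "login:"))           # no terminator yet
```

The price is that the user's answer appears on the line below the prompt,
which seems a small one to pay.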

Regards.
        Marco Cimarosti

> -----Original Message-----
> From: Frank da Cruz [SMTP:fdc@watsun.cc.columbia.edu]
> Sent: 1999 September 28, Tuesday 16.10
> To: Unicode List
> Cc: unicode@unicode.org
> Subject: Re: A basic question on encoding Latin characters
>
> > Um, at that time the normalization hadn't been done. So at that time
> > there weren't _technical_ reasons for drawing a line at the
> > normalization border. The line was drawn after that time. It could have
> > been before. But it has been drawn and there had better be really good
> > reasons offered if we are not to respect it.
> >
> In interactive telecommunications, we have the following situation:
>
> 1. Host sends "login:" (or any other prompt).
> 2. User is supposed to type her ID (or any other response).
>
> When using Unicode, the terminal emulator may not print the final
> character of the prompt because it doesn't know yet whether any combining
> characters will follow. So the user doesn't know whether the host is ready
> to receive a response and therefore should not reply since in some cases
> (e.g. at the UNIX "Password:" prompt) an early response is discarded.
>
> If the process is being executed by a script, the script sits and waits;
> "waitfor 'login:'" will not succeed, since it can not be known whether
> 'login:' has arrived until the next base character after ':' comes, but no
> such character is coming (I realize it is silly to expect a colon to have
> an accent but those are the rules -- and not all prompts end with colon).
>
> There is no escape from this situation other than introduction of a
> "higher level protocol" to signal "ok, I'm finished transmitting, now it's
> your turn", just like in the old half-duplex days.
>
> This is the kind of reason that telecommunications-oriented applications
> seem to be steering away from the Normalization Form D model, however
> appropriate it might be in other areas, and embracing Normalization Form C
> (ISO 10646 Level 1) and, by extension, precomposed characters, as we have
> seen in Plan 9 and now, it seems, Linux. I don't think this indicates
> recalcitrance or West European bias in UNIX culture as much as a desire to
> preserve telecommunications and the terminal/host model as a viable
> interface between human and machine in the Unicode age, as it has been
> since the beginning of the computer age. I also think it's no accident
> that Unicode is best supported on those platforms that have eschewed the
> terminal/host access model.
>
> - Frank


