RE: A basic question on encoding Latin characters

From: Kenneth Whistler
Date: Tue Sep 28 1999 - 15:12:13 EDT

Frank continued this discussion:

> > If the protocol says that "login:" is to be acted upon, I don't see why
> > the terminal-side script couldn't act on it without waiting for eventual
> > combining characters that won't be coming. There's no use in waiting for
> > the next base character, the triggering string has been received.
> >
> But then is the application "Unicode compliant"?

Of course it is. If the application is waiting for "login:", it is not
waiting for "login:" with an acute accent on the colon. It is interpreting
what it is supposed to, given the characters encoded at the code
values they have. If the communicator then sends a combining acute accent,
that is a *protocol* error, not a Unicode compliance problem.

> But more to the point
> (bearing in mind that we are speaking not just of logging in, but any prompt
> and response), if we ignore the possibility that combining characters might
> follow the trigger string, then we can have "false positives", or for that
> matter also false negatives.

Once again, this would be a *protocol* error. If the communication protocol
is waiting for "xxxxá", then it should act when it receives the final
"á" as a unit, or if it has received an "a", then it should act when it
receives the final combining acute accent. And ordinarily the communication
protocol should specify a normalized form, so it doesn't have to deal
with alternative forms as equivalent for these purposes.
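
To make the point about a normalized form concrete, here is a minimal sketch in Python. The helper name matches_trigger and the sample strings are assumptions made for illustration, not part of any real protocol; unicodedata.normalize is the standard library call.

```python
import unicodedata

def matches_trigger(received: str, trigger: str) -> bool:
    """Compare two strings after putting both into Normalization Form C."""
    return (unicodedata.normalize("NFC", received)
            == unicodedata.normalize("NFC", trigger))

# Hypothetical trigger strings: the same text in two encodings.
precomposed = "xxxx\u00e1"   # ends in U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "xxxxa\u0301"   # ends in "a" + U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)                 # False: different code points
print(matches_trigger(decomposed, precomposed))  # True: equal once normalized
print(matches_trigger("xxxxa", precomposed))     # False: the accent is missing
```

If the protocol fixes one form in advance, only the plain comparison is needed; the normalizing comparison is what a receiver would otherwise have to do on every match.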

And many of these call/response protocols wait for a control code as the
trigger anyway, right? Very often the EOL. Otherwise they are rather
badly behaved, for interactive work anyway, since a host would then always
be sending bad typists irrelevant error messages without letting them
backspace and correct their errors before committing to send a chunk for
interpretation as a response/command/whatever.

> "Mark E. Davis" wrote:
> > We should make it very clear that Normalization Form C does *not*
> > eliminate combining characters. It does precompose them where possible,
> > but for many scripts and characters it is not possible, or desirable.
> >
> Yes, this is spelled out very clearly in the technical report. In this way
> Unicode Normalization Form C differs from ISO 10646 Implementation Level 1,
> in which "a CC element shall not contain coded representations of combining
> characters". I think this more accurately represents the position taken by
> the authors of Plan 9 and (correct me if I'm wrong) those working on the
> Linux console and UTF-8 xterm.

And as the Unicoders have continually pointed out, Implementation Level 1
is a crutch for brain-damaged implementations that cannot handle anything
complex. It rules out support for all of the complex scripts of the world.
It does, however, do a reasonable job of covering Europe and East Asia,
aside from some minority languages. Hmmm. Sound like a recipe for maintaining
the computing access status quo to anyone?

> > Exactly the same problem that you discuss occurs with any script that
> > requires shaping. When I type an Arabic character, the previous character
> > needs to change shape. What the terminal needs to do is replace the glyph
> > on the screen with a different form. As I recall from my terminal days,
> > the controls for doing this are available. The same technique can be used
> > for accents. Type an A, see an A. Then type an umlaut, and the host picks
> > it up, decides that it needs a composed presentation form, and replaces
> > the A by Ä on the screen. Of course, the display on the terminal still
> > depends on the "font" that it has, which may or may not allow dynamic
> > composition, but fundamentally I don't see the problem.
> >
> The real problem comes in scripting. Scripts are a method of forcing
> intrinsically noncooperating processes to cooperate. Suppose a script is
> looking for "ABC", and ABC comes. If the next character will be a combining
> cedilla, this would not be a match. But if no more characters are coming
> (e.g. until there is some kind of response) then it would be, but how can
> the script know?

By the EOL or other end-of-content marking built into the protocol.
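
As a rough sketch of what that looks like on the script side, assuming a newline as the EOL and a hypothetical LineTrigger class (neither is taken from any particular scripting language), the matcher simply declines to decide anything until the delimiter arrives:

```python
class LineTrigger:
    """Hypothetical helper: act only on complete, EOL-terminated lines."""

    def __init__(self, trigger: str, eol: str = "\n"):
        self.trigger = trigger
        self.eol = eol
        self.buffer = ""

    def feed(self, chunk: str) -> bool:
        """Add received text; return True if a completed line equals the trigger."""
        self.buffer += chunk
        matched = False
        while self.eol in self.buffer:
            # Peel off one complete line; keep the remainder buffered.
            line, self.buffer = self.buffer.split(self.eol, 1)
            if line == self.trigger:
                matched = True
        return matched

t = LineTrigger("ABC")
print(t.feed("ABC"))       # False: no EOL yet, so nothing is decided
print(t.feed("\u0327\n"))  # False: the line was "ABC" + combining cedilla
print(t.feed("ABC\n"))     # True: exactly "ABC", closed by the EOL
```

No timeout is involved: the combining cedilla case and the plain case are told apart by what precedes the delimiter.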

How many of these script protocols can you point to that really are
sitting poised hair-triggered forever waiting for the right (character)
byte to come down the wire? Or if they are, isn't the triggering character
usually a control delimiter of some sort? If you are worried about
false positives for some string followed by a combining character,
why not that same string followed by *ANY* character? You would have
to guarantee that no long response has any prefix that could be
misinterpreted (before the response was completely received) as a
shorter response.
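
That guarantee can be checked mechanically. The following is only an illustrative sketch: ambiguous_pairs is a hypothetical helper and the response strings are made up. It shows that a trailing combining character is just one more way for a short response to be a prefix of a longer one.

```python
from itertools import permutations

def ambiguous_pairs(responses):
    """Return (short, long) pairs where short is a proper prefix of long."""
    return [(a, b) for a, b in permutations(responses, 2)
            if a != b and b.startswith(a)]

# An ordinary prefix clash, no combining characters involved.
print(ambiguous_pairs(["OK", "OKAY", "ERROR"]))     # [('OK', 'OKAY')]

# The combining cedilla case is the same kind of clash.
print(len(ambiguous_pairs(["ABC", "ABC\u0327"])))   # 1

# Terminating every response with an EOL removes the ambiguity.
print(ambiguous_pairs(["OK\n", "OKAY\n"]))          # []
```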

> The best we can do is set a timeout period that is long
> enough to allow for the longest possible intercharacter spacing on the
> busiest day of the Internet and hope we haven't guessed wrong.

Why isn't this exactly the same problem for any prefix of any response,
even without combining characters?

> And even if
> we haven't, this technique would cause every match to consume the entire
> timeout interval.

Sounds like a purty flimsy strawman to me.


> - Frank
