Re: Normalization Form KC for Linux

From: Rick McGowan
Date: Fri Aug 27 1999 - 19:37:28 EDT

Frank said...

> This raises issues discussed long ago. The model of base character
> followed by nonspacing diacritics (as opposed to the other way around)
> does not mesh well with terminal/host communication

I completely beg to differ with this opinion. If your host<->tty stream is
well behaved with respect to its data encoding, there should be absolutely no
difficulty with a combining diacritic model. It is true that if you have an
ill-behaved data stream that intermixes displayable entities with control
sequences, you can have situations where, for example, a cursor positioning
sequence is interposed between a base glyph and a glyph intended to combine
with it. That doesn't mean the model is incompatible with tty/host
communications; it means the data stream is ill-behaved.
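
To make the "well-behaved stream" case concrete, here is a minimal sketch in
Python of how a receiver might group incoming characters into display units.
The function name group_display_units is hypothetical, and the sketch assumes
that control sequences have already been stripped or handled elsewhere, and
that "combining" means general category Mn or Me. It is an illustration under
those assumptions, not a description of any real terminal emulator.

```python
import unicodedata

# Assumption: nonspacing (Mn) and enclosing (Me) marks are the characters
# that attach to a preceding base character.
COMBINING_CATEGORIES = ("Mn", "Me")

def group_display_units(stream):
    """Group characters into units of one base plus its trailing marks."""
    units = []
    for ch in stream:
        if unicodedata.category(ch) in COMBINING_CATEGORIES and units:
            # The mark joins the most recent unit, so base and mark are
            # drawn together instead of the base being redrawn later.
            units[-1] += ch
        else:
            # A base character, or a stray mark with nothing before it,
            # starts a new unit.
            units.append(ch)
    return units
```

For example, group_display_units("A\u0301b") yields two units: the A together
with its acute accent, then the b. If a cursor positioning sequence were
interposed between the A and the accent, the accent would attach to whatever
came last, which is the ill-behaved case described above.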

Seems like Frank should be saying that EXISTING tty data protocols with
EXISTING data & codesets depend on NOT having a dynamically combining stream
of glyphs being shoved around.

> Then a nonspacing acute accent arrives.
> At this point, the terminal has to find where it left the A and change
> it to something else
> This is an awful lot of work and screen changing for little benefit when
> precomposed characters are already available.

The point that keeps being missed entirely in this kind of discussion is
that the entities in a protocol stream between tty and host are *not* Unicode
plain text -- they are data-encoding entities of a tty/host communication
protocol. Poorly behaved streams in such protocols can cause all kinds of
display glitches; that's the nature of control streams in display systems
with randomly positionable cursors. And it's not a *new* problem that
arrives with combining characters.

Assuming you have a well-behaved stream that puts displayable globs together
and doesn't interpose cursor re-positioning sequences in random places,
there should be nothing much difficult about managing a "live" interactive

What I would say is... the programming of terminal emulators is easier to
manage when the data entities in the host communication protocols correspond
to pre-composed sequences of Unicode characters. And in such protocols, it
is not unreasonable to require a particularly small, rigidly circumscribed,
and brutally normalized subset of Unicode data as input to the processing.
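
One way such a "brutally normalized" input requirement could look is sketched
below in Python, using the standard unicodedata module. The function name
normalize_for_terminal is hypothetical, and the default of NFC is an
assumption made for illustration (the thread subject mentions Form KC, which
can be selected by passing "NFKC"). The acceptance rule, that no combining
mark may survive normalization, is likewise just one possible policy.

```python
import unicodedata

def normalize_for_terminal(text, form="NFC"):
    """Compose text and report whether it is fully precomposed.

    Returns (normalized_text, ok), where ok is False if any nonspacing or
    enclosing mark remains after normalization.
    """
    normalized = unicodedata.normalize(form, text)
    # Any mark still present has no precomposed form with its base, so a
    # protocol restricted to precomposed characters would reject it.
    leftover = [c for c in normalized
                if unicodedata.category(c) in ("Mn", "Me")]
    return normalized, not leftover
```

With this sketch, "A" followed by U+0301 composes to the single character
U+00C1 and is accepted, while "q" followed by U+0301 has no precomposed form
and is flagged, leaving the protocol to decide whether to reject or escape it.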

Terminal emulation is going to be with us for a while, certainly. But I
wouldn't call it the wave of the future in user/host interactions. It is,
and will continue to be, slowly replaced by WWW form protocols, etc, etc.
You can see it happening all over the place at an increasing pace. And in
THOSE interactions between the end-user/client and the host system, there
need be no particular restriction about the data the user types, nor does
there need to be any prescription of the user's system itself about what can
and cannot be displayed. The display capability depends on the end-user
system -- the "host" doesn't care -- and the data that is shoved around in
their interactive session is STILL NOT Unicode plain text, it's Unicode text
embedded in a higher-level protocol. Which protocol of course is free to
prescribe any amount of rigidity in the data and format which it allows.
Some protocols are going to say "anything goes" and some will be more rigid.

