Re: UTF-8, C1 controls, and UNIX

From: Frank da Cruz
Date: Fri Mar 02 2001 - 10:42:48 EST

> On Thu, 1 Mar 2001 11:00:45 -0800 (GMT-0800), Frank da Cruz wrote:
> This information may be a bit outdated, since it is more than a decade
> since I worked daily with VMS.
> >VMS is an example of a platform that really, really takes advantage of
> >ISO standards.
> Or the other way around, since ISO-8859-1 came after DEC-MCS.
Yes, ISO 8859-1 was patterned on DEC MCS. But once ISO 8859-1 came out,
DEC supported both. For example, in the VT320 setup mode, you can select
either one.

> >When you log in to VMS, it sends an escape sequence to
> >query the terminal.
> This only happens if you have the set terminal/inquire command in the
> or command files. With non-VT100/200/300 series
> compatible terminals connected to a host, it is better to leave out
> this command at least from, since the sequence might cause
> havoc with these terminals.
This has little to do with Unicode, but yes, what you say is correct.
In my experience, though, I have never seen a VMS system that did not
send the "what are you?" escape sequence from the system profile.
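
For readers who have not seen the "what are you?" exchange, here is a
minimal Python sketch of it. It assumes the query is the primary Device
Attributes request (CSI c) and that a first reply parameter of 62 or
higher marks a VT200-class or later terminal; the function names are
made up for illustration and nothing here is taken from VMS itself.

```python
import re

# Primary Device Attributes request: CSI c, in its 7-bit form (ESC [ c).
DA_REQUEST = b"\x1b[c"

# A reply has the shape CSI ? Ps ; Ps ; ... c, where CSI may arrive
# either as ESC [ or as the single C1 byte 0x9B.
_DA_REPLY = re.compile(rb"(?:\x1b\[|\x9b)\?([0-9;]*)c")


def parse_da_reply(reply: bytes) -> list[int]:
    """Return the numeric parameters of a DA reply, or [] if it does not match."""
    match = _DA_REPLY.fullmatch(reply)
    if match is None:
        return []
    return [int(p) for p in match.group(1).split(b";") if p]


def claims_c1_capability(params: list[int]) -> bool:
    # Assumption: a first parameter of 62 or more (VT220 = 62, VT320 = 63)
    # means the terminal understands 8-bit controls; a VT100 reports 1.
    return bool(params) and params[0] >= 62
```

A host-side login script would send DA_REQUEST, read the reply with a
timeout, and only then decide whether to switch the terminal into 8-bit
mode. A terminal that does not understand the request may print garbage
instead, which is the havoc mentioned above.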

> >If the response indicates C1 capability, the host
> >sends an escape sequence commanding the terminal into C1 mode. This is
> >how it has worked for nearly 20 years.
> If I remember correctly, there is a command like
> set terminal/noeightbitcontrols
> that at least prevents sending out C1 controls, but I do not remember
> what it does to input characters in the C1 range.
You can disable the terminal sending C1 controls, but that does not prevent
the host from accepting C1 controls AS CONTROLS.
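
To make "C1 controls AS CONTROLS" concrete: under ISO 2022, each C1 byte
in the range 0x80-0x9F is equivalent to ESC followed by the byte minus
0x40, so 0x9B (CSI) means the same as ESC [. The Python sketch below
shows that mapping, plus the two DEC sequences that select which form
the terminal sends. The helper names are hypothetical.

```python
ESC = 0x1B

# Sequences a host can send to a VT200-class terminal to choose the form
# in which the terminal transmits C1 controls.
S7C1T = b"\x1b F"  # ESC SP F: send C1 controls as 7-bit ESC sequences
S8C1T = b"\x1b G"  # ESC SP G: send C1 controls as single 8-bit bytes


def c1_to_7bit(byte: int) -> bytes:
    """Return the two-byte ESC form of a single C1 control byte."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError(f"0x{byte:02X} is not a C1 control")
    return bytes([ESC, byte - 0x40])


def expand_c1(data: bytes) -> bytes:
    """Rewrite every 8-bit C1 control in data as its 7-bit equivalent."""
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0x9F:
            out += c1_to_7bit(b)
        else:
            out.append(b)
    return bytes(out)
```

For example, expand_c1(b"\x9b2J") yields b"\x1b[2J", the familiar
erase-display sequence. A host that accepts both forms treats the two
byte strings identically, regardless of which one the terminal was told
to send.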

This discussion is becoming a bit esoteric, but I think there is still a
lesson in it. DEC -- the maker of VMS, the VT220 terminal, etc. -- was
devoted to ISO standards. They played by the rules, and this was a
large part (but by no means the only part) of their downfall. One obvious
example that has nothing to do with Unicode is their enormous efforts to
support the OSI networking model, at the expense of TCP/IP. Look where
*that* got them. Yes, when you look at the addressing, security, and
performance problems of the Internet today, OSI doesn't look so bad;
maybe those standards committees knew a thing or two after all.

More on-topic, however, is DEC's attitude towards character sets. They
supported and vigorously upheld ISO 4873, 2022, and 6429 in their products.
A case in point is the long-dead Rainbow MS-DOS PC, a direct competitor
of the IBM PC in the early days of Intel 808x-based PCs, when IBM
compatibility was not yet a given. DEC's approach to the console was
radically different
from IBM's: the console was an ISO 2022 (VT220) terminal. It was formatted
using escape sequences, not by poking magic values into magic locations in
the display adapter. Just as on the IBM PC, you could put forms on the
screen, but you did it by the rules, using ISO 2022 to switch among
ASCII, ISO 8859-1 (or whatever), and the line/box-drawing character set.
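
As a rough illustration of doing it "by the rules", the Python sketch
below builds the byte stream for a box using only escape sequences. It
assumes a VT100/VT220-compatible terminal or emulator: ESC ( 0
designates the DEC Special Graphics (line-drawing) set into G0, ESC ( B
designates ASCII again, and CSI row ; col H positions the cursor. The
function names are invented for the example.

```python
ESC = "\x1b"
SELECT_LINE_DRAWING = ESC + "(0"  # designate DEC Special Graphics as G0
SELECT_ASCII = ESC + "(B"         # designate ASCII as G0 again


def cursor_to(row: int, col: int) -> str:
    """CSI row ; col H, with 1-based coordinates."""
    return f"{ESC}[{row};{col}H"


def box(row: int, col: int, width: int, height: int) -> str:
    """Return the escape-sequence stream that draws an empty box."""
    if width < 2 or height < 2:
        raise ValueError("box needs at least 2 columns and 2 rows")
    inner = width - 2
    # In the line-drawing set: l k m j are the four corners,
    # q is a horizontal line and x is a vertical line.
    lines = ["l" + "q" * inner + "k"]
    lines += ["x" + " " * inner + "x"] * (height - 2)
    lines.append("m" + "q" * inner + "j")
    body = "".join(cursor_to(row + i, col) + line for i, line in enumerate(lines))
    return SELECT_LINE_DRAWING + body + SELECT_ASCII
```

Writing box(2, 5, 20, 4) to any conforming terminal draws the same
frame, whatever is running behind it, and no display memory is touched.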

Interestingly, you could boot not only MS-DOS, but also CP/M-80 and
CP/M-86, *and* you could use it as a terminal without booting anything,
and the screen could be formatted in the same way in all these
environments. This approach made a huge amount of sense, but it was "too
hard" for most developers to understand. It was so much easier to write a
BASIC program that poked magic numbers into magic addresses, and that's
what everybody did. It's how we wound up with so-called "ANSI" terminal
emulation (perhaps the worst misnomer in all computing history), and the
atrocious practice of putting proprietary PC code pages on the wire in
open communications protocols like TCP/IP.

The lesson is that a standard that is hard to understand or implement, or
that gets in the way of people trying to do what they want, will be
bypassed or ignored. UTF-8 might be heading in this direction. It's
popular and widely used today, but look at what's going on in Linux land.
They are developing UTF-8 terminal windows, and people will use them.
Eventually they will run up against the C1 problem, either in Linux itself,
or when trying to interoperate between Linux and non-UTF-8-aware hosts.
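
One way to see the C1 problem is to look at the bytes. UTF-8 trail
bytes occupy 0x80-0xBF, which includes the whole C1 range 0x80-0x9F, so
an ordinary character can contain a byte that an 8-bit ISO 2022
terminal takes for a control. The Python sketch below (helper names are
hypothetical) finds such bytes, and also checks the reverse case, where
a raw 8-bit control is not valid UTF-8 at all.

```python
def c1_bytes_in_utf8(text: str) -> list[tuple[str, int]]:
    """List (character, byte) pairs whose UTF-8 byte falls in 0x80-0x9F."""
    hits = []
    for ch in text:
        for b in ch.encode("utf-8"):
            if 0x80 <= b <= 0x9F:
                hits.append((ch, b))
    return hits


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

U+00DB is encoded as C3 9B, and 0x9B is CSI, so a terminal that is not
UTF-8 aware would start parsing a control sequence in the middle of the
character. In the other direction, is_valid_utf8(b"\x9b2J") is False,
because a lone 0x9B is a stray trail byte to a UTF-8 decoder.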

- Frank
