UTF-8, C1 controls, and UNIX

From: Frank da Cruz (fdc@columbia.edu)
Date: Wed Feb 28 2001 - 16:35:46 EST


The idea behind UTF-8 is that it can be used on UNIX versions that are not
Unicode-aware: it lets you have Unicode filenames, Unicode directory names,
Unicode file contents, Unicode email, etc. But what it does not do is let
you *type* Unicode into regular UNIX applications or shells if the UTF-8
happens to contain octets in the C1 control range (0x80-0x9F), as does the
encoding of many Cyrillic letters (e.g. capital A through PE, whose second
octets are 0x90 through 0x9F). Most UNIX terminal drivers treat incoming
C1 controls like their C0 counterparts, so 0x83 == 0x03 == Ctrl-C, which
interrupts whatever process you are talking to. Similarly 0x84 == Ctrl-D,
which is EOF; 0x88 is backspace, and so on.
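
As a rough illustration of the octets involved, here is a minimal Python
sketch. It assumes the driver in effect ignores the high bit of each
incoming octet, as described above; the function name c1_hazards is
made up for this example.

```python
def c1_hazards(text):
    """List (char, octet, c0) for each UTF-8 octet of text in 0x80-0x9F.

    c0 is the C0 control the octet turns into if the high bit is ignored,
    which is an assumption about how the terminal driver behaves.
    """
    hazards = []
    for ch in text:
        for octet in ch.encode("utf-8"):
            if 0x80 <= octet <= 0x9F:
                hazards.append((ch, octet, octet & 0x7F))
    return hazards


if __name__ == "__main__":
    # Cyrillic capital A (U+0410) through PE (U+041F)
    letters = "".join(chr(cp) for cp in range(0x0410, 0x0420))
    for ch, octet, c0 in c1_hazards(letters):
        # chr(c0 + 0x40) is the conventional Ctrl-<key> name, e.g. 0x03 -> C
        key = chr(c0 + 0x40)
        print(f"U+{ord(ch):04X} octet 0x{octet:02X} -> 0x{c0:02X} (Ctrl-{key})")
```

Under that assumption, U+0413 (UTF-8 D0 93) arrives as Ctrl-S (xoff) and
U+041A (D0 9A) as Ctrl-Z (commonly suspend). The small letters U+0443 and
U+0444 (D1 83, D1 84) give the Ctrl-C and Ctrl-D cases mentioned above.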

I suppose this is a statement of the obvious, but now that I'm using a
Unicode-based terminal emulator with the UTF-8 character set and trying to
compose e-mail and netnews containing Russian words in a Telnet session to
UNIX, the problem is suddenly concrete. We have said that UTF-8 is a kind
of "transport form" that must be decoded before (e.g.) terminal escape
sequences are interpreted in the host-to-terminal direction. That's fine,
the terminal
emulator can (and does) do that. But in the other direction there is no
such decoder on the UNIX end. The bare C1 octets are read by the UNIX
terminal driver, which treats them as interrupt, suspend, xoff, tab,
carriage return, linefeed, and all the rest. Here the model breaks down --
it is not symmetric.

The nice thing about ISO 8859-1 was that it could be freely used in UNIX,
in both directions, without UNIX knowing a thing about it. The same is not
true for UTF-8.
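
The difference can be seen by encoding the same text both ways. This is
only a sketch: the sample word is arbitrary and the helper name c1_octets
is made up for the example. ISO 8859-1 puts its graphic characters at
0xA0-0xFF, so printable text never produces an octet in 0x80-0x9F, whereas
UTF-8 trailing octets range over 0x80-0xBF.

```python
def c1_octets(data):
    """Return the octets of data that fall in the C1 range 0x80-0x9F."""
    return [b for b in data if 0x80 <= b <= 0x9F]


if __name__ == "__main__":
    # "Gruesse" with u-umlaut and sharp s; both exist in ISO 8859-1
    word = "Gr\u00fc\u00dfe"
    for codec in ("iso-8859-1", "utf-8"):
        encoded = word.encode(codec)
        dump = " ".join(f"{b:02X}" for b in encoded)
        hits = ", ".join(f"0x{b:02X}" for b in c1_octets(encoded)) or "none"
        print(f"{codec}: {dump} -> C1 octets: {hits}")
```

Here the ISO 8859-1 form contains no octets in the C1 range, while the
UTF-8 form of the sharp s (C3 9F) does.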

- Frank


