Re: C1 controls and terminals (was: Re: Euro character in ISO)

From: Doug Ewell (dewell@compuserve.com)
Date: Thu Jul 13 2000 - 11:47:50 EDT


Frank da Cruz <fdc@columbia.edu> wrote:

>> This is the widely reported compatibility problem between UTF-8 and
>> terminals. I know I read somewhere, possibly on Markus Kuhn's Unicode
>> page, possibly somewhere else, that ISO 2022 codes exist to switch out
>> of "ISO 2022 mode" and into "UTF-8 mode" and to either allow or prevent
>> switching back to 2022. Is there any progress on implementing this so
>> terminals and emulators can live with UTF-8?
>>
> Maybe Markus can clarify. I would be surprised if there's anything in
> ISO 2022 about UTF8, except that it does provide a way to switch out of
> and back into ISO 2022 mode, allowing the use of character sets that do
> not comply with ISO 2022 and 4873. That's what the designating escape
> sequences "with standard return" and "without standard return" are for.

Well, I didn't want to push the burden of explaining this onto Markus,
but upon reading his page again I found the relevant section:

> The ISO 2022 standard specifies a range of ESC % sequences for
> leaving the ISO 2022 world (designation of other coding system,
> DOCS), and a number of such sequences have been registered for UTF-8
> in section 2.8 of the ISO 2375 International Register of Coded
> Character Sets:
>
> * ESC %G activates UTF-8 with an unspecified implementation level
> from ISO 2022 in a way that allows to go back to ISO 2022 again.
> * ESC %@ goes back from UTF-8 to ISO 2022 in case UTF-8 had been
> entered via ESC %G.
> * ESC %/G switches to UTF-8 Level 1 with no return.
> * ESC %/H switches to UTF-8 Level 2 with no return.
> * ESC %/I switches to UTF-8 Level 3 with no return.
>
> While a terminal emulator is in UTF-8 mode, any ISO 2022 escape
> sequences such as for switching G2/G3 etc. are ignored. The only ISO
> 2022 sequence on which a terminal emulator might act in UTF-8 mode is
> ESC %@ for returning from UTF-8 back to the ISO 2022 scheme.
>
> UTF-8 still allows you to use C1 control characters such as CSI, even
> though UTF-8 also uses bytes in the range 0x80-0x9F. It is important
> to understand that a terminal emulator in UTF-8 mode must apply the
> UTF-8 decoder to the incoming byte stream *before* interpreting any
> control characters. C1 characters are UTF-8 decoded just like any
> other character above U+007F.

(from http://www.cl.cam.ac.uk/~mgk25/unicode.html)
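
To make the switching rules in the quoted list concrete, here is a
rough Python sketch of how an emulator might track which mode it is in.
The mode names and the track_mode() function are my own invention for
illustration, not anything from ISO 2022 or from Markus's page, and a
real ISO 2022 parser would need far more state than this.

```python
# Hypothetical sketch: track ISO 2022 / UTF-8 mode changes in a byte stream.
# Only the five ESC % sequences from the quoted list are recognized.

UTF8_WITH_RETURN = b"\x1b%G"   # ESC %G: enter UTF-8, return allowed
BACK_TO_ISO2022 = b"\x1b%@"    # ESC %@: return to ISO 2022
UTF8_NO_RETURN = (b"\x1b%/G", b"\x1b%/H", b"\x1b%/I")  # Levels 1, 2, 3


def track_mode(data: bytes, mode: str = "iso2022") -> list:
    """Return a list of (byte offset, new mode) for every mode switch.

    Modes (names are assumptions): "iso2022", "utf8", "utf8-locked".
    Scanning raw bytes is safe here because ESC, '%', '/', '@' and the
    letters are all ASCII, and ASCII bytes never occur inside a
    multi-byte UTF-8 sequence.
    """
    transitions = []
    i = 0
    while i < len(data):
        new_mode, length = None, 1
        if mode == "iso2022":
            if data.startswith(UTF8_WITH_RETURN, i):
                new_mode, length = "utf8", len(UTF8_WITH_RETURN)
            else:
                for seq in UTF8_NO_RETURN:
                    if data.startswith(seq, i):
                        new_mode, length = "utf8-locked", len(seq)
                        break
        elif mode == "utf8" and data.startswith(BACK_TO_ISO2022, i):
            # The only sequence acted on while in returnable UTF-8 mode.
            new_mode, length = "iso2022", len(BACK_TO_ISO2022)
        # In "utf8-locked" nothing is recognized: there is no way back.
        if new_mode is not None:
            mode = new_mode
            transitions.append((i, mode))
        i += length
    return transitions
```

Note that once the stream has been switched with one of the "no
return" sequences, a later ESC %@ is simply ignored in this sketch.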

The last paragraph of Markus's text echoes what Frank said about
"reversing the layers": perform the UTF-8 conversion first, and only
then look for control characters and escape sequences. True UTF-8
support, in terminal emulators and in other software as well, really
has to be built on UTF-8 decoding being performed first.
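
A small Python sketch shows why the order matters. The two function
names are made up for this example. U+00DB (capital U with circumflex)
is encoded in UTF-8 as C3 9B, so a terminal that looks for the raw CSI
byte 0x9B before decoding sees a control character that is not there,
while a real CSI (U+009B) arrives as the two bytes C2 9B.

```python
# Hypothetical sketch: looking for CSI before vs. after UTF-8 decoding.

CSI = "\u009b"  # C1 control character CSI


def find_csi_bytes_first(data: bytes) -> list:
    """Wrong layer order: scan the raw bytes for 0x9B."""
    return [i for i, b in enumerate(data) if b == 0x9B]


def find_csi_decode_first(data: bytes) -> list:
    """Right layer order: decode UTF-8, then look for U+009B."""
    text = data.decode("utf-8")
    return [i for i, ch in enumerate(text) if ch == CSI]


letter = "\u00db".encode("utf-8")       # b'\xc3\x9b', an ordinary letter
real_csi = CSI.encode("utf-8") + b"1m"  # b'\xc2\x9b1m', CSI followed by "1m"

print(find_csi_bytes_first(letter))     # false positive at byte offset 1
print(find_csi_decode_first(letter))    # nothing found
print(find_csi_decode_first(real_csi))  # CSI found at character offset 0
```

The byte-first scan reports a CSI in the middle of the letter; the
decode-first scan finds only the genuine one.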

-Doug Ewell
 Fullerton, California


