Re: Unicode support under Linux

From: Markus Kuhn
Date: Tue Jan 12 1999 - 15:25:51 EST

Ulrich Drepper wrote on 1999-01-12 16:02 UTC:
> > I hope that eventually xterm can be started with some "-utf8" option and
> > then the displayed text will be interpreted as UTF-8, the keyboard
> > generates UTF-8 codes, and cut&paste functions will operate with UTF-8
> > as well.
> I'd suggest to use escape codes to do soft-switches.

The switching mechanisms that I would suggest for xterm are:

   - command line options -utf8 and +utf8
   - corresponding resource entries
   - environment variables (find the string "UTF-8" or "utf-8"
     somewhere in LC_CTYPE).
   - the ESC sequences

        ESC % G      for activating UTF-8 (allowing to leave again)

        ESC % @      for leaving this mode again

        ESC % / G \
        ESC % / H  > for activating UTF-8 (*not* allowing to leave again
        ESC % / I /  via ESC % @, i.e. permanently leaving the ISO 2022
                     world for the rest of the session)

     as specified in <> and
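
For illustration only, here is a minimal sketch of how a program might
emit the two reversible sequences listed above (ESC % G and ESC % @).
The names UTF8_ENTER, UTF8_LEAVE and term_set_utf8() are hypothetical
and not taken from xterm or any existing library:

```c
#include <stdio.h>

/* ESC is byte 0x1B (octal 033). */
static const char UTF8_ENTER[] = "\033%G";  /* ESC % G: activate UTF-8 */
static const char UTF8_LEAVE[] = "\033%@";  /* ESC % @: leave UTF-8 mode */

/* Hypothetical helper: write the switch sequence to the given stream.
 * Returns 0 on success and -1 if the write failed. */
int term_set_utf8(FILE *out, int enable)
{
    if (fputs(enable ? UTF8_ENTER : UTF8_LEAVE, out) == EOF)
        return -1;
    return fflush(out) == EOF ? -1 : 0;
}
```

Whether a given terminal actually honours these sequences is of course
up to the terminal; the sketch only shows the bytes being sent.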

How to activate a UTF-8 mode in Linux applications is an important

For applications (more, less, vi, emacs, etc.) we will need a standard
convention of how to tell them that the system character encoding is now
UTF-8 and not (as most applications assume today by default, unless they
were written under Plan9) ASCII or ISO 8859-*.

The best convention I can think of is to search for the substring
"UTF-8" in the environment variable LC_CTYPE, just as emacs activates
its 8-bit mode if it finds the string "8859" in LC_CTYPE.
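
A rough sketch of the substring test described above, assuming the
application looks only at LC_CTYPE and accepts the two spellings
"UTF-8" and "utf-8". The function names are hypothetical:

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the locale name contains "UTF-8" or "utf-8", else 0. */
int locale_name_is_utf8(const char *name)
{
    if (name == NULL)
        return 0;
    return strstr(name, "UTF-8") != NULL || strstr(name, "utf-8") != NULL;
}

/* Hypothetical helper: apply the test to the LC_CTYPE environment
 * variable only, as proposed in the text. */
int env_requests_utf8(void)
{
    return locale_name_is_utf8(getenv("LC_CTYPE"));
}
```

A real program would probably also want to consult LC_ALL and LANG,
since POSIX gives LC_ALL precedence over LC_CTYPE and falls back to
LANG when neither is set.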

What do you think about that approach?

How is glibc 2.1 going to detect whether the character encoding is UTF-8
or not? Same LC_CTYPE convention?

Or should the application call some libc mb* function to test whether
UTF-8 has been selected somehow via LC_CTYPE?
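
One possible shape of such a library-based test, as a sketch only:
POSIX specifies nl_langinfo(CODESET), which reports the character
encoding name of the current locale. Whether glibc 2.1 will offer
exactly this, and whether the name is always spelled "UTF-8", are
assumptions here; the function name codeset_is_utf8() is hypothetical.

```c
#include <langinfo.h>
#include <locale.h>   /* for setlocale() in the caller */
#include <string.h>

/* Returns 1 if the current locale's codeset is named "UTF-8".
 * The caller must have called setlocale(LC_CTYPE, "") beforehand,
 * otherwise the program is still in the default "C" locale. */
int codeset_is_utf8(void)
{
    const char *cs = nl_langinfo(CODESET);
    return cs != NULL && strcmp(cs, "UTF-8") == 0;
}
```

Compared with scanning LC_CTYPE by hand, this leaves the interpretation
of the environment variables to libc.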

I would like to see a trivial application that has to count characters
in text strings (e.g., "wc" or "more") made correctly UTF-8 capable, as
an example that helps C programmers understand how to program correctly
in a world where 1 byte == 1 character no longer holds, because bytes
of the form 10xxxxxx must not be counted as separate characters in
UTF-8.
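
The counting rule can be sketched in a few lines of C. This is only an
illustration, assuming the input is a well-formed, NUL-terminated UTF-8
string; the name utf8_count_chars() is hypothetical:

```c
#include <stddef.h>

/* Count characters in a UTF-8 string by skipping continuation bytes.
 * A continuation byte has the bit pattern 10xxxxxx, i.e.
 * (byte & 0xC0) == 0x80. Every other byte starts a new character. */
size_t utf8_count_chars(const char *s)
{
    size_t n = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For pure ASCII input this gives the same result as strlen(). It does not
validate the input, so malformed sequences are counted as whatever
their lead bytes suggest.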


Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at,  WWW: <>
