Ulrich Drepper wrote on 1999-01-12 16:02 UTC:
> > I hope that eventually xterm can be started with some "-utf8" option and
> > then the displayed text will be interpreted as UTF-8, the keyboard
> > generates UTF-8 codes, and cut&paste functions will operate with UTF-8
> > as well.
>
> I'd suggest to use escape codes to do soft-switches.
The switching mechanisms that I would suggest for xterm are:
- command line options -utf8 and +utf8
- corresponding resource entries
- environment variables (find the string "UTF-8" or "utf-8"
somewhere in LC_CTYPE).
- the ESC sequences
    ESC % G       for activating UTF-8 (allowing to leave again)
    ESC % @       for leaving this mode again
    ESC % / G  \
    ESC % / H   > for activating UTF-8 (*not* allowing to leave again
    ESC % / I  /  via ESC % @, i.e. permanently leaving the ISO 2022
                  world for the rest of the session)
as specified in <http://www.itscj.ipsj.or.jp/ISO-IR/> and
<ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/ISO-10646-UTF-8.html>.
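
For illustration, here is how a program might emit the two reversible
sequences from the list above. This is only a sketch: the macro names and
the helper send_seq() are made up for this example, and it assumes the
terminal on the other end actually honours ESC % G and ESC % @.

```c
#include <stdio.h>

/* ESC % G: activate UTF-8, leaving again via ESC % @ stays possible. */
#define SEQ_UTF8_ENTER "\x1b%G"
/* ESC % @: leave UTF-8 mode again. */
#define SEQ_UTF8_LEAVE "\x1b%@"

/*
 * Hypothetical helper: write one switch sequence to a terminal stream.
 * fputs() is used instead of printf() because the sequences contain '%',
 * which printf() would treat as the start of a conversion.
 * Returns 0 on success, -1 on a write error.
 */
static int send_seq(FILE *tty, const char *seq)
{
    if (fputs(seq, tty) == EOF)
        return -1;
    return fflush(tty) == EOF ? -1 : 0;
}
```

A caller would do send_seq(stdout, SEQ_UTF8_ENTER) before writing UTF-8
text and send_seq(stdout, SEQ_UTF8_LEAVE) afterwards.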
How to activate a UTF-8 mode in Linux applications is an important
question:
For applications (more, less, vi, emacs, etc.) we will need a standard
convention for telling them that the system character encoding is now
UTF-8 and not (as most applications assume today by default, unless they
were written under Plan9) ASCII or ISO 8859-*.
The best convention I can think of is to search for the substring
"UTF-8" in the environment variable LC_CTYPE, just as emacs
activates its 8-bit mode if it finds the string "8859" in LC_CTYPE.
What do you think about that approach?
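
A minimal sketch of that substring convention, to make the proposal
concrete. The function names are hypothetical, and the sketch looks only
at LC_CTYPE as described above; it deliberately ignores other locale
variables such as LC_ALL and LANG.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the given locale string contains
 * "UTF-8" or "utf-8" anywhere, 0 otherwise (including for NULL).
 */
static int names_utf8(const char *s)
{
    return s != NULL &&
           (strstr(s, "UTF-8") != NULL || strstr(s, "utf-8") != NULL);
}

/* Applies the convention to the LC_CTYPE environment variable only. */
static int env_wants_utf8(void)
{
    return names_utf8(getenv("LC_CTYPE"));
}
```

With LC_CTYPE=en_GB.UTF-8 this reports UTF-8; with
LC_CTYPE=de_DE.ISO-8859-1 or with LC_CTYPE unset it does not.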
How is glibc 2.1 going to detect whether the character encoding is UTF-8
or not? Same LC_CTYPE convention?
Or should the application call some libc mb* function to test whether
UTF-8 has been selected somehow via LC_CTYPE?
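
One way the "ask libc" variant could look is sketched below. It uses the
POSIX call nl_langinfo(CODESET) rather than an mb* function. Whether
glibc 2.1 will offer exactly this is an assumption on my part, and the
codeset name returned for a UTF-8 locale is assumed to be the string
"UTF-8"; other systems may spell it differently. The function name
locale_is_utf8() is made up.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: adopt the LC_CTYPE locale from the environment
 * and ask the C library which character encoding it selected.
 * Returns 1 if the reported codeset is "UTF-8", 0 otherwise.
 */
static int locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;               /* locale not supported by this system */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

Compared with the substring search, this leaves the parsing of the
locale name to the library, so the application does not need to know
how locale names are spelled.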
I would like to see a trivial application that has to count characters
in text strings (e.g., "wc" or "more") made correctly UTF-8 capable,
as an example that helps C programmers understand how to program
correctly in a world where 1 byte == 1 character no longer holds,
because bytes of the form 10xxxxxx must not be counted as separate
characters in UTF-8.
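
The counting rule itself is short enough to sketch here. The function
name utf8_count() is hypothetical, and the sketch assumes the input is
already valid UTF-8; it does no error checking on malformed sequences.

```c
#include <stddef.h>

/*
 * Count the characters in a UTF-8 buffer of len bytes.
 * Every byte of the form 10xxxxxx is a continuation byte and is
 * skipped; all other bytes start a new character and are counted.
 */
static size_t utf8_count(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For pure ASCII input the result equals the byte count, so a "wc"-like
tool using this rule behaves as before on old text files.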
Markus
-- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>