Detecting UTF-8 Locale Question

From: Edward H Trager (ehtrager@umich.edu)
Date: Tue Mar 25 2003 - 12:01:11 EST

    Hope some of the gurus with programming experience who read this list can
    provide me with some additional insight or pointers to resources about the
    following (NOTE: I've already looked at Markus Kuhn's FAQ):

    QUESTIONS:

    (1) Is examination of the LC_CTYPE environment variable on UNIX-like
    environments a sufficient way of detecting a UTF-8 locale?

    (2) Are there any UTF-8 competent terminals available for OS X/Darwin, or
    should one just use an X-based terminal like mlterm or xterm?

    (3) Aside from xterm or mlterm running under Cygwin, are there other UTF-8
    competent terminals available on Win32? Which ones are "the best"?

    I don't mind subjective responses regarding which are "the best"
    terminals. For example, I would personally rank mlterm as much more
    capable than xterm since it handles Arabic, Hebrew, and Indic scripts.

    DETAILS:

    I'm writing an interactive console-based program (i.e., started from
    xterm, mlterm, or another terminal emulator) for UNIX-like environments
    (this would include Cygwin on Win32 and Mac OS X/Darwin in addition to the
    obvious other ones like Solaris and Linux) which will support just two
    "locales": the ASCII subset of UTF-8, and UTF-8. That's it! For UTF-8,
    initially the program will support plane 0 (BMP). Support beyond plane 0
    probably won't ever be necessary.

    My initial plan for finding out about the current locale is that the
    program will, at start up, look at the LC_CTYPE environment variable. If
    that variable is defined and contains the substring "UTF-8" or regex-able
    variants thereof (like "utf8" on Linux), then everything is fine. If not,
    the program prints a warning suggesting that the user set a UTF-8 locale
    and shows an example of how to do that. Without a UTF-8 locale the program
    still functions, but of course any UTF-8 encoded data will not be
    displayed properly on the terminal.
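
    A minimal sketch of the start-up check described above, in Python. The
    function names, the exact regex, and the "en_US.UTF-8" locale name in the
    warning are assumptions made for illustration only. The sketch reads
    LC_CTYPE alone, as in the plan; it does not consult LC_ALL or LANG, which
    POSIX locale lookup also takes into account.

```python
import os
import re
import sys

# Hypothetical pattern: accepts "UTF-8", "utf8", "utf_8", and case variants.
_UTF8_PATTERN = re.compile(r"utf[-_]?8", re.IGNORECASE)


def names_utf8(value):
    """Return True if a locale string such as 'en_US.UTF-8' mentions UTF-8."""
    return bool(value) and _UTF8_PATTERN.search(value) is not None


def ctype_is_utf8(environ=None):
    """Check only LC_CTYPE, as in the plan above (LC_ALL and LANG are ignored)."""
    environ = os.environ if environ is None else environ
    return names_utf8(environ.get("LC_CTYPE"))


if __name__ == "__main__":
    if not ctype_is_utf8():
        # The locale name below is an assumed example; available names vary by system.
        print(
            "warning: LC_CTYPE does not name a UTF-8 locale; "
            "UTF-8 data may not display properly.\n"
            "Example (sh): export LC_CTYPE=en_US.UTF-8",
            file=sys.stderr,
        )
```

    Passing a plain dictionary as the environ argument allows the check to be
    exercised without changing the real environment.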

    (Of course, even if the locale *is* set to a UTF-8 locale, that doesn't
    guarantee that UTF-8 data will be displayed properly, because (1) glyphs
    may still be missing from the fonts on the system, and (2) the terminal
    may not handle the script properly (e.g., when I last checked, xterm
    didn't handle Indic or RTL scripts).)

    If anybody sees limitations to this approach (actually, I'm hoping you
    will!), please let me know.

    This approach seems sufficient using xterm under Cygwin and mlterm on
    Linux and OpenBSD, but I haven't got around to testing on Solaris yet.
    There might, for example, be much better ways to do it on Cygwin/Win32
    that I don't know about. Also, I don't have a clue how to do it on OS X.


