Re: UTF8 locale & shell encoding

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jan 16 2004 - 10:36:15 EST


    From: "Jon Hanna" <jon@hackcraft.net>
    > > It would be good to say that this depends on the compiler tool you use,
    and
    > > its version...
    >
    > True, I was refering here to VisualC++. The naming convention has been
    > relatively stable for the last few versions IIRC.
    >
    > There's nothing less portable _on Windows_ than the "standard
    > > C/C++ library", which try to mimic more or less successfully what is
    offered
    > > on Unix/Linux and other POSIX systems...
    >
    > It's not a good idea to code as if other values have any degree of
    > cross-compiler and/or cross-platform stability unless you are explicitly
    coding
    > to a standard which does define them (as I believe POSIX does) in addition
    to
    > standards related to C++ itself.

    Note the words that I underlined: _on Windows_

    The most portable way _on Windows_ to convert between UTF-8 and the
    ACP/OEMCP codepages remains the MultiByteToWideChar API. It is a
    Windows-specific API, but it is supported by all C/C++ compilers for
    Windows, whatever their level of support for locales (most compilers on
    Windows have only very basic locale support, and weak or no support for
    locales other than "C"). So any code that depends on POSIX locale names
    on Windows is very likely to fail, simply because no locale other than
    "C" is supported.
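    To illustrate, here is a minimal sketch of the usual two-pass idiom for
    converting a UTF-8 string to UTF-16 with that API (the helper name and
    the reduced error handling are my own, just for illustration):

    #include <windows.h>
    #include <stdlib.h>

    /* Minimal sketch: convert a NUL-terminated UTF-8 string to UTF-16.
       First call sizes the buffer, second call does the conversion. */
    static wchar_t *utf8_to_utf16(const char *utf8)
    {
        /* With cbMultiByte = -1 the returned length includes the NUL. */
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        if (len == 0)
            return NULL;

        wchar_t *utf16 = (wchar_t *)malloc(len * sizeof(wchar_t));
        if (utf16 == NULL)
            return NULL;

        if (MultiByteToWideChar(CP_UTF8, 0, utf8, -1, utf16, len) == 0) {
            free(utf16);
            return NULL;
        }
        return utf16;
    }

    The same two-pass pattern works with CP_ACP or CP_OEMCP in place of
    CP_UTF8.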

    Of course you can choose which compiler and version to use if you build
    your own binaries. But if you want to make the _source_ code portable,
    you then need a compromise: you must state specifically which compilers
    and versions the source code supports.

    The question from Deepak is then correctly answered: POSIX locales are
    not a great help on Windows, where there is not even a system environment
    setting to define one. You need a POSIX emulation layer that artificially
    infers a POSIX locale from the system locale information exposed by Win32
    APIs such as GetACP() or GetOEMCP(), and from the other APIs that return
    the user's regional settings for language and number/date formatting.
    Such an emulation layer is built into the Windows port of the Java VM and
    its core libraries.
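    To give an idea of what such an emulation layer has to work from, here is
    a rough sketch (my own illustration, not the actual code of any runtime)
    that queries the Win32 APIs for the codepages and the user's language,
    from which a POSIX-like locale name could be synthesized:

    #include <windows.h>
    #include <stdio.h>

    /* Sketch: gather the Win32 locale information from which a POSIX-style
       locale name (e.g. "fr_FR.Cp1252") could be synthesized. */
    int main(void)
    {
        char lang[9], country[9];

        UINT acp   = GetACP();    /* ANSI codepage, e.g. 1252 */
        UINT oemcp = GetOEMCP();  /* OEM (console/DOS) codepage, e.g. 850 */

        /* ISO 639 language code and ISO 3166 country code of the user locale. */
        GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_SISO639LANGNAME,
                       lang, sizeof(lang));
        GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_SISO3166CTRYNAME,
                       country, sizeof(country));

        printf("%s_%s.Cp%u (console: Cp%u)\n", lang, country, acp, oemcp);
        return 0;
    }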

    Of course you can use the wcs* and mbs* functions on Windows, but
    conversion between character encodings is something you must build
    yourself, or obtain from the support functions built into the standard C
    library of a specific compiler and version, which most of the time will
    only be able to convert between the ACP codepage (used by the mbs*
    functions and the ANSI versions of the Win32 APIs) and UTF-16 (used by
    the wcs* functions and the _UNICODE versions of the Win32 APIs). Standard
    C libraries for Windows that are based on (char*) strings assume most of
    the time that filenames are given in the local ANSI codepage (see the
    result of GetACP()), that output to a console or to DOS-emulation
    functions uses the OEMCP, and that calls to _UNICODE versions of Win32
    APIs with (wchar_t*) strings use UTF-16.
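    For completeness, the standard-library route looks roughly like this on a
    typical Windows C runtime (helper name is my own); it only converts
    between the ACP multibyte encoding and wide characters, which is exactly
    the limitation described above:

    #include <locale.h>
    #include <stdlib.h>

    /* Sketch: setlocale("") selects the user's default ANSI codepage (ACP),
       and mbstowcs() then converts from that codepage to wide characters.
       UTF-8 is not an option here. */
    static size_t acp_to_wide(const char *src, wchar_t *dst, size_t dstlen)
    {
        setlocale(LC_ALL, "");             /* use the user's default ACP locale */
        return mbstowcs(dst, src, dstlen); /* returns (size_t)-1 on invalid input */
    }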

    Thankfully, the MultiByteToWideChar() API (and its reverse,
    WideCharToMultiByte()) works correctly even on Windows 95, provided that
    it is limited to converting between UTF-16 on one side and the ACP, the
    OEMCP, or UTF-8 on the other side.
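    The reverse direction deserves one caution: when converting to the ACP or
    OEMCP, characters outside the codepage are silently replaced, so it is
    worth asking the API whether a default character was used. A minimal
    sketch (helper name is my own):

    #include <windows.h>

    /* Sketch: convert UTF-16 to the ANSI codepage and detect lossy
       conversion. *lossy is set to TRUE when at least one character could
       not be represented in the target codepage. (These two last arguments
       must be NULL when converting to CP_UTF8.) */
    static int utf16_to_acp(const wchar_t *utf16, char *buf, int buflen, BOOL *lossy)
    {
        return WideCharToMultiByte(CP_ACP, 0, utf16, -1,
                                   buf, buflen,
                                   NULL,    /* use the system default replacement char */
                                   lossy);  /* set to TRUE if replacement occurred */
    }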

    Support for other codepages (including Windows-1252 on non-European
    versions of Windows) is not guaranteed: it won't work on Windows 95, and
    it may work on Windows 2000/NT/XP provided that these extra codepages
    have been installed by the administrator in the Regional Settings control
    panel.
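    Whether a given extra codepage is actually installed can be tested at run
    time, for instance with IsValidCodePage(); a sketch:

    #include <windows.h>
    #include <stdio.h>

    /* Sketch: test whether a conversion table for a given codepage
       (here Windows-1252) is installed before relying on it. */
    int main(void)
    {
        UINT cp = 1252;
        if (IsValidCodePage(cp))
            printf("Codepage %u is installed and usable.\n", cp);
        else
            printf("Codepage %u is not installed on this system.\n", cp);
        return 0;
    }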
