Re: UTF8 locale & shell encoding

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jan 16 2004 - 12:53:34 EST


    From: "Rick Cameron" <Rick.Cameron@businessobjects.com>
    > Unfortunately, you cannot use UTF-8 as the default MBCS code page in
    > Windows. In other words, Windows does not support the equivalent to
    > setting the locale to xxx.UTF-8 in unix.

    Exactly. However, conversion between UTF-8 and UTF-16 (the Windows
    "WideChar" encoding used in the Win32 Unicode API) is supported natively by
    MultiByteToWideChar() and WideCharToMultiByte() with the CP_UTF8 code page,
    as if UTF-8 were an SBCS/DBCS character set, even on Windows 95.
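
    For illustration, here is a minimal sketch in C of that round trip through
    the CP_UTF8 code page, with allocation and error handling kept simple; the
    helper names are only illustrative, not part of any API:

        #include <windows.h>
        #include <stdlib.h>

        /* Convert a NUL-terminated UTF-8 string into a newly allocated
           UTF-16 ("WideChar") string. The caller frees the result. */
        static WCHAR *utf8_to_utf16(const char *utf8)
        {
            int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
            WCHAR *wide = (len > 0) ? (WCHAR *)malloc(len * sizeof(WCHAR)) : NULL;
            if (wide)
                MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
            return wide;
        }

        /* Convert a NUL-terminated UTF-16 string back to UTF-8. */
        static char *utf16_to_utf8(const WCHAR *wide)
        {
            int len = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
            char *utf8 = (len > 0) ? (char *)malloc(len) : NULL;
            if (utf8)
                WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, len, NULL, NULL);
            return utf8;
        }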

    > But the good news is that in Windows (unlike unix), wchar_t always means
    > UTF-16. And UTF-16 is a whole lot more convenient to work with than UTF-8!

    I fully agree there. UTF-16 is really convenient as the main encoding for
    Windows programming and interaction with the Win32 API, especially on
    Windows 95: there, if you kept your text in UTF-8, you would first have to
    convert it to UTF-16 wide chars before you could map it to the ACP (or
    OEMCP) code page, which are often the only ones really supported and
    working in many areas.

    With UTF-16 you need only one conversion, and if you are working on an
    application that needs internationalization, it is simply easier to port it
    from the 8-bit ACP/OEMCP legacy code pages to the "WideChar" UTF-16
    encoding. UTF-16 will not cost you much more in resources than UTF-8 for
    non-English users (after all, if you are internationalizing your
    application, it is legitimate to assume that ASCII will not be the only
    characters it handles; for other European languages the overhead of UTF-16
    compared to UTF-8 is not excessive, whereas the cost of UTF-8 for
    non-European users is prohibitive: UTF-16 competes well with the legacy
    Asian MBCS charsets).

    In conclusion, on Windows, UTF-16 is the encoding that requires the fewest
    conversions in your own code, so you save performance by avoiding
    conversions, allocation of working buffers, and copies of the same string
    in multiple encodings (notably on Windows 95). You will need to worry about
    legacy charsets only if you must call legacy DOS APIs for console apps on
    Windows 95, and even there only one conversion is needed, inside your
    console support layer for standard input and standard output/error (see
    the sketch below).
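
    As an illustration of such a console support layer, here is a minimal
    sketch in C of what the output side could look like: the application keeps
    UTF-16 everywhere and converts exactly once, here, to the OEM code page
    that the legacy console expects (the function name is only illustrative):

        #include <windows.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Single conversion point of the console layer: UTF-16 in, OEM code
           page out. Characters with no OEM mapping become default characters. */
        static void console_write_utf16(FILE *stream, const WCHAR *text)
        {
            int len = WideCharToMultiByte(CP_OEMCP, 0, text, -1, NULL, 0, NULL, NULL);
            char *oem = (len > 0) ? (char *)malloc(len) : NULL;
            if (!oem)
                return;
            WideCharToMultiByte(CP_OEMCP, 0, text, -1, oem, len, NULL, NULL);
            fputs(oem, stream);
            free(oem);
        }

    A symmetric input routine would call MultiByteToWideChar(CP_OEMCP, ...) on
    whatever it reads from standard input.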

    If you need a serialization of UTF-16 for file storage, it can be converted
    on the fly to UTF-8 with very basic code in your file or stream layer,
    without complex buffer management or complex handling of encoding issues.
    The only thing you must take care of is the possible bogus presence of
    unpaired surrogates. This won't affect you immediately if you have never
    used UTF-16 data before and you design your string handling routines to
    preserve surrogate pairs, or if your first step is to support languages
    which still don't need surrogates for characters outside the BMP (i.e.
    today, mostly extended Chinese). You can make sure your program will not
    break there in the future by refusing surrogates on input until you have
    verified that your string handling routines preserve surrogate pairs, even
    when you have to truncate strings to bounded lengths.
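
    As a sketch of how small such a stream-layer serializer can be, here is one
    possible UTF-16-to-UTF-8 writer in C; it replaces any unpaired surrogate
    with U+FFFD so the output stays well-formed (the function name is only
    illustrative):

        #include <stdio.h>

        /* Write n UTF-16 code units to a file as UTF-8, on the fly.
           Unpaired surrogates are replaced by U+FFFD. */
        static void write_utf16_as_utf8(FILE *out, const unsigned short *s, size_t n)
        {
            for (size_t i = 0; i < n; i++) {
                unsigned long cp = s[i];
                if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
                    if (i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                        cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
                        i++;                                 /* consume low surrogate */
                    } else {
                        cp = 0xFFFD;                         /* unpaired high surrogate */
                    }
                } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
                    cp = 0xFFFD;                             /* unpaired low surrogate */
                }
                if (cp < 0x80) {                             /* 1-byte sequence */
                    fputc((int)cp, out);
                } else if (cp < 0x800) {                     /* 2-byte sequence */
                    fputc(0xC0 | (int)(cp >> 6), out);
                    fputc(0x80 | (int)(cp & 0x3F), out);
                } else if (cp < 0x10000) {                   /* 3-byte sequence */
                    fputc(0xE0 | (int)(cp >> 12), out);
                    fputc(0x80 | (int)((cp >> 6) & 0x3F), out);
                    fputc(0x80 | (int)(cp & 0x3F), out);
                } else {                                     /* 4-byte sequence */
                    fputc(0xF0 | (int)(cp >> 18), out);
                    fputc(0x80 | (int)((cp >> 12) & 0x3F), out);
                    fputc(0x80 | (int)((cp >> 6) & 0x3F), out);
                    fputc(0x80 | (int)(cp & 0x3F), out);
                }
            }
        }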


