Re: CodePage Information

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 22 2003 - 06:06:28 EDT

    From: "Abdij Bhat" <Abdij.Bhat@kshema.com>
    > Hi,
    > We have a GUI software that is running on win NT/2K/XP/95/98/ME. The
    > software is developed using Visual Studio 6 and has a common source code
    > base for all supported Win OS.
    > The GUI software interacts with a hardware to get running data to be
    > displayed on the UI. The GUI is to be internationalized. The Hardware does
    > not understand UNICODE and can understand only ASCII. The minimum language
    > support of the GUI software are Chinese, Dutch, French, German, Italian,
    > Japanese, Portuguese, Spanish, UK Eng.
    > We tried having our own protocol (including Unicode BOM) to achieve this.
    > What we did was to push a byte stream of Unicode characters. This did not
    > work well since the hardware truncated the Unicode string to the first NULL
    > character received!
    > Then we tried to Ascii encode the string (i.e. if the Unicode string was
    > "In", the Unicode string would be "42 00 69 00", the ASCII encoded string
    > would be "34 32 30 30 36 39 30 30". Thus we are encoding 4 as 34 (ASCII
    > value), 2 as 32, 0 as 30 etc..). But I do not like the idea. This has
    > various pitfalls. ( Interested parties can contact me for more details!).
    > Also the problem is that the Hardware is an intelligent device and stores
    > some of the Unicode Strings. Since previous versions of GUI were Ascii and
    > there are huge installations of the software/hardware, chances are that
    > there will be hardwares having Ascii strings in them even if we migrate the
    > UI to Unicode. We should be thus able to handle a mix of Unicode and Ascii
    > characters.
    >
    > Hence after a typical brainstorming session we decided to not throw Unicode
    > strings at the hardware at all, instead convert them to Ascii as send. We
    > thought of using WideCharToMultiByte() and MultiByteToWideChar() API's to do
    > the same. These functions use CodePage to do the conversions.
    > Do you think it is a wise idea to use this method?

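    Use the very simple UTF-8 encoding scheme (or the even more compact SCSU or BOCU-1 encodings) to serialize Unicode strings to your device, and you won't suffer from the NULL byte problem you have with byte encoding schemes derived from UTF-16 (the encoding you probably use in your GUI). UTF-8 never produces a zero byte except for the character U+0000 itself, and it leaves pure ASCII text unchanged, so the ASCII strings already stored in deployed devices remain valid.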

    The transformation between UTF-8 and UTF-16 is extremely simple and can be computed without ever needing a conversion table.
    The Windows API already provides this conversion, but you can also write your own in a few dozen lines of C, C++, C# or Java code at most...
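
    As a rough sketch of how small such a converter can be, the UTF-16 to UTF-8 direction might look like this in C. The function name utf16_to_utf8, the buffer convention and the choice of U+FFFD for unpaired surrogates are assumptions made for illustration, not part of any existing API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): converts a NUL-terminated
 * UTF-16 string into a NUL-terminated UTF-8 string.
 * Assumed policy: an unpaired surrogate is replaced by U+FFFD.
 * Returns the number of bytes written (terminator excluded),
 * or (size_t)-1 if the output buffer is too small. */
size_t utf16_to_utf8(const uint16_t *src, uint8_t *dst, size_t dst_size)
{
    size_t n = 0;

    if (dst_size == 0)
        return (size_t)-1;

    while (*src != 0) {
        uint32_t cp = *src++;
        size_t len;

        if (cp >= 0xD800 && cp <= 0xDBFF && *src >= 0xDC00 && *src <= 0xDFFF) {
            /* Valid surrogate pair: combine both units into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*src++ - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* unpaired surrogate: assumed fallback */
        }

        len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + len + 1 > dst_size) /* keep room for the terminator */
            return (size_t)-1;

        if (len == 1) {
            dst[n++] = (uint8_t)cp;
        } else if (len == 2) {
            dst[n++] = (uint8_t)(0xC0 | (cp >> 6));
            dst[n++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (len == 3) {
            dst[n++] = (uint8_t)(0xE0 | (cp >> 12));
            dst[n++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else {
            dst[n++] = (uint8_t)(0xF0 | (cp >> 18));
            dst[n++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            dst[n++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (uint8_t)(0x80 | (cp & 0x3F));
        }
    }
    dst[n] = 0;
    return n;
}
```

    Every byte emitted for a non-ASCII character has its high bit set, so the only zero byte in the output is the terminator, and a device that stops at the first NULL byte still receives the whole string.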

    WideCharToMultiByte() and MultiByteToWideChar() already support special "codepages", available in all localizations of Windows, for the standard "Unicode Transformation Formats (UTF)", such as CP_UTF8. Reread their documentation... Don't use any Windows ANSI or OEM codepage if you want your software to support multiple languages, notably because not all of these codepages are available in all localizations of Windows: this affects Windows 95/98/ME, but also Windows NT/2000/XP, where supplementary codepages must be installed in the Regional Settings.

    All you need to do is to isolate this conversion in the module that communicates with your hardware device which does not support NULL bytes and thus cannot support encoding schemes derived from UTF-16 or UTF-32.

    Be careful with WideCharToMultiByte() and MultiByteToWideChar(): they can fail when a conversion is not possible (they then return 0, and GetLastError() gives the reason), or they can optionally substitute a replacement character (but this replacement option does not work in Windows 95/98/ME, which ignore this parameter or hang). Such errors are returned when there are invalid sequences in the input, and if they are not handled they may break the critical functions that communicate with your hardware device.

    That's why I suggest you use your own simple conversion between UTF-8 and UTF-16: it is simple to implement, and it allows you to manage reasonable "fallbacks" for invalid or illegal character sequences, according to your own standard.
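
    To illustrate what such a fallback policy might look like on the receiving side, here is a minimal sketch of a UTF-8 to UTF-16 decoder in C. The name utf8_to_utf16 and the policy itself (every ill-formed sequence becomes U+FFFD) are assumptions chosen for the example; your own standard may prefer to drop the bytes or to reject the whole string:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): converts a NUL-terminated
 * UTF-8 string into a NUL-terminated UTF-16 string.
 * Assumed fallback: stray continuation bytes, invalid lead bytes,
 * truncated or overlong sequences, encoded surrogates and values above
 * U+10FFFF are each replaced by U+FFFD.
 * Returns the number of 16-bit units written (terminator excluded),
 * or (size_t)-1 if the output buffer is too small. */
size_t utf8_to_utf16(const uint8_t *src, uint16_t *dst, size_t dst_len)
{
    size_t n = 0;

    if (dst_len == 0)
        return (size_t)-1;

    while (*src != 0) {
        uint32_t cp = *src++;
        uint32_t min = 0;   /* smallest code point allowed for this length */
        int extra = 0;      /* number of continuation bytes expected */
        size_t units;

        if (cp < 0x80)                { /* ASCII: nothing more to read */ }
        else if ((cp & 0xE0) == 0xC0) { extra = 1; min = 0x80;    cp &= 0x1F; }
        else if ((cp & 0xF0) == 0xE0) { extra = 2; min = 0x800;   cp &= 0x0F; }
        else if ((cp & 0xF8) == 0xF0) { extra = 3; min = 0x10000; cp &= 0x07; }
        else                          { cp = 0xFFFD; } /* stray or invalid byte */

        for (; extra > 0; extra--) {
            if ((*src & 0xC0) != 0x80) { /* truncated sequence */
                cp = 0xFFFD;
                min = 0;
                break;                   /* the offending byte is not consumed */
            }
            cp = (cp << 6) | (uint32_t)(*src++ & 0x3F);
        }

        /* Reject overlong forms, out-of-range values and surrogates. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;

        units = cp >= 0x10000 ? 2 : 1;
        if (n + units + 1 > dst_len) /* keep room for the terminator */
            return (size_t)-1;

        if (units == 2) {
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            dst[n++] = (uint16_t)cp;
        }
    }
    dst[n] = 0;
    return n;
}
```

    Because the decoder never fails on bad input, a corrupted string read back from the device degrades to a few replacement characters instead of breaking the communication code.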

    -- Philippe.


