RE: New FAQ page

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 00:54:28 CDT

    Doug Ewell
    > Sent: Saturday, October 13, 2007 06:01
    > To: Unicode Mailing List
    > Subject: Re: New FAQ page
    >
    > Peter Constable <petercon at microsoft dot com> wrote:
    >
    > > Actually, I think what's happening is that "?" is used as the default
    > > code page mapping for characters not supported in a code page. So, if
    > > an app takes Unicode data and pumps it into (say) code page 1252, then
    > > a character like U+0915 will map into 0x3F "?".
    >
    > This is explicitly specified in the "best fit" mappings on the Unicode
    > site, which are based on .NET behavior, as Peter knows:
    >
    > CPINFO 1 0x3f 0x003f ;Single Byte CP, Default Char = Question Mark

    Note, however, that the code page conversion API in the Windows SDK lets an
    application specify the behaviour for unmapped characters. Mapping to a
    question mark is only the default of this API; other behaviours are
    possible, such as returning an error or exception, or using another default
    mapping (for example a SUB control).
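
    The behaviours listed above can be tried outside Windows as well. The
    Python sketch below is only an analogy, not the Windows SDK API: it relies
    on Python's codec error handlers, and the handler name "sub" and the
    function sub_handler are made up for this illustration.

```python
import codecs

TEXT = "a\u0915b"  # U+0915 DEVANAGARI LETTER KA has no mapping in code page 1252

# Default-character behaviour: the unmapped character becomes 0x3F "?".
replaced = TEXT.encode("cp1252", errors="replace")

# Error behaviour: the conversion fails instead of substituting anything.
try:
    TEXT.encode("cp1252", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True


def sub_handler(err):
    """Hypothetical handler: map each unencodable character to SUB (U+001A)."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    # Return the replacement text and the position where encoding resumes.
    return ("\x1a" * (err.end - err.start), err.end)


codecs.register_error("sub", sub_handler)
substituted = TEXT.encode("cp1252", errors="sub")

print(replaced, strict_failed, substituted)
```

    The custom handler plays the role of "another default mapping": the caller,
    not the code page table, decides what an unmapped character turns into.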

    For online information about the Windows SDK API, you now need to look in
    the .Net documentation, where it is documented under the "Microsoft.Win32"
    namespace (plus many references to the "System" namespace for basic
    datatypes and structures, and to many of its sub-namespaces for non-core
    services). This makes the Windows API difficult to use if you don't want
    .Net (for example when you just want to program in C or C++).

    But if we just consider the .Net API, here is the relevant one in the
    "System.Text" namespace:

    namespace System.Text
    {
        // Declaration only, summarized from the documentation (no body shown).
        public abstract class Encoding
        {
            public static Encoding GetEncoding(
                int codepage,
                EncoderFallback encoderFallback,
                DecoderFallback decoderFallback);
        }
    }

    This is a static factory that returns an encoder/decoder pair configured
    with one of two kinds of fallback: replacement fallbacks (where the
    replacement is not limited to one character, but may be any string in the
    target encoding) and exception fallbacks.

    There are some examples in:
    http://msdn2.microsoft.com/fr-fr/library/system.text.decoderreplacementfallback(VS.80).aspx

    and in:
    http://msdn2.microsoft.com/fr-fr/library/system.text.encoderreplacementfallback(VS.80).aspx

    (But no information is given about how the .Net core library maps these
    methods to the Win32 API, which has similar services. Those are now more
    difficult to find, except in the legacy header files provided with Visual
    C/C++. It seems that Windows will progressively abandon the documentation
    of its native core API and move everything to .Net, which will remain the
    only documented and stable/portable API, complicating the work of C/C++
    developers who don't know how .Net works.)

    Note that .Net uses the (quite misleading) class name "UnicodeEncoding" to
    actually mean the UTF-16LE encoding; the other Unicode-defined encodings
    which are predefined in the Microsoft .Net core library are named
    "BigEndianUnicode" (UTF-16BE), "UTF32Encoding", "UTF8Encoding" and
    "UTF7Encoding".

    Nothing in the definition of the .Net library indicates which internal
    encoding is used, because even the "Char" datatype is a type whose internal
    representation is hidden. We only know the minimum and maximum values of
    this datatype, expressed through some underlying integer datatype
    ("wchar_t" in C/C++, "char" in J# and VB), which is not necessarily the
    same internal datatype used for storing strings. However, I can't see how
    it can store more than 16 unsigned bits, and the .Net documentation is
    really misleading when it says that a "char" in .Net represents "a Unicode
    character": in fact it cannot represent a single Unicode character outside
    the BMP without using TWO "char" values (necessarily a surrogate pair).
    This is reflected in the Length property of the System.String class...
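
    The gap between 16-bit code units and Unicode characters is easy to check.
    The sketch below uses Python rather than .Net, so it is only an analogy:
    Python strings count code points, and the 16-bit units that a .Net "char"
    would hold are obtained here by encoding to UTF-16LE. The helper name
    utf16_units is made up for this example.

```python
def utf16_units(text: str) -> int:
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp = "\u0915"                # inside the BMP: a single code unit
supplementary = "\U00010400"  # U+10400, outside the BMP: needs a surrogate pair

# Split the UTF-16LE bytes of the supplementary character into its two units.
units = supplementary.encode("utf-16-le")
high = int.from_bytes(units[0:2], "little")
low = int.from_bytes(units[2:4], "little")

print(len(supplementary), utf16_units(bmp), utf16_units(supplementary))
print(hex(high), hex(low))
```

    One code point outside the BMP therefore occupies two 16-bit units, which
    is what a length counted in 16-bit units reports.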

    Really, if you read the .Net documentation, its terminology does not match
    the Unicode definitions of the same terms (the definitions of the "char"
    and "string" datatypes being the most confusing).

    The actual conversion from strings to arrays of bytes is now performed by a
    method of the Encoding class (overridden in each of its implementing
    classes, where fallbacks are used and are also overridable). However, no
    fallback will ever be called when converting ***to*** one of the
    Unicode-based encodings, i.e. in Unicode-based encoders (the reverse is not
    true for decoders, which parse a sequence of bytes into the internal
    sequence of chars).
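
    This asymmetry between encoders and decoders can be illustrated with a
    short Python sketch (an analogy to the behaviour described, not the .Net
    API). It assumes a well-formed input string: every such string encodes to
    UTF-8 without substitution, while decoding arbitrary bytes may hit an
    invalid sequence and need a replacement.

```python
text = "a\u0915\U00010400"

# Encoding to a Unicode-based encoding: nothing is unmappable, so the
# strict handler never fires and the round trip is lossless.
encoded = text.encode("utf-8", errors="strict")
round_trip_ok = encoded.decode("utf-8") == text

# Decoding is different: the byte 0xFF can never appear in UTF-8, so the
# decoder must either fail or substitute U+FFFD REPLACEMENT CHARACTER.
bad_bytes = b"a\xffb"
decoded = bad_bytes.decode("utf-8", errors="replace")

try:
    bad_bytes.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(round_trip_ok, repr(decoded), strict_failed)
```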

    Follow the other links to get the list of other supported code pages (only
    the Unicode-based encodings are part of the .Net core; all others are
    supported through encoding definitions installed on the system, or defined
    by the application by implementing Encoding within your own classes).


