Re: doubt

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jan 10 2004 - 08:05:19 EST


    From: "Deepak Chand Rathore" <deepakr@aztec.soft.net>
    > Hi all,
    >
    > Compiler internal encoding might affect the encoding of the hardcoded
    > literals with in a source file
    > As a result after compilation we might interpret wrong characters.
    > If we have hardcoded only ascii literals with in the program( source file)
    > and left the compiler encoding to it's default;
    > Is there any possibility that after compilation, in the object file
    > produced, encoding of literals get affected.
    > (as far i know almost all compiler's default encoding( default locale C in
    > C++ ) is ascii compatible)
    > this problem, i am refering wrt C++
    > Are there any other issues related to this subject , any useful links

    In C/C++, a char (or wchar_t) variable can also be used as an integer
    type. This means that the compiler is not allowed to emit anything other
    than the specified integer constant (or the character constant coded with
    '\xHH' or '\ooo').

    However, things are different for character or string constants (including
    symbolic constants like '\n', but excluding occurrences of '\xHH' and
    '\ooo' in strings): they are interpreted and compiled into integers using
    a compile-time conversion from the source charset to the run-time charset
    (neither of which is necessarily ASCII-based). Such literals are symbolic:
    they stand for integer values that the source text does not specify.
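
    For example (a minimal sketch; the EBCDIC value in the comment is only
    meant to illustrate a non-ASCII execution charset):

        /* The value of a symbolic literal depends on the execution charset
           chosen at compile time; a numeric escape does not. */
        char a = 'A';       /* 0x41 on an ASCII-based target, 0xC1 on EBCDIC */
        char b = '\x41';    /* always the numeric value 0x41 */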

    If you need exact numeric identity at run-time (independently of the
    source or run-time charset) for strings, then use constant arrays of
    integer values, or strings encoded entirely with '\xHH' or '\ooo' escapes,
    instead of symbolic literals like "Abcd" or L"Abcd" (i.e. encode them as
    "\x41\x62\x63\x64" or L"\x41\x62\x63\x64").

    Note that the '\uHHHH' and '\UHHHHHHHH' notations use a compile-time
    conversion from the Unicode charset to the run-time charset. So there's no
    guarantee that the following source-code assertions will be TRUE:
        * ('\u0041' == 0x41) may be false
            if the runtime charset (as specified or inferred at compile-time)
            is EBCDIC, for example;
        * (L'\U00000041' == 0x41) may be false
            for the same reason.
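
    As an illustration, here is a minimal sketch. It uses U+00E9 (LATIN SMALL
    LETTER E WITH ACUTE) rather than a basic-charset letter, because many
    compilers reject '\uHHHH' escapes that name basic source characters;
    whether the test passes depends entirely on the execution charset the
    compiler targets:

        #include <stdio.h>

        int main(void)
        {
            /* L'\u00E9' is converted from Unicode to the wide execution
               charset at compile time, while 0xE9 is a fixed integer.
               The test passes on a Unicode- or Latin-1-based target, but
               may fail (or the literal may not compile) on other targets. */
            printf("L'\\u00E9' == 0xE9 : %s\n",
                   (L'\u00E9' == 0xE9) ? "true" : "false");
            return 0;
        }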

    Note that a C/C++ compiler may support the '\uHHHH' and '\UHHHHHHHH'
    notations but refuse to compile such a literal because there's a
    conversion error (missing mapping) from Unicode to the runtime charset.
    This happens, for example, on Windows when not compiling for UNICODE, with
    the literals '\u0080' or L'\U00000080': they unambiguously designate the
    first C1 control character by its Unicode hexadecimal code point, but that
    character may not exist in the runtime ANSI or OEM charset (the runtime
    charset being selected by a compiler option or by some compiler-specific
    pragmas). So the following may be FALSE:
        * ('\u0041' == '\x41') may be false
            if the runtime charset (as specified or inferred at compile-time)
            is EBCDIC, for example;
        * (L'\U00000041' == '\x41') may be false
            for the same reason.
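
    A hedged sketch of that failure mode, using U+20AC (EURO SIGN) as an
    illustrative character that is a valid '\uHHHH' escape everywhere but is
    missing from many narrow execution charsets (the exact behaviour, an error
    versus a substitution, is compiler-specific):

        /* With a Windows-1252 execution charset this maps to 0x80; with an
           execution charset that lacks the euro sign, the compiler may
           reject the literal or substitute a default character. */
        char euro_narrow = '\u20AC';

        /* Usually compiles when wchar_t is Unicode-based, giving 0x20AC. */
        wchar_t euro_wide = L'\u20AC';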

    And the following source-code assertions may be FALSE depending on
    compiler capabilities and compilation options or pragmas (independently of
    the source or runtime charsets):
        * ('\xFF' == 0xFF) and ('\377' == 0377) may be false
            if the plain char type is signed;
        * ('\xFF' == -1) and ('\377' == -1) may be false
            if the plain char type is unsigned or wider than 8 bits;
        * (L'\xFF' == 0xFF) and (L'\377' == 0377) may be false
            if wchar_t is signed and only 8 bits wide;
        * (L'\xFF' == -1) and (L'\377' == -1) may be false
            if wchar_t is unsigned.
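
    A small sketch of the signedness effect (whether plain char is signed is
    implementation-defined and often switchable by a compiler option):

        #include <stdio.h>

        int main(void)
        {
            /* If plain char is signed and 8 bits wide, '\xFF' promotes to
               the int value -1, so the first line prints "false" and the
               second "true"; with an unsigned plain char it is the other
               way around. */
            printf("'\\xFF' == 0xFF : %s\n",
                   ('\xFF' == 0xFF) ? "true" : "false");
            printf("'\\xFF' == -1   : %s\n",
                   ('\xFF' == -1) ? "true" : "false");
            return 0;
        }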

    But the following source-code assertions will all be TRUE (again
    independently of the source or runtime charsets):
        * ('\x41' == 0x41) and (L'\x41' == 0x41) are true and will compile
            if the char datatype is at least 7 bits wide;
        * ('\177' == 0177) and (L'\177' == 0177) are true and will compile
            if the char datatype is at least 8 bits wide;
        * ('\0' == 0) and (L'\0' == 0) are always true and will always compile.
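
    These can be checked directly (a minimal sketch; both sides of each test
    are plain numeric values, so no charset conversion is involved):

        #include <assert.h>

        int main(void)
        {
            assert('\x41' == 0x41 && L'\x41' == 0x41);
            assert('\177' == 0177 && L'\177' == 0177);
            assert('\0'   == 0    && L'\0'   == 0);
            return 0;
        }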

    Note however that the following source-code assertions will all be TRUE,
    provided they compile:
        * ('\u0041' == L'\U00000041') will always be true if it compiles;
        * ('\u0041' == 'A') will always be true if it compiles;
        * ('\U00000041' == 'A') will always be true if it compiles.

    A source-code symbolic character literal like 'A' is not guaranteed to
    compile (though it's unlikely that a runtime charset lacks LATIN CAPITAL
    LETTER A), so be careful with characters like '[', which may not exist in
    all ISO-646-compatible run-time charsets.


