[OT] ASCII support in C/C++ (was: doubt)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jan 10 2004 - 16:59:04 EST

  • Next message: Clark Cox: "Re: [OT] ASCII support in C/C++ (was: doubt)"

    ----- Original Message -----
    From: "John Cowan" <cowan@mercury.ccil.org>
    To: "Philippe Verdy" <verdy_p@wanadoo.fr>
    Cc: <unicode@unicode.org>
    Sent: Saturday, January 10, 2004 7:31 PM
    Subject: Re: doubt

    > Philippe Verdy scripsit:
    >
    > [much useful stuff snipped]
    >
    > > A source-code symbolic character literal like 'A' is not guaranteed to
    > > compile (but it's unlikely that there's no character LATIN CAPITAL
    > > LETTER A in the runtime charset), so be careful with some characters
    > > like '[' which may not exist in all ISO-646 compatible run-time
    > > charsets.
    >
    > There is a concept of the minimal runtime charset: it must include the
    > ASCII letters and digits and some others.

    This is needed to support the ANSI C library, but it is not, I think, a
    requirement of the language itself; by "language" I mean here its compiler
    on a particular platform, which transforms a source file by interpreting it
    in a source charset and converting it into a "runtime" charset.

    Still, the term "runtime" charset is quite confusing, because it just
    designates the charset into which strings and character constants in the
    source file are converted in the binary file, without implying that the
    compiled application will effectively use that charset, or that this will
    be the charset used in the environment where the compiled application is
    run.

    Not all platforms are able to represent every ASCII letter and digit (or
    the other characters in the "invariant" subset of ISO-646 and EBCDIC) in a
    single char, notably 4-bit microcontrollers, even though you could use C or
    C++ to write software that would work on such limited platforms.

    I think that a more modern approach to supporting 4-bit controllers, or
    even a new 32-bit or 64-bit processor that allowed bit-addressable memory,
    would be to port the compiler so that sizeof(char) still equals 1, without
    necessarily meaning that all addressable memory needs to be aligned on char
    boundaries: a pointer to char could then use a physical bit-address
    internally, where incrementing a char pointer in fact adds 8 to the
    pointer.

    The standard C/C++ libraries would still work in such an environment,
    because the required condition "sizeof(char) == 1" does not mean that the
    physical address is incremented by 1; the requirement is just that the
    "char" datatype must be the minimum allocatable unit of memory when using
    malloc()/free(), and that this datatype must be large enough to store at
    least the ASCII uppercase letters, digits and a few symbols (by that
    reasoning alone, a "char" would need to be at least 6 bits).

    Nothing forbids the compiler from adding its own datatype for actual
    separately addressable memory units smaller than a char; for example a
    "__bit" type which would in fact be 1 bit wide, which could not be
    allocated with malloc() and free(), and which would have these properties:
        sizeof(__bit) == 0 (not allocatable by malloc()/free()), but
        __bitsizeof(__bit) == 1, and
        __bitsizeof(char) == 8, and
        (char*)(charArray + 1) - (char*)(charArray) == 1 as expected, but also
        (__bit*)(charArray + 1) - (__bit*)(charArray) == 8;
    and with the possibility to create arbitrary pointers to this "smaller than
    char" datatype. For safety, the processor could require some memory
    alignments when handling data larger than a single memory unit. If
    necessary, the "__size_t" datatype could actually be a fixed-point number,
    whose conversion from/to a standard integral type would include a shift
    operation, but that would allow defining:
        __sizeof(__bit) == (__size_t)0.125
        __sizeof(char) == (__size_t)1.000 == 8 * __sizeof(__bit)

    So the minimum 8 bits needed to support char in most programs and
    libraries would continue to work without modification to the source code,
    even if it's an artificial construct of the compiler which hides the
    details of the way addresses are internally computed.

    On such a platform, it would then still be possible to support an
    ASCII-compatible "runtime" charset, as well as UTF-8 and the other classic
    Unicode encodings, such as UTF-16 used with a 16-bit "short" definition of
    "wchar_t", or UTF-32 mapped to a 32-bit "long".

    On a bit-addressable platform, a wchar_t datatype could as well be defined
    as a single 21-bit code unit if there are no alignment constraints for
    reading/writing words made of multiple memory units with distinct
    addresses: incrementing a wchar_t pointer would physically add 21 to that
    pointer...

    There are lots of ways for a compiler to maintain the preconditions on
    chars needed to support ANSI C and a minimum "runtime" charset, even if the
    platform allows accessing units smaller than a char.



    This archive was generated by hypermail 2.1.5 : Sat Jan 10 2004 - 17:31:38 EST