Re: Misuse of 8th bit [Was: My Querry]

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Nov 26 2004 - 09:23:38 CST

    From: "Antoine Leca" <Antoine10646@leca-marti.org>
    > On Thursday, November 25th, 2004 08:05Z, Philippe Verdy wrote:
    >>
    >> In ASCII, or in all other ISO 646 charsets, code positions are ALL in
    >> the range 0 to 127. Nothing is defined outside of this range, exactly
    >> like Unicode does not define or mandate anything for code points
    >> larger than 0x10FFFF, should they be stored or handled in memory with
    >> 21-, 24-, 32-, or 64-bit code units, more or less packed according to
    >> architecture or network framing constraints.
    >> So the question of whether an application can or cannot use the extra
    >> bits is left to the application, and this has no influence on the
    >> standard charset encoding or on the encoding of Unicode itself.
    >
    > What you seem to miss here is that, given that computers are nowadays
    > based on 8-bit units, there was a strong move in the '80s and '90s to
    > _reserve_ ALL 8 bits of the octet for characters. And what A. Freitag
    > was asking for was precisely to avoid spreading ideas about possible
    > ways to encode other classes of information in the 8th bit of an
    > ASCII-based storage of a character.

    This is true, for example, of an API that simply says that a "char" (or
    whatever datatype is convenient in some language) contains an ASCII code or
    a Unicode code point, and expects the datatype instance to be equal to that
    ASCII code or Unicode code point.
    In that case, the assumption of such an API is that you can compare "char"
    instances for equality directly, instead of comparing only the effective
    code points, and this greatly simplifies programming.
    So an API that says that a "char" will contain ASCII code positions should
    always assume that only the instance values 0 to 127 will be used; the same
    applies if an API says that an "int" contains a Unicode code point.

    The problem arises only when the same datatype is also used to store
    something else (even if it is just a parity bit, or a bit forced to 1).

    As long as such a use is not documented in the API itself, it should be
    avoided, in order to preserve the reasonable assumption that identical
    characters have identical codes.

    So for me, a protocol that adds a parity bit to the ASCII code of a
    character is doing that on purpose, and this should be isolated in a
    documented part of its API. If the protocol wants to send this data to an
    API or interface that does not document this use, it should remove/clear
    the extra bit, to make sure that the character identity is preserved and
    interpreted correctly. (I can't see how such a protocol implementation can
    expect that a '@' character coded as 192 will be correctly interpreted by
    a simpler interface that expects all '@' instances to be equal to 64...)
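
    A minimal C sketch of that sender-side responsibility (the helper name is
    illustrative, not part of any existing API): the protocol clears its extra
    bit before handing the byte to an interface that only documents pure
    7-bit ASCII codes.

        #include <stdio.h>

        /* Hypothetical sender-side helper: clear the 8th (parity) bit so that
           the value handed over is the pure ASCII code again. */
        static unsigned char strip_parity(unsigned char transmitted)
        {
            return transmitted & 0x7F;
        }

        int main(void)
        {
            unsigned char with_parity = 192;   /* '@' (64) with the 8th bit set */
            unsigned char pure = strip_parity(with_parity);

            /* The receiving interface can now rely on '@' == 64. */
            printf("%d == '@' ? %s\n", pure, pure == '@' ? "yes" : "no");
            return 0;
        }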

    In safe programming, any unused field in a storage unit should be given a
    mandatory default. Since the simplest choice that preserves the code
    identity in ASCII, or the code point identity in Unicode, is to use 0 as
    this default, extra bits should be cleared. If they are not, anything can
    happen on the recipient's side of the "character":

    - the recipient may interpret the value as something other than a
    character, behaving as if the character data were absent (so there will be
    data loss, in addition to unexpected behavior). This is bad practice, given
    that it is not documented in the recipient's API or interface.

    - the recipient may interpret the value as another character, or may fail
    to recognize the expected character. This is not clearly a bad programming
    practice for recipients, because it is the simplest form of handling for
    them. However, the recipient will not behave the way the sender expects,
    and that is the sender's fault, not the recipient's.

    - the recipient may take additional, unexpected actions on top of the
    normal handling of the character without the extra bits. This would be bad
    programming practice for recipients if the specific behavior is not
    documented, so senders should not need to care about it.

    - the recipient may filter/ignore the value completely, resulting in data
    loss; this may sometimes be a good practice, but only if this recipient
    behavior is documented.

    - the recipient may filter/ignore the extra bits (for example by masking);
    for me it's a bad programming practice for recipients...

    - the recipient may substitute another value for the incorrect one (such as
    the ASCII SUB control, or U+FFFD REPLACEMENT CHARACTER in Unicode, to mark
    the presence of an error without changing the string length).

    - an exception may be raised (so the interface will fail) because the given
    value does not belong to the expected ASCII code range or Unicode code
    point range. This is the safest practice for recipients working under the
    "design by contract" model: check the value range of all incoming data and
    parameters, to force senders to obey the contract (sketched below).
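
    As a sketch of that last, contract-checking option in C (the function name
    is hypothetical, and assert() stands in for a run-time exception, which C
    does not have):

        #include <assert.h>
        #include <stdio.h>

        /* Hypothetical recipient that accepts only pure ASCII codes 0..127.
           An out-of-range value violates the contract and aborts at run time,
           forcing senders to clear any extra bits before calling. */
        static void receive_ascii(int code)
        {
            assert(code >= 0 && code <= 127);
            printf("received code %d\n", code);   /* normal handling goes here */
        }

        int main(void)
        {
            receive_ascii(64);    /* '@': accepted */
            receive_ascii(192);   /* '@' with a stray 8th bit: contract failure */
            return 0;
        }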

    Don't blindly expect that any interface capable of accepting ASCII codes in
    8-bit code units will also transparently accept all values outside the
    restricted ASCII code range, unless its documentation explicitly states how
    such a character will be handled, and whether this extension adds some
    equivalences (for example, when the recipient discards the extra bits)...

    The only safe way is then:
    - to send only values in the definition range of the standard encoding;
    - to reject values outside this range, by raising a run-time exception.
    Run-time checking can sometimes be avoided in languages that support value
    ranges in their datatype definitions, but this requires a new API with
    datatypes explicitly more restricted than the basic character datatype (the
    Character class in Java goes in that direction, with methods such as
    isValidCodePoint() that accept only the Unicode code point range
    0..0x10FFFF)...
    - to create separate datatype definitions if one wants to pack more
    information into the same storage unit (for example by defining bitfield
    structures in C/C++, or by hiding this packing within the private
    implementation of the storage, not accessible directly without accessor
    methods, and not exposing these storage details in the published, public or
    protected interfaces), possibly with several constructors (provided that
    the API can also be used to determine whether an instance is a character or
    not), but with at least an API to retrieve the original, unique standard
    code from the instance.
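
    A minimal C sketch of that last point (all names below are hypothetical):
    the packed representation is a bitfield structure, and an accessor is the
    only way to get the original standard code back out of it.

        #include <stdio.h>

        /* Hypothetical packed storage: the character code and a protocol-
           specific extra bit live in separate bitfields, so neither can
           corrupt the other. */
        struct coded_char {
            unsigned code   : 7;   /* pure ASCII code, 0..127 */
            unsigned parity : 1;   /* extra protocol information */
        };

        /* "Constructor" that refuses values outside the ASCII range. */
        static int coded_char_init(struct coded_char *c, int code, int parity)
        {
            if (code < 0 || code > 127)
                return 0;               /* reject: not a pure ASCII code */
            c->code = (unsigned)code;
            c->parity = parity ? 1u : 0u;
            return 1;
        }

        /* Accessor retrieving the original standard code from the instance. */
        static int coded_char_code(struct coded_char c)
        {
            return (int)c.code;
        }

        int main(void)
        {
            struct coded_char c;
            if (coded_char_init(&c, '@', 1))
                printf("%d == '@' ? %s\n", coded_char_code(c),
                       coded_char_code(c) == '@' ? "yes" : "no");
            return 0;
        }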

    For C/C++ programs that use the native "char" datatype along with C
    strings, the only safe way is NOT to put anything other than the pure
    standard code into the instance value, so that one can effectively rely on
    '@' == 64 in an interface that is expected to receive ASCII characters.
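
    For instance (a trivial sketch, nothing more), code like the following is
    only correct when every byte of the string holds the pure code:

        #include <stdio.h>

        /* Counts '@' characters in a C string; this relies on every byte
           holding the pure standard code, so that '@' is stored exactly as
           64. A byte left with its 8th bit set (e.g. 192) would silently
           fail the test. */
        static int count_at_signs(const char *s)
        {
            int n = 0;
            for (; *s != '\0'; s++)
                if (*s == '@')
                    n++;
            return n;
        }

        int main(void)
        {
            printf("%d\n", count_at_signs("user@example.org"));   /* prints 1 */
            return 0;
        }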

    The same applies to Java, which assumes that all "char" instances are
    regular UTF-16 code units (this is less of a problem for UTF-16, because
    the whole 16-bit code unit space is valid and has a normative behavior in
    Unicode, even for surrogate and non-character code units), and to C/C++
    programs using 16-bit code units.

    For C/C++ programs that use the ANSI "wchar_t" datatype (which is not
    guaranteed to be 16-bit wide), no one should expect the extra bits that may
    exist on some platforms to be usable.

    For any language that uses a fixed-width integer to store UTF-32 code
    units, the definition domain should be checked by recipients, or recipients
    should document their behavior when other values are possible:

    Many applications will accept not only valid code points in 0..0x10FFFF but
    also some "magic" values like -1, which have another meaning (such as the
    end of the input stream, or no character available yet). When this happens,
    the behavior is (or should be) documented explicitly, because the interface
    then no longer communicates only valid characters.
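
    The C library's own getchar() is the classic example of this pattern: it
    returns an int rather than a char precisely so that the documented value
    EOF (usually -1) can be distinguished from every valid character. Below is
    a minimal sketch of the same idea for UTF-32 input (the function and macro
    names are hypothetical):

        #include <stdint.h>
        #include <stdio.h>

        #define NO_CODE_POINT (-1)   /* documented "magic" value: end of input */

        /* Hypothetical reader over a buffer of UTF-32 code units: returns
           code points in 0..0x10FFFF, or NO_CODE_POINT once the buffer is
           exhausted. The sentinel lies outside the Unicode range, so it can
           never collide with a real character. */
        static int32_t read_code_point(const uint32_t *buf, size_t len, size_t *pos)
        {
            if (*pos >= len)
                return NO_CODE_POINT;
            return (int32_t)buf[(*pos)++];
        }

        int main(void)
        {
            const uint32_t text[] = { 0x40, 0x10FFFF };  /* '@', last code point */
            size_t pos = 0;
            int32_t cp;

            while ((cp = read_code_point(text, 2, &pos)) != NO_CODE_POINT)
                printf("U+%04X\n", (unsigned)cp);
            return 0;
        }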


