Re: Misuse of 8th bit [Was: My Querry]

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Nov 25 2004 - 14:05:18 CST


    From: "Antoine Leca" <Antoine10646@leca-marti.org>
    > On Wednesday, November 24th, 2004 22:16Z Asmus Freytag wrote:
    >>
    >> I'm not seeing a lot in this thread that adds to the store of
    >> knowledge on this issue, but I see a number of statements that are
    >> easily misconstrued or misapplied, including the thoroughly
    >> discredited practice of storing information in the high
    >> bit, when piping seven-bit data through eight-bit pathways. The
    >> problem with that approach, of course, is that the assumption
    >> that there were never going to be 8-bit data in these same pipes
    >> proved fatally wrong.
    >
    > Since I was the person who introduced this theme into the thread, I feel
    > there is an important point that should be highlighted here. The "widely
    > discredited practice of storing information in the high bit" is, in fact,
    > like the Y2K problem, a bad consequence of past practices. The only
    > difference is that we do not have a hard time limit to solve it.

    Whenever an application chooses to use the 8th (or even 9th...) bit of a
    storage, memory, or networking byte that also stores an ASCII-coded
    character, whether as a zero, as an even or odd parity bit, or for some
    other purpose, that is the choice of the application. It does not change
    the fact that these extra bits are not used to code the character itself.
    I see this usage as a data structure that *contains* (I don't say *is*) a
    character code. This is completely out of the scope of the ASCII encoding
    itself, which is concerned only with the codes assigned to characters, and
    only with characters.

    In ASCII, as in all other ISO 646 charsets, code positions are ALL in the
    range 0 to 127. Nothing is defined outside of this range, exactly as
    Unicode does not define or mandate anything for code points larger than
    0x10FFFF, whether they are stored or handled in memory with 21-, 24-, 32-,
    or 64-bit code units, more or less packed according to architecture or
    network framing constraints. So the question of whether an application can
    or cannot use the extra bits is left to the application, and it has no
    influence on the standard charset encoding or on the encoding of Unicode
    itself.
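
    To make the distinction concrete, here is a minimal C sketch of such an
    application-level structure (all names are mine and purely illustrative):
    an octet carrying a 7-bit ASCII code in its low bits, with an even parity
    bit, which is application data and not part of the code, in its 8th bit.

        #include <stdio.h>

        /* XOR of the low 7 bits: 1 if an odd number of them are set. */
        static unsigned even_parity7(unsigned c)
        {
            unsigned p = 0;
            int i;
            for (i = 0; i < 7; i++)
                p ^= (c >> i) & 1u;
            return p;
        }

        /* Pack: ASCII code in bits 0..6, even parity bit in bit 7. */
        static unsigned char pack_with_parity(unsigned char ascii)
        {
            return (unsigned char)((ascii & 0x7Fu) | (even_parity7(ascii) << 7));
        }

        /* Unpack: the character's code is only the low 7 bits. */
        static unsigned char code_of(unsigned char stored)
        {
            return (unsigned char)(stored & 0x7Fu);
        }

        int main(void)
        {
            /* 'C' is 0x43 and has three 1-bits, so the parity bit is set. */
            unsigned char stored = pack_with_parity('C');
            printf("stored=0x%02X code=0x%02X\n",
                   (unsigned)stored, (unsigned)code_of(stored));
            return 0;
        }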

    So a good question to ask is how to handle the values of variables or
    instances that are supposed to contain a character code, but whose
    internal storage can hold values outside the designated range. For me
    this is left to the application, but many applications will simply assume
    that such a datatype accepts exactly one code per designated character.
    Using the extra storage bits for something else will break this
    legitimate assumption, so applications must be specially prepared to
    handle this case, by filtering values before checking for character
    identity.
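
    A minimal C sketch of such filtering, assuming the simple case where the
    8th bit holds unrelated application data (the mask and function names are
    mine):

        #include <assert.h>

        /* The code's definition domain is the low 7 bits. */
        #define CODE_MASK 0x7Fu

        /* Filter the stored value back into the definition domain
           before testing for character identity. */
        static int is_ascii_char(unsigned char stored, char c)
        {
            return (stored & CODE_MASK) == (unsigned char)c;
        }

        int main(void)
        {
            /* The 8th bit is used for something else entirely. */
            unsigned char stored = (unsigned char)(0x80u | 'A');

            assert(stored != 'A');              /* naive comparison fails   */
            assert(is_ascii_char(stored, 'A')); /* filtered comparison works */
            return 0;
        }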

    Neither Unicode nor US-ASCII nor ISO 646 defines what an application can
    do there. The code positions or code points they define are *unique* only
    within their *definition domain*. If you use a larger domain for values,
    nothing in Unicode, ISO 646, or ASCII defines how to interpret the value:
    these standards will NOT assume that the low-order bits can safely be
    used to index equivalence classes, because those equivalence classes
    cannot be defined strictly within the definition domain of these
    standards.

    So I see no valid rationale for requiring applications to clear the extra
    bits, to leave the extra bits unaffected, or to necessarily interpret the
    low-order bits as valid code points. We are out of the definition domain,
    so any larger domain is application-specific, and applications may as
    well use ASCII or Unicode within storage code units that add some offset
    to the standard codes, multiply them by a constant, or apply a reordering
    transformation (a permutation) to them and to other possible
    non-character values.
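
    As a C sketch of that last point, assume an application that merely adds
    a fixed offset (an arbitrary value chosen here for illustration) to every
    standard code; the representation is valid because the mapping is
    injective, so the original code can always be recovered:

        #include <assert.h>

        enum { OFFSET = 0x100 }; /* arbitrary, application-specific */

        /* Map a standard code into the internal representation. */
        static unsigned to_internal(unsigned char code)
        {
            return (unsigned)code + OFFSET;
        }

        /* Map an internal value back into the definition domain. */
        static unsigned char to_standard(unsigned internal)
        {
            return (unsigned char)(internal - OFFSET);
        }

        int main(void)
        {
            unsigned c;
            for (c = 0; c < 128; c++) /* every ASCII code position */
                assert(to_standard(to_internal((unsigned char)c)) == c);
            return 0;
        }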

    When ASCII, and ISO 646 in general, define a charset with 128 unique code
    positions, they don't say how this information will be stored (an
    application may as well need to use 7 distinct bytes, or other
    structures..., not necessarily consecutive, to *represent* the unique
    codes that identify ASCII or ISO 646 characters), and they don't restrict
    the usage of these codes separately from any other independent
    information (such as parity bits, or anything else). Any storage
    structure that preserves the identity and equivalences of the original
    standard codes within their definition domain is equally valid as a
    representation of the standard, but that structure is out of the scope of
    the charset definition.
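
    A deliberately contrived C sketch of that parenthetical remark (the
    function names are mine): the 7 bits of an ASCII code spread across 7
    separate, not necessarily consecutive, bytes. It remains a valid
    representation because the character's identity is recoverable:

        #include <assert.h>

        /* Store one bit of the code per byte. */
        static void spread(unsigned char code, unsigned char out[7])
        {
            int i;
            for (i = 0; i < 7; i++)
                out[i] = (unsigned char)((code >> i) & 1u);
        }

        /* Reassemble the original code from the 7 bytes. */
        static unsigned char gather(const unsigned char in[7])
        {
            unsigned char code = 0;
            int i;
            for (i = 0; i < 7; i++)
                code |= (unsigned char)(in[i] << i);
            return code;
        }

        int main(void)
        {
            unsigned char bits[7];
            spread('Z', bits);
            /* Identity preserved; the structure itself is out of the
               charset's scope. */
            assert(gather(bits) == 'Z');
            return 0;
        }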


