Re: Misuse of 8th bit [Was: My Query]

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Nov 25 2004 - 15:46:51 CST


    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > Whenever an application chooses to use the 8th (or even 9th...) bit
    > of a storage, memory, or networking byte that also holds an
    > ASCII-coded character, whether as a zero, as an even or odd parity
    > bit, or for some other purpose, that is the choice of the
    > application. It does not change the fact that this extra bit (or
    > bits) is not used to encode the character itself. I see this usage
    > as a data structure that *contains* (I don't say *is*) a character
    > code. This is entirely outside the scope of the ASCII encoding
    > itself, which is concerned only with the codes assigned to
    > characters, and only with characters.

    Unfortunately, although *we* understand this distinction, most people
    outside this list will not. And to make things worse, they will use
    language that only serves to blur the distinction.

    For example, the term "8-bit ASCII" was formerly used to mean an 8-bit
    byte that contained an ASCII character code in the bottom 7 bits, and
    where bit 7 (the MSB) might be:

    - always 0
    - always 1
    - odd or even parity

    depending on the implementation. (This was before the 1980s, when
    companies started populating code points 128 and beyond with "extended
    Latin" letters and other goodies, and calling *that* 8-bit ASCII.)

    Implementations would pass these 8-bit thingies around, bit 7 and all,
    and expect them to remain unscathed. Programs that emitted bit 7 = 1
    expected to receive bit 7 = 1. Those that emitted odd parity expected
    to receive odd parity. This was not just a data-interchange convention;
    many of these programs internally processed the byte as an atomic unit,
    parity bit and all. As John Cowan pointed out, on some systems the 8th
    bit was very much considered part of the "character," even though
    according to your model (which I do think makes sense) it is really a
    separate field within an 8-bit-wide data structure.

    > In ASCII, as in all other ISO 646 charsets, code positions are ALL
    > in the range 0 to 127. Nothing is defined outside of this range,
    > just as Unicode does not define or mandate anything for code points
    > larger than 0x10FFFF, whether they are stored or handled in memory
    > with 21-, 24-, 32-, or 64-bit code units, more or less packed
    > according to architecture or network framing constraints.

    This is why it's perfectly legal to design your own TES (transfer
    encoding syntax) or other structure for carrying Unicode (or even
    ASCII) code points. Inside your own black box, it doesn't matter what
    you do, as long as you don't corrupt data. But when communicating
    with the outside world, one needs to adhere to established standards.
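
    As an illustration of that boundary (my own sketch, nothing mandated
    by any standard): the routine at the edge of the black box is where
    everything must fold back into a standard encoding form such as
    UTF-8:

        #include <stdint.h>

        /* Encode a Unicode scalar value as standard UTF-8.  Returns the
           number of bytes written (1-4), or 0 if the value is a
           surrogate or lies outside U+0000..U+10FFFF. */
        int cp_to_utf8(uint32_t cp, unsigned char out[4]) {
            if (cp <= 0x7F) {                  /* the ASCII range */
                out[0] = (unsigned char)cp;
                return 1;
            } else if (cp <= 0x7FF) {
                out[0] = (unsigned char)(0xC0 | (cp >> 6));
                out[1] = (unsigned char)(0x80 | (cp & 0x3F));
                return 2;
            } else if (cp <= 0xFFFF) {
                if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
                out[0] = (unsigned char)(0xE0 | (cp >> 12));
                out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                out[2] = (unsigned char)(0x80 | (cp & 0x3F));
                return 3;
            } else if (cp <= 0x10FFFF) {
                out[0] = (unsigned char)(0xF0 | (cp >> 18));
                out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
                out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                out[3] = (unsigned char)(0x80 | (cp & 0x3F));
                return 4;
            }
            return 0;   /* outside the Unicode definition domain */
        }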

    > Neither Unicode nor US-ASCII nor ISO 646 defines what an
    > application can do there. The code positions or code points they
    > define are *unique* only within their *definition domain*. If you
    > use larger domains for values, nothing in Unicode or ISO 646 or
    > ASCII defines how to interpret the value: these standards will NOT
    > assume that the low-order bits can safely be used to index
    > equivalence classes, because these equivalence classes cannot be
    > defined strictly within the definition domain of these standards.

    What I think you are saying is this (and if so, I agree with it):

    If I want to design a 32-bit structure that contains a Unicode code
    point in 21 of the bits and something else in the remaining 11 -- or
    (more generally) uses values 0 through 0x10FFFF for Unicode characters
    and other values for something different -- I can do so. But I MUST NOT
    represent this as some sort of extension of Unicode, and I MUST adhere
    to all the conformance rules of Unicode inasmuch as they relate to the
    part of my structure that purports to represent a code point. And I
    SHOULD be very careful about passing these around to the outside world,
    lest someone get the wrong impression. Same for ASCII.
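
    A minimal C sketch of such a structure (all names hypothetical, and
    again: this is a private data structure, not an extension of
    Unicode):

        #include <stdint.h>

        #define CP_BITS  21
        #define CP_MASK  0x1FFFFFu   /* low 21 bits hold the code point */

        /* Pack 11 bits of private flags above a 21-bit code point. */
        static uint32_t rec_make(uint32_t cp, uint32_t flags) {
            return ((flags & 0x7FFu) << CP_BITS) | (cp & CP_MASK);
        }
        static uint32_t rec_cp(uint32_t r)    { return r & CP_MASK; }
        static uint32_t rec_flags(uint32_t r) { return r >> CP_BITS; }

        /* The part that purports to be a code point must still obey
           Unicode's conformance rules: at most 0x10FFFF, and not a
           surrogate. */
        static int rec_cp_is_valid(uint32_t r) {
            uint32_t cp = rec_cp(r);
            return cp <= 0x10FFFF &&
                   !(cp >= 0xD800 && cp <= 0xDFFF);
        }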

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


