From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Nov 25 2004 - 14:05:18 CST
From: "Antoine Leca" <Antoine10646@leca-marti.org>
> On Wednesday, November 24th, 2004 22:16Z Asmus Freytag wrote:
>>
>> I'm not seeing a lot in this thread that adds to the store of
>> knowledge on this issue, but I see a number of statements that are
>> easily misconstrued or misapplied, including the thoroughly
>> discredited practice of storing information in the high
>> bit, when piping seven-bit data through eight-bit pathways. The
>> problem with that approach, of course, is that the assumption
>> that there were never going to be 8-bit data in these same pipes
>> proved fatally wrong.
>
> Since I was the person who did introduce this theme into the thread, I feel
> there is an important point that should be highlighted here. The "widely
> discredited practice of storing information in the high bit" is in fact like
> the Y2K problem, a bad consequence of past practices. Only difference is
> that we do not have a hard time limit to solve it.
Whether an application chooses to use the 8th (or even 9th...) bit of a
storage, memory, or networking byte that also holds an ASCII-coded character
as a zero, as an even or odd parity bit, or for any other purpose is the
application's choice. It does not change the fact that this extra bit (or
these extra bits) is not used to code the character itself.
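As a minimal sketch of this point (the function names are mine, not from any
standard), a 7-bit ASCII code can share a byte with an even-parity bit in the
high position, and the character code is recovered by ignoring that bit:

```python
def with_even_parity(code: int) -> int:
    """Pack a 7-bit ASCII code into a byte whose high bit is an even-parity bit."""
    assert 0 <= code <= 0x7F
    parity = bin(code).count("1") & 1   # 1 if the 7 data bits have odd popcount
    return code | (parity << 7)         # high bit makes the total popcount even

def character_code(byte: int) -> int:
    """The extra bit is not part of the character code itself."""
    return byte & 0x7F

packed = with_even_parity(ord("C"))     # 0x43 has three 1-bits, so parity = 1
assert packed == 0xC3
assert character_code(packed) == ord("C")
```

The data structure *contains* a character code; the parity bit is independent
application data living in the same byte.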
I see this usage as a data structure that *contains* (I don't say *is*) a
character code. This is completely out of the scope of the ASCII encoding
itself, which is concerned only with the codes assigned to characters, and
only with characters.
In ASCII, as in all other ISO 646 charsets, code positions are ALL in the
range 0 to 127. Nothing is defined outside of this range, just as Unicode
does not define or mandate anything for code points larger than 0x10FFFF,
whether they are stored or handled in memory with 21-, 24-, 32-, or 64-bit
code units, more or less packed according to architecture or network framing
constraints.
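The two definition domains can be stated as simple predicates (a sketch;
these helper names are mine, but the ranges come from the standards):

```python
def is_ascii_code(value: int) -> bool:
    # ASCII / ISO 646 assigns code positions only in 0..127
    return 0 <= value <= 0x7F

def is_unicode_code_point(value: int) -> bool:
    # Unicode defines code points only in 0..0x10FFFF
    return 0 <= value <= 0x10FFFF

assert is_ascii_code(0x41) and not is_ascii_code(0x80)
assert is_unicode_code_point(0x10FFFF) and not is_unicode_code_point(0x110000)
```

Anything a wider code unit can hold beyond these ranges is simply outside
what the standards speak about.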
So the question of whether an application can or cannot use the extra bits is
left to the application, and this has no influence on the standard charset
encoding or on the encoding of Unicode itself.
So a good question to ask is how to handle values of variables or instances
that are supposed to contain a character code, but whose storage code unit
can accommodate values outside the designated range. For me this is left to
the application, but many applications will simply assume that such a
datatype accepts exactly one code per designated character. Using the extra
storage bits for something else will break this legitimate assumption, so
applications must be specially prepared to handle this case, by filtering
values before checking for character identity.
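Such filtering might look like the following sketch (the mask and helper are
hypothetical, application-specific choices, not anything the standards define):

```python
CODE_MASK = 0x7F  # application-specific: where the character code lives in the unit

def same_character(unit_a: int, unit_b: int) -> bool:
    """Compare character identity only after stripping application-specific bits.

    Comparing raw units would wrongly distinguish 'A' (0x41) from 'A' stored
    with an extra high bit set (0xC1)."""
    return (unit_a & CODE_MASK) == (unit_b & CODE_MASK)

assert same_character(0x41, 0xC1)       # same 'A', different extra bits
assert not same_character(0x41, 0x42)   # 'A' vs 'B'
```

An application that skips this filtering step, and compares raw code units,
is the one making the unwarranted assumption, not the charset standard.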
Neither Unicode nor US-ASCII nor ISO 646 defines what an application can do
there. The code positions or code points they define are *unique* only in
their *definition domain*. If you use a larger domain for values, nothing in
Unicode, ISO 646, or ASCII defines how to interpret the value: these
standards will NOT assume that the low-order bits can safely be used to
index equivalence classes, because those equivalence classes cannot be
defined strictly within the definition domain of these standards.
So I see no valid rationale for requiring applications to clear the extra
bits, to leave the extra bits unaffected, or to necessarily interpret the
low-order bits as valid code points.
We are outside the definition domain, so any larger domain is
application-specific, and applications may just as well use ASCII or Unicode
within storage code units that add an offset, multiply the standard codes by
a constant, or apply a reordering transformation (a permutation) to them and
to other possible non-character values.
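To illustrate (a sketch under my own naming; the standards mandate none of
this), any reversible transformation over the storage units preserves the
identity of the standard codes, whether it is an offset or an arbitrary
permutation:

```python
import random

OFFSET = 0x100  # hypothetical application-chosen shift

def encode_offset(code: int) -> int:
    return code + OFFSET

def decode_offset(unit: int) -> int:
    return unit - OFFSET

# An arbitrary permutation works too: any bijection over the storage units
# is fine, as long as the application can map back to the standard codes.
rng = random.Random(42)          # fixed seed so the permutation is reproducible
perm = list(range(128))
rng.shuffle(perm)                # perm[code] -> storage unit
inverse = [0] * 128
for code, unit in enumerate(perm):
    inverse[unit] = code         # inverse[unit] -> code

for code in range(128):
    assert decode_offset(encode_offset(code)) == code
    assert inverse[perm[code]] == code
```

What matters is only that the round trip recovers the unique code position;
the intermediate representation is the application's business.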
When ASCII and ISO 646 in general define a charset with 128 unique code
positions, they do not say how this information will be stored (an
application may well need to use 7 distinct bytes, or other structures, not
necessarily consecutive, to *represent* the unique codes that identify ASCII
or ISO 646 characters), and they do not restrict the use of these codes
separately from any other independent information (such as parity bits, or
anything else). Any storage structure that preserves the identity and
equivalences of the original standard code within its definition domain is
equally valid as a representation of the standard, but such a structure is
out of the scope of the charset definition.
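Even the extreme case mentioned above, one byte per bit, is a valid
representation as long as the round trip preserves identity (a deliberately
exaggerated sketch, with names of my own choosing):

```python
def scatter(code: int) -> list[int]:
    """Represent a 7-bit code as 7 separate bytes, one bit per byte."""
    return [(code >> i) & 1 for i in range(7)]

def gather(bits: list[int]) -> int:
    """Reassemble the unique code position from the scattered bits."""
    return sum(bit << i for i, bit in enumerate(bits))

for code in range(128):
    assert gather(scatter(code)) == code  # identity of every code is preserved
```

The charset defines only the 128 abstract code positions; how those positions
are laid out in storage is entirely the representation's concern.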
This archive was generated by hypermail 2.1.5 : Thu Nov 25 2004 - 14:09:55 CST