Re: Misuse of 8th bit [Was: My Querry]

From: Philippe Verdy ([email protected])
Date: Fri Nov 26 2004 - 09:23:38 CST

In reply to: Antoine Leca: "Re: Misuse of 8th bit [Was: My Querry]"
Next in thread: John Cowan: "Re: Misuse of 8th bit [Was: My Querry]"

From: "Antoine Leca" <[email protected]>
> On Thursday, November 25th, 2004 08:05Z Philippe Verdy wrote:
>>
>> In ASCII, or in all other ISO 646 charsets, code positions are ALL in
>> the range 0 to 127. Nothing is defined outside of this range, exactly
>> like Unicode does not define or mandate anything for code points
>> larger than 0x10FFFF, should they be stored or handled in memory with
>> 21-, 24-, 32-, or 64-bit code units, more or less packed according to
>> architecture or network framing constraints.
>> So the question of whether an application can or cannot use the extra
>> bits is left to the application, and this has no influence on the
>> standard charset encoding or on the encoding of Unicode itself.
>
> What you seem to miss here is that given computers are nowadays based on
> 8-bit units, there have been a strong move in the '80s and the '90s to
> _reserve_ ALL the 8 bits of the octet for characters. And what was asking
> A. Freitag was precisely to avoid bringing different ideas about
> possibilities to encode other class of informations inside the 8th bit of
> a ASCII-based storage of a character.

This is true, for example, of an API that just says that a "char" (or
whatever datatype is used in some convenient language) contains an ASCII code
or a Unicode code point, and expects the datatype instance to be equal to
that ASCII code or Unicode code point.
In that case, the API assumes that you can compare "char" instances directly
for equality, instead of extracting and comparing only the effective code
points, and this greatly simplifies programming.
So an API that says that a "char" will contain ASCII code positions should
always assume that only the instance values 0 to 127 will be used; the same
holds if an API says that an "int" contains a Unicode code point.

The problem arises only when the same datatype is also used to store
something else (even if it is just a parity bit or a bit forced to 1).

As long as this is not documented with the API itself, it should not be
used, to preserve the reasonable assumption that the identity of a char is
the identity of its code.

So for me, a protocol that adds a parity bit to the ASCII code of a
character is doing that on purpose, and this should be isolated in the
documented part of its API. If the protocol wants to send this data to an API
or interface that does not document this use, it should remove/clear the
extra bit, to make sure that the character identity is preserved and
interpreted correctly (I can't see how such a protocol implementation can
expect that a '@' character coded as 192 will be correctly interpreted by
the other, simpler interface that expects that all '@' instances will be
equal to 64...)
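
As an illustration of the '@' case, here is a minimal sketch in Python
(chosen only for brevity). It assumes even parity carried in bit 7; the
function names add_even_parity and strip_parity are hypothetical and not
taken from any particular protocol. The point is only that the parity bit
is added and removed inside the protocol layer, so that anything handed to
a plain ASCII interface is back in the range 0 to 127.

```python
def add_even_parity(code: int) -> int:
    """Return the 7-bit ASCII code with an even-parity bit stored in bit 7."""
    if not 0 <= code <= 0x7F:
        raise ValueError(f"not an ASCII code: {code}")
    parity = bin(code).count("1") & 1  # 1 when the code has an odd number of 1 bits
    return code | (parity << 7)


def strip_parity(unit: int) -> int:
    """Verify even parity on an 8-bit unit, then clear bit 7."""
    if not 0 <= unit <= 0xFF:
        raise ValueError(f"not an 8-bit unit: {unit}")
    if bin(unit).count("1") & 1:
        raise ValueError(f"parity error in unit {unit:#04x}")
    return unit & 0x7F


on_the_wire = add_even_parity(ord("@"))  # 64 has a single 1 bit, so bit 7 is set
delivered = strip_parity(on_the_wire)    # what a plain ASCII interface should see
```

With these assumptions, on_the_wire is 192 and delivered is 64 again, so the
simpler interface can keep comparing against '@' directly.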

In safe programming, any unused field in a storage unit should be given a
mandatory default. As the simplest form that preserves the code identity in
ASCII, or the code point identity in Unicode, is the one that uses 0 as this
default, extra bits should be cleared. If not, anything can happen at the
recipient of the "character":

- the recipient may interpret the value as something other than a character,
behaving as if the character data was absent (so there will be data loss, in
addition to unexpected behavior). Bad practice, given that it is not
documented in the recipient API or interface.

- the recipient may interpret the value as another character, or may not
recognize the expected character. It's not clearly a bad programming
practice for recipients, because it is the simplest form of handling for
them. However the recipient will not behave the way expected by the sender,
and it is the sender's fault, not the recipient's fault.

- the recipient may take additional unexpected actions in addition to the
normal handling of the character without the extra bits. It would be a bad
programming practice for recipients if this specific behavior is not
documented, so senders should not need to care about it.

- the recipient may filter/ignore the value completely... resulting in data
loss; this may sometimes be a good practice, but only if this recipient
behavior is documented.

- the recipient may filter/ignore the extra bits (for example by masking);
for me it's a bad programming practice for recipients...

- the recipient may replace the incorrect value with another one (such as a
SUB ASCII control or a U+FFFD Unicode replacement character, to mark the
presence of an error without changing the string length).

- an exception may be raised (so the interface will fail) because the given
value does not belong to the expected ASCII code range or Unicode code point
range (the safest practice for recipients that are working under the
"design by contract" model is to check the value range of all their
incoming data or parameters, to force the senders to obey the contract).
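
Three of the recipient behaviors listed above can be sketched side by side.
This is only an illustration in Python, under the assumption that the
recipient expects pure ASCII in 8-bit units; the function names are
hypothetical. The strict variant is the "design by contract" option, the
substituting variant marks the error with SUB (0x1A) without changing the
length, and the masking variant is the one I consider bad practice, because
it silently maps 192 onto '@'.

```python
SUB = 0x1A  # ASCII SUBSTITUTE control character


def accept_strict(unit: int) -> int:
    """Contract-checking recipient: reject anything outside 0..127."""
    if not 0 <= unit <= 0x7F:
        raise ValueError(f"value {unit} is outside the ASCII range 0..127")
    return unit


def accept_substituting(unit: int) -> int:
    """Recipient that replaces out-of-range values with SUB, keeping the length."""
    return unit if 0 <= unit <= 0x7F else SUB


def accept_masking(unit: int) -> int:
    """Recipient that silently discards the extra bit (shown for contrast)."""
    return unit & 0x7F


incoming = bytes([72, 105, 192])  # 'H', 'i', then '@' with bit 7 set
substituted = [accept_substituting(u) for u in incoming]
masked = [accept_masking(u) for u in incoming]
```

Here substituted keeps the error visible, while masked hides it: the sender
never learns that it broke the contract.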

Don't blindly expect that any interface capable of accepting ASCII codes in
8-bit code units will also transparently accept all values outside of the
restricted ASCII code range, unless the interface explicitly documents how
such values will be handled, and whether this extension adds some
equivalences (for example when the recipient will discard the extra bits)...

The only safe way is then:
- to send only values in the definition range of the standard encoding.
- to not accept values out of this range, by raising a run-time exception.
Run-time checking may sometimes be avoided in some languages that support
value ranges in their datatype definitions; but this requires a new API with
new, explicitly restricted datatypes distinct from the basic character
datatype (Java goes part of the way here: methods of its Character class
such as Character.toChars(int) reject values outside the Unicode code point
range 0..0x10FFFF)...
- to create separate datatype definitions if one wants to pack more
information in the same storage unit (for example by defining bitfield
structures in C/C++, or by hiding this packing within the private
implementation of the storage, not accessible directly without accessor
methods, and not exposing these storage details to the published or public
or protected interfaces), possibly with several constructors (provided only
that the API can also be used to determine if an instance is a character or
not), but with at least an API to retrieve the original unique standard code
from the instance.
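
The last option can be sketched as follows, again in Python and only as an
illustration. The class name TaggedChar and its single flag bit are
hypothetical. The packed unit is kept in an attribute that is private by
convention, the constructor enforces the ASCII range, and an accessor
always returns the original standard code.

```python
class TaggedChar:
    """Hypothetical type: an ASCII code plus one flag bit packed in one 8-bit unit."""

    __slots__ = ("_unit",)

    def __init__(self, code: int, flag: bool = False) -> None:
        if not 0 <= code <= 0x7F:
            raise ValueError(f"not an ASCII code: {code}")
        # The packing is an implementation detail, never exposed to callers.
        self._unit = code | (0x80 if flag else 0)

    @property
    def code(self) -> int:
        """The original standard ASCII code, always in 0..127."""
        return self._unit & 0x7F

    @property
    def flag(self) -> bool:
        """The extra information carried in bit 7."""
        return bool(self._unit & 0x80)


marked = TaggedChar(ord("@"), flag=True)
plain = TaggedChar(ord("@"))
```

Both marked.code and plain.code are 64, so code that only cares about the
character identity never sees the extra bit.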

For C/C++ programs that use the native "char" datatype along with C strings,
the only safe way is to NOT put anything other than the pure standard code in
the instance value, so that one can effectively make sure that '@'==64 in an
interface that is expected to receive ASCII characters.

Same thing for Java which assumes that all "char" instances are regular
UTF-16 code units (this is less a problem for UTF-16, because the whole
16-bit code unit space is valid and has a normative behavior in Unicode,
even for surrogate and non-character code units), or for C/C++ programs
using 16-bit wide code units.

For C/C++ programs that use the ANSI "wchar_t" datatype (which is not
guaranteed to be 16-bit), no one should expect that the extra bits that may
exist on some platforms are usable.

For any language that uses some fixed-width integer to store UTF-32 code
units, the definition domain should be checked by recipients, or the
recipient should document its behavior if other values are possible:

Many applications will accept not only valid code points in 0..0x10FFFF, but
also some "magic" values like -1 which have another meaning (such as the end
of the input stream, or no character available yet). When this happens, the
behavior is (or should be) documented explicitly, because the interface does
not communicate only with valid characters.
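
A minimal sketch of such a documented interface, in Python and with
hypothetical names (CodePointReader, END_OF_INPUT, is_code_point): the
reader returns either a code point in 0..0x10FFFF or the single documented
magic value -1, and callers can test the domain before treating a value as
a character.

```python
END_OF_INPUT = -1          # documented magic value, never a valid code point
MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code point range


def is_code_point(value: int) -> bool:
    """True if value lies in the Unicode code point range 0..0x10FFFF."""
    return 0 <= value <= MAX_CODE_POINT


class CodePointReader:
    """Hypothetical reader over a string, one code point per call."""

    def __init__(self, text: str) -> None:
        self._points = [ord(ch) for ch in text]
        self._pos = 0

    def read(self) -> int:
        """Return the next code point, or END_OF_INPUT (-1) when exhausted."""
        if self._pos >= len(self._points):
            return END_OF_INPUT
        value = self._points[self._pos]
        self._pos += 1
        return value


reader = CodePointReader("@")
first = reader.read()   # the code point of '@'
second = reader.read()  # nothing left, so the documented magic value
```

Because -1 lies outside the code point range, is_code_point separates the
two kinds of return value without ambiguity.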


This archive was generated by hypermail 2.1.5 : Fri Nov 26 2004 - 12:30:20 CST