RE: Chapter on character sets

From: Mike Brown (mbrown@corp.webb.net)
Date: Thu Jun 15 2000 - 20:57:25 EDT


Warning: this email is full of opinions.

Keld Jørn Simonsen wrote:
> AFAIK the 1968 version of ASCII also included control
> characters 00-1F and 7F.
>
> ISO 646 did not include national variants, but provided for them.
> 12 positions were unassigned. Then a number of national bodies
> made their national variants.
>
>> some leeway is allowed for currency symbols: hex position 23
>> can be # or £, and 24 can be $ or ¤.
>
> Yes, 646 had positions 23 and 24 as you describe.

Ah, okay. I get it now.

> 8859-1 also defines 20-7E.

Argh. I thought so, but someone responding to an earlier draft on this list
a few months ago said 8859-1 only defined A0-FF.

Thank you for all the clarifications.

It is frustratingly difficult to find this kind of definitive information
when the Internet RFCs refer to expensive ISO publications. It would seem to
undermine the intent of standardization, especially on the Internet, to hold
the standards for ransom. It is as if the IETF says "We think everybody
should be following these standards on the Internet. If you want a copy of
the standards please send hundreds of dollars to Switzerland and someone
will mail you a set of paperweights." Ridiculous.

>> among other things, it [ISO/IEC 10646-1] introduces a
>> distinction between the assignment of characters to numbers,
>> and the conversion of numbers to sequences of bytes or
>> other fixed-bit-width code values.
>
> There are a number of character sets, quite old, that had 2 bytes
> per character.
>
> 10646 is not defined as you describe it; the distinction is
> not described in 10646.

I didn't think it was, but I am not in disagreement, either. What I am
saying is that, without actually seeing the ISO standards, as far as I can
tell, ISO 646 and its national variants, and even 10646-1/Unicode, define a
mapping of abstract characters to sequences of fixed-bit-width code values.
The sequences are usually sequences of one code value, but as you say,
sometimes two, and the code values are usually 8-bit bytes, but sometimes
7-bit or other uncommon widths.

The concept of intermediary scalar values ("these characters have numeric
values; the values can be encoded in multiple ways") may have existed, but
it seems not to have been acknowledged as significant until 10646/Unicode
and the UTFs came about.
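
To illustrate the distinction I'm drawing, here is a quick sketch in modern
Python (purely illustrative; none of it comes from the ISO text). One
abstract character maps to one intermediary scalar value, which can then be
codified into several different code value sequences:

    ch = "\u00e9"                  # LATIN SMALL LETTER E WITH ACUTE
    scalar = ord(ch)               # the intermediary scalar value
    print(hex(scalar))             # 0xe9

    # The same scalar value, codified three different ways:
    print(ch.encode("latin-1"))    # b'\xe9'        (one 8-bit code value)
    print(ch.encode("utf-8"))      # b'\xc3\xa9'    (two octets)
    print(ch.encode("utf-16-be"))  # b'\x00\xe9'    (one 16-bit code value)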

It would be infinitely easier to explain and understand if the standards
assigned characters directly to scalar values and then provided the encoding
forms and schemes as the means of codifying those values into
computer-friendly code value sequences. It seems rather convoluted for the
primary assignments to be made to UTF-16 code value sequences, with scalar
values mentioned almost as an afterthought. "Oh yeah, you can derive scalar
values from these code value sequences. *snort*"
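
For the record, that derivation is simple arithmetic on the surrogate pair.
A sketch in Python (the helper and its name are mine, not the standard's):

    def scalar_from_utf16(high, low):
        # Derive a scalar value from a UTF-16 surrogate pair.
        # Assumes 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    # D801 DC37 is the UTF-16 code value sequence for U+10437:
    print(hex(scalar_from_utf16(0xD801, 0xDC37)))  # 0x10437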

> I believe UTF-8 is 1 to 6 octets.

UTF-8 as an algorithm applied to values in a 31-bit code space allows for up
to 6 octets, yes, but as an algorithm for use with the UCS, only 4, at most,
are needed, now that 0x10FFFF is the maximum scalar value. That's why I said
4.
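
To put numbers on it, a sketch in Python (the octet ranges are the ones from
RFC 2279's definition of UTF-8; the helper is mine):

    def utf8_octets(scalar):
        # Octets needed to encode a scalar value in UTF-8 (RFC 2279 ranges).
        for octets, limit in enumerate(
                (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF), 1):
            if scalar <= limit:
                return octets
        raise ValueError("beyond the 31-bit code space")

    print(utf8_octets(0x10FFFF))    # 4 -- the most the UCS now needs
    print(utf8_octets(0x7FFFFFFF))  # 6 -- only if you use all 31 bits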

-Mike


