Re: Chapter on character sets

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Thu Jun 15 2000 - 21:38:24 EDT


On Thu, Jun 15, 2000 at 06:57:25PM -0600, Mike Brown wrote:
> Warning: this email is full of opinions.
>
> Keld Jørn Simonsen wrote:
>
> > 8859-1 also defines 20-7E.
>
> Argh. I thought so, but someone responding to an earlier draft on this list
> a few months ago said 8859-1 only defined A0-FF.
>
> Thank you for all the clarifications.
>
> It is frustratingly difficult to find this kind of definitive information
> when the Internet RFCs refer to expensive ISO publications. It would seem to
> undermine the intent of standardization, especially on the Internet, to hold
> the standards for ransom. It is as if the IETF says "We think everybody
> should be following these standards on the Internet. If you want a copy of
> the standards please send hundreds of dollars to Switzerland and someone
> will mail you a set of paperweights." Ridiculous.

I agree with you. Furthermore, I recently found out that the
surplus on sales of IT standards from ISO and its
member bodies is less than USD 1 million a year.

In fact, the IETF has requested that the ISO character set
standards be made freely available on the net. ISO is processing
this request and has gone quite far with approving it, so
it should happen eventually. I believe JTC1 has approved this,
but it still has to be approved at the highest levels of ISO and IEC.

Anyway drafts of most of the ISO character set standards
are freely available from the SC2 site at http://www.dkuug.dk/jtc1/sc2/

Also the ISO 2375 character set registry is freely available -
should be obtainable via http://www.dkuug.dk/jtc1/sc2/wg3/
>
> >> among other things, it [ISO/IEC 10646-1] introduces a
> >> distinction between the assignment of characters to numbers,
> >> and the conversion of numbers to sequences of bytes or
> >> other fixed-bit-width code values.
> >
> > There are a number of character sets, quite old, that had 2 bytes
> > per character
> >
> > 10646 is not defined like you describe it, the distinction is
> > not described in 10646.
>
> I didn't think it was, but I am not in disagreement, either. What I am
> saying is without actually seeing the ISO standards, as far as I can tell,
> ISO 646 and its national variants, and even 10646-1/Unicode, define a
> mapping of abstract characters to sequences of fixed bit-width code values.

This is not true for the 6937 family of charsets, or for
charsets like Shift-JIS, which are variable-length encodings.
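
To illustrate the variable-length point, here is a minimal Python sketch.
It assumes Python's built-in "shift_jis" codec is a fair stand-in for the
encoding; the function name sjis_lengths is only for illustration.

```python
# Sketch: count how many bytes each character takes in Shift-JIS.
# Assumption: Python's built-in "shift_jis" codec represents the encoding.
def sjis_lengths(text):
    """Return the encoded byte length of each character in text."""
    return [len(ch.encode("shift_jis")) for ch in text]

# LATIN CAPITAL LETTER A, then HIRAGANA LETTER A (U+3042)
print(sjis_lengths("A\u3042"))
```

The two characters come out with different byte lengths, so the encoding
cannot be described as sequences of one fixed width.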

> The sequences are usually sequences of 1, but as you say, sometimes 2, and
> the code values are usually 8-bit bytes, but sometimes 7-bit or other
> uncommon widths.
>
> The concept of intermediary scalar values ("these characters have numeric
> values, the values can be encoded in multiple ways") may have existed, but
> seems to have not been acknowledged as being significant until 10646/Unicode
> and UTFs came about.

Well, I am not so sure this is a productive concept, or even a true one.
It does not easily explain, e.g., the 6937 sets or the Shift-JIS encoding, IMHO.
>
> It would be infinitely easier to explain and understand if the standards
> assigned characters directly to scalar values and then provided the encoding
> forms and schemes as the means of codifying the values into
> computer-friendly code value sequences. It seems rather convoluted for the
> primary assignments to be made to UTF-16 code value sequences, and
> mentioning scalar values almost as an afterthought. "Oh yeah, you can derive
> scalar values from these code value sequences. *snort*"

I agree with you there.
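
For what it is worth, the derivation Mike is snorting at looks roughly
like this for a UTF-16 surrogate pair. This is only a sketch of the
arithmetic; the function name scalar_from_surrogates is made up here.

```python
# Sketch: derive a scalar value from a UTF-16 surrogate pair.
# The function name is hypothetical; the arithmetic is the usual
# high/low surrogate combination.
def scalar_from_surrogates(high, low):
    """Combine a high and a low surrogate code value into one scalar value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    # Each surrogate carries 10 bits; the result starts above the BMP.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(scalar_from_surrogates(0xD800, 0xDC00)))  # lowest pair
print(hex(scalar_from_surrogates(0xDBFF, 0xDFFF)))  # highest pair
```

The highest pair gives 0x10FFFF, which is where the maximum scalar value
mentioned further down comes from.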
>
> > I believe UTF-8 is 1 to 6 octets.
>
> UTF-8 as an algorithm applied to values in a 32-bit code space allows for up
> to 6 octets, yes, but as an algorithm for use with the UCS, only 4, at most,
> are needed, now that 0x10FFFF is the maximum scalar value. That's why I said
> 4.

Used on UCS-4 it is up to 6 octets, and this is the definition
of UTF-8 both in ISO 10646 and in the RFC. So if you are
referencing the ISO standard or the RFC, I think it is a
misrepresentation to say only 4.
I know that Unicode defines it as only 4 octets, though.
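
Both numbers fall out of the same length rule. A small sketch, assuming
the 1-to-6 octet form over a 31-bit code space; the name utf8_octets is
only for illustration:

```python
# Sketch: how many octets the 1-to-6 octet form of UTF-8 needs for a value.
# Assumption: the 31-bit code space of UCS-4, with the usual range limits.
def utf8_octets(value):
    """Return the number of octets needed to encode value."""
    if value < 0:
        raise ValueError("negative value")
    # Exclusive upper limits for 1, 2, 3, 4, 5 and 6 octets.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for octets, limit in enumerate(limits, start=1):
        if value < limit:
            return octets
    raise ValueError("value does not fit in 31 bits")

print(utf8_octets(0x10FFFF))    # maximum scalar value reachable via UTF-16
print(utf8_octets(0x7FFFFFFF))  # maximum UCS-4 value
```

So 4 octets is enough up to 0x10FFFF, and the 5 and 6 octet forms are
only needed for UCS-4 values above 0x1FFFFF.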

Kind regards
Keld


