Re: (TC304WG4.50) Charset vs. codeset

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Sat Oct 04 1997 - 11:47:32 EDT


I have a few comments on Kenneth Whistler's recent message:

> Yves asked:
>
> >
> > Here is a probably novice-type question for most of you, but so far (and
> > I've been dealing with codeset/charset since years) I couldn't find a
> > straight answer anywhere.
> >
> > What is the difference between a code set and a character set?
> >
> > For example, to me (so far)
> > - Unicode is a character set (a "bag" of characters without code-point
> > assigned)
> > - UCS-2, UTF-8, etc. are code sets ([en]coded character sets, basically
> > the implementation of the character set according a specific encoding
> > scheme).
> > - The general usage has a tendency to blend both.
> >
>
> A character repertoire is an unencoded bag of abstract characters.
> In IBM terminology, in particular, a character repertoire is known
> as a "character set".

Well, a character repertoire is not a "bag" but a set.
The difference is that an item occurs only once in a set, whereas
a bag can hold several occurrences of the same item.
Consider the ordinary meaning of "repertoire": a singer's
repertoire does not list the same song twice. Also consider
the term "character set", which in ISO terminology is
equivalent to "repertoire"; "set" here has its
mathematical meaning.
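
The set/bag distinction can be sketched in a few lines of Python. This is
only an illustration; the song titles and variable names are made up for
the example and do not come from any standard.

```python
from collections import Counter

# Hypothetical listing in which one song is written down twice.
songs = ["Yesterday", "Imagine", "Yesterday"]

repertoire = set(songs)   # a set holds each item at most once
bag = Counter(songs)      # a bag (multiset) records how often each item occurs

distinct_songs = len(repertoire)      # counts each song once
total_entries = sum(bag.values())     # counts repeated entries too
times_listed = bag["Yesterday"]       # multiplicity of one item in the bag
```

The set has no way to say that a song is listed twice, which is the
property a repertoire is supposed to have.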

> A character encoding is a specification of numerical values for a particular
> character repertoire. In IBM terminology, in particular, a character
> encoding is known as a "code page".
>
> Thus, for instance, Code Page 437 (the old IBM U.S. code page) is a
> character encoding; it encodes IBM Character Set 919, which is a
> character repertoire.
>
> Unicode is an encoded character set.

I am not so sure about that. It violates the general principle
that an encoded character set encodes each (abstract) character
in only one way.
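
One possible illustration of the objection, as a sketch in Python. The
choice of characters is an assumption (the comment above names no specific
case), and the normalization step uses Python's unicodedata module purely
to show that the three spellings are treated as canonically equivalent.

```python
import unicodedata

# Three ways of writing what a reader would call the same letter:
precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
angstrom = "\u212B"      # ANGSTROM SIGN

spellings = [precomposed, decomposed, angstrom]

# As raw code point sequences the three spellings are all different.
distinct_raw = set(spellings)

# After canonical composition (NFC) they collapse into a single spelling.
distinct_nfc = {unicodedata.normalize("NFC", s) for s in spellings}
```

Here distinct_raw has three members and distinct_nfc has one, so a single
abstract character has more than one coded representation.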
>
> ISO/IEC 10646 is an encoded character set.

True.

> UCS-2, UCS-4, UTF-8, and UTF-16 are character encoding schemes (CES),
> specifying the actual byte usage for a particular form of use
> of the encoded characters from 10646.

I would only say that for UTF-8 and UTF-16. UCS-2 and UCS-4 are
coded character sets, not encoding schemes.
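
A rough way to see the difference, sketched in Python: the coded character
set assigns one integer to a character, while UTF-8 and UTF-16 each
transform that integer into a different sequence of code units. The sample
character is an arbitrary choice for the example, and Python's codec names
are used only as a convenience.

```python
# A character outside the 16-bit range, chosen arbitrarily for the example.
ch = "\U00010000"   # LINEAR B SYLLABLE B008 A

# The coded character set view: one character, one integer value.
code_point = ord(ch)

# The encoding-scheme view: the same value serialized in different ways.
as_utf8 = ch.encode("utf-8")        # four octets
as_utf16 = ch.encode("utf-16-be")   # a surrogate pair, two 16-bit units
as_ucs4 = ch.encode("utf-32-be")    # the value itself, as four octets

# Read the two 16-bit units back out of the UTF-16 form.
utf16_units = [int.from_bytes(as_utf16[i:i + 2], "big") for i in (0, 2)]
```

In the four-octet form the bytes are simply the code point value, whereas
the UTF-8 and UTF-16 forms have to be decoded before the value reappears.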

> A MIME "charset" is a mapping from byte (~octet) values to characters.
> Because different character encoding schemes change the actual byte
> values that are associated with characters, there has to be one
> MIME charset for UTF-8 and another MIME charset for UTF-16, even
> though both refer to the same encoded character set, namely Unicode.

Well, technically the MIME definitions refer to 10646.
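
The point about needing one charset label per encoding scheme can be
sketched in Python. Python's codec names stand in for MIME charset labels
here, which is an assumption of the example; the sample character is also
arbitrary.

```python
ch = "\u00F8"   # LATIN SMALL LETTER O WITH STROKE

# The same character produces different octets under each label.
octets_utf8 = ch.encode("utf-8")
octets_utf16 = ch.encode("utf-16-be")

# Decoding with the matching label gives the character back.
roundtrip_utf8 = octets_utf8.decode("utf-8")
roundtrip_utf16 = octets_utf16.decode("utf-16-be")

# Decoding the UTF-8 octets under the other label does not fail;
# it silently yields a different character.
mislabelled = octets_utf8.decode("utf-16-be")
```

A receiver therefore cannot recover the characters from the octets alone;
it needs the label to know which mapping applies.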

>
> The terms "code set" or "codeset" are so ambiguous that they are not
> generally in favor among character encoding specialists.

Agree. But I actually prefer them over "character set".

I have proposed a new work item in SC2 to clarify this terminology
in ISO and the relations between the terms.

Keld Simonsen


