Re: Charset vs. codeset

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Oct 03 1997 - 20:51:17 EDT


Yves asked:

>
> Here is a probably novice-type question for most of you, but so far (and
> I've been dealing with codeset/charset for years) I couldn't find a
> straight answer anywhere.
>
> What is the difference between a code set and a character set?
>
> For example, to me (so far)
> - Unicode is a character set (a "bag" of characters without code-point
> assigned)
> - UCS-2, UTF-8, etc. are code sets ([en]coded character sets, basically
> the implementation of the character set according to a specific encoding
> scheme).
> - The general usage has a tendency to blend both.
>

A character repertoire is an unencoded bag of abstract characters.
   In IBM terminology, in particular, a character repertoire is known
   as a "character set".

A character encoding is a specification of numerical values for a particular
   character repertoire. In IBM terminology, in particular, a character
   encoding is known as a "code page".

Thus, for instance, Code Page 437 (the old IBM U.S. code page) is a
   character encoding; it encodes IBM Character Set 919, which is a
   character repertoire.
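
As a rough illustration of the repertoire/encoding distinction, the sketch
below takes one abstract character and looks up its numerical value in two
different encodings. It assumes Python's built-in "cp437" and "latin-1"
codecs as stand-ins for Code Page 437 and ISO 8859-1; the variable names
are made up for the example.

```python
# One abstract character from the repertoire: LATIN SMALL LETTER E WITH ACUTE.
ch = "\u00e9"

# Python's "cp437" codec implements the Code Page 437 mapping.
cp437_bytes = ch.encode("cp437")

# ISO 8859-1 (Latin-1) is a different encoding that also covers this character.
latin1_bytes = ch.encode("latin-1")

# Same character, different numerical value in each encoding.
print(cp437_bytes.hex(), latin1_bytes.hex())  # prints: 82 e9
```

The character is the same in both cases; only the value assigned to it
by each encoding differs.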

Unicode is an encoded character set.

ISO/IEC 10646 is an encoded character set.

Unicode and 10646 have exactly the same character repertoire and
   exactly the same encoding of the characters in that repertoire.

UCS-2, UCS-4, UTF-8, and UTF-16 are character encoding schemes (CES),
   specifying the actual byte usage for a particular form of use
   of the encoded characters from 10646.
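
A minimal sketch of what "actual byte usage" means here: one encoded
character (U+00E9) serialized under several schemes. Python has no codecs
named UCS-2 or UCS-4, so this assumes "utf-16-be" and "utf-32-be" as
stand-ins; for this character, in big-endian byte order, they produce the
same bytes that UCS-2 and UCS-4 would.

```python
# The encoded character U+00E9 (its code value is fixed by Unicode/10646).
ch = "\u00e9"

# Serialize the same code value under three encoding schemes.
# "utf-16-be" / "utf-32-be" are stand-ins for big-endian UCS-2 / UCS-4 here.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    # bytes.hex() with a separator argument needs Python 3.8 or later.
    print(f"{name:10} {data.hex(' ')}")

# prints:
# utf-8      c3 a9
# utf-16-be  00 e9
# utf-32-be  00 00 00 e9
```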

UCS-2 and UCS-4 are also known specifically in 10646 as forms of use
   (and refer to whether a character's encoded value is expressed in
   16 bits or 32 bits).

UTF-8 and UTF-16 are also known specifically in 10646 as transformation
   formats.

Unicode, Version 2.0 is an encoded character set that has two
   sanctioned character encoding schemes: UTF-16 (16-bit Unicode
   with surrogates), and UTF-8.
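
For readers unfamiliar with surrogates, here is a small sketch of how
UTF-16 expresses a code value above U+FFFF as a pair of 16-bit code units.
The helper name utf16_code_units is hypothetical; the arithmetic follows
the standard surrogate-pair definition, and the result is cross-checked
against Python's own "utf-16-be" codec.

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units (as a list of ints) for code point cp.

    Illustrative helper only: it does not reject the surrogate range
    D800..DFFF or values above 10FFFF.
    """
    if cp < 0x10000:
        # Values up to FFFF fit in a single 16-bit code unit.
        return [cp]
    offset = cp - 0x10000                # 20-bit offset past FFFF
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]


units = utf16_code_units(0x10000)
print([f"{u:04X}" for u in units])       # prints: ['D800', 'DC00']

# Cross-check against the built-in codec (big-endian byte order).
codec_bytes = chr(0x10000).encode("utf-16-be")
print(codec_bytes.hex(" "))              # prints: d8 00 dc 00
```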

A MIME "charset" is a mapping from byte (~octet) values to characters.
   Because different character encoding schemes change the actual byte
   values that are associated with characters, there has to be one
   MIME charset for UTF-8 and another MIME charset for UTF-16, even
   though both refer to the same encoded character set, namely Unicode.
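
The following sketch shows why the label matters: the same two bytes map
to different characters depending on which charset the receiver is told to
use. The codec names "utf-8" and "utf-16-be" are Python's, used here as an
assumption in place of actual MIME charset labels.

```python
# Two bytes as they might arrive in a message body.
payload = b"\xc3\xa9"

# Interpreted as UTF-8, the two bytes form one character, U+00E9.
as_utf8 = payload.decode("utf-8")

# Interpreted as big-endian UTF-16, the same two bytes form one
# 16-bit code unit, C3A9, which is a different character entirely.
as_utf16 = payload.decode("utf-16-be")

print(f"U+{ord(as_utf8):04X}", f"U+{ord(as_utf16):04X}")  # prints: U+00E9 U+C3A9
```

Both results are characters of the same encoded character set; the byte
values alone do not say which one was meant.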

The terms "code set" or "codeset" are so ambiguous that they are not
   generally in favor among character encoding specialists.

--Ken Whistler
