Re: Charset vs. codeset

From: Kenneth Whistler ([email protected])
Date: Fri Oct 03 1997 - 20:51:17 EDT

Next message: Keld Jørn Simonsen: "Re: Charset vs. codeset"
Previous message: Yves Savourel: "Charset vs. codeset"
Maybe in reply to: Yves Savourel: "Charset vs. codeset"
Next in thread: Keld Jørn Simonsen: "Re: Charset vs. codeset"

Yves asked:

>
> Here is what is probably a novice-type question for most of you, but so
> far (and I've been dealing with codesets/charsets for years) I couldn't
> find a straight answer anywhere.
>
> What is the difference between a code set and a character set?
>
> For example, to me (so far)
> - Unicode is a character set (a "bag" of characters without code points
> assigned)
> - UCS-2, UTF-8, etc. are code sets ([en]coded character sets, basically
> the implementation of the character set according to a specific encoding
> scheme).
> - The general usage has a tendency to blend both.
>

A character repertoire is an unencoded bag of abstract characters.
In IBM terminology, in particular, a character repertoire is known
as a "character set".

A character encoding is a specification of numerical values for a particular
character repertoire. In IBM terminology, in particular, a character
encoding is known as a "code page".

Thus, for instance, Code Page 437 (the old IBM U.S. code page) is a
character encoding; it encodes IBM Character Set 919, which is a
character repertoire.
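
The repertoire/encoding split can be sketched in a few lines of Python.
This is only an illustration: it assumes Python's built-in "cp437" and
"latin-1" codecs faithfully reflect the corresponding code pages, and the
three-character `repertoire` is a made-up sample, not IBM Character Set 919.

```python
# A repertoire is an unordered bag of abstract characters: no numbers yet.
repertoire = {"A", "\u00e9", "\u00f1"}  # A, e-acute, n-tilde (sample only)

# A character encoding assigns a numerical value to each character.
# Ask the cp437 codec which value each character in the sample receives.
cp437_values = {ch: ch.encode("cp437")[0] for ch in repertoire}

# A different encoding may assign different values to the same characters.
latin1_values = {ch: ch.encode("latin-1")[0] for ch in repertoire}

for ch in sorted(repertoire):
    print(f"{ch!r}: cp437=0x{cp437_values[ch]:02X} "
          f"latin-1=0x{latin1_values[ch]:02X}")
```

The characters themselves do not change between the two columns; only the
numbers assigned to them do.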

Unicode is an encoded character set.

ISO/IEC 10646 is an encoded character set.

Unicode and 10646 have exactly the same character repertoire and
exactly the same encoding of the characters in that repertoire.
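
As a small illustration of what "encoded character set" means here, the
Python sketch below looks up the numerical value and the name of one
character. It assumes that Python's `ord` and `unicodedata` reflect the
Unicode assignments; no bytes are involved at this level.

```python
import unicodedata

ch = "\u00e9"                 # one abstract character from the repertoire
code_point = ord(ch)          # the number the encoded character set assigns

# Print the value in the usual U+XXXX notation together with its name.
print(f"U+{code_point:04X} {unicodedata.name(ch)}")

# The mapping is reversible: the number identifies the character.
assert chr(code_point) == ch
```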

UCS-2, UCS-4, UTF-8, and UTF-16 are character encoding schemes (CES),
specifying the actual byte usage for a particular form of use
of the encoded characters from 10646.
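
To see how one encoded character turns into different byte sequences, here
is a minimal Python sketch. Python has no codecs named UCS-2 or UCS-4, so
"utf-16-be" and "utf-32-be" stand in for them as an assumption; for a
character with a value below 0x10000, like the one used here, they give
the same octets that big-endian UCS-2 and UCS-4 would.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, encoded value 0x00E9

# Serialize the same encoded character under three schemes.
schemes = ("utf-8", "utf-16-be", "utf-32-be")
encoded = {scheme: ch.encode(scheme) for scheme in schemes}

for scheme, octets in encoded.items():
    # bytes.hex() shows the octets actually produced by each scheme.
    print(f"{scheme:10} {octets.hex()}")
```

The encoded value (0x00E9) is the same in every case; what differs is how
many octets are used and what they contain.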

UCS-2 and UCS-4 are also known specifically in 10646 as forms of use
(and refer to whether a character's encoded value is expressed in
16 bits or 32 bits).

UTF-8 and UTF-16 are also known specifically in 10646 as transformation
formats.

Unicode, Version 2.0 is an encoded character set that has two
sanctioned character encoding schemes: UTF-16 (16-bit Unicode
with surrogates), and UTF-8.
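
The "with surrogates" part of UTF-16 can be sketched as follows. The
function name `to_surrogates` is hypothetical, and the arithmetic is the
standard surrogate-pair calculation as I understand it, shown only as an
illustration and cross-checked against Python's own "utf-16-be" codec.

```python
def to_surrogates(code_point):
    """Split a value in 0x10000..0x10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("value does not need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogates(0x10000)
print(f"{high:04X} {low:04X}")

# Cross-check against the codec: two 16-bit units, big-endian.
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert chr(0x10000).encode("utf-16-be") == expected
```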

A MIME "charset" is a mapping from byte (~octet) values to characters.
   Because different character encoding schemes change the actual byte
   values that are associated with characters, there has to be one
   MIME charset for UTF-8 and another MIME charset for UTF-16, even
   though both refer to the same encoded character set, namely Unicode.
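
A short Python sketch of why the charset label has to match the bytes.
The helper `decode_body` is hypothetical, and Python codec names are used
as stand-ins for MIME charset labels; this is not a MIME parser.

```python
def decode_body(octets, charset):
    """Hypothetical helper: interpret raw octets under a charset label."""
    return octets.decode(charset)


text = "caf\u00e9"

# The same characters produce different octets under each scheme.
as_utf8 = text.encode("utf-8")        # 5 octets
as_utf16 = text.encode("utf-16-be")   # 8 octets

# With the matching label, both byte sequences decode to the same text.
assert decode_body(as_utf8, "utf-8") == text
assert decode_body(as_utf16, "utf-16-be") == text

# With the wrong label, the octets map to different characters.
mislabeled = decode_body(as_utf8, "iso-8859-1")
print(repr(mislabeled))
```

The label describes the byte-to-character mapping, so two labels are needed
even though both byte sequences carry the same Unicode characters.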

The terms "code set" or "codeset" are so ambiguous that they are not
generally in favor among character encoding specialists.

--Ken Whistler

