Re: Charsets + encoding + codesets

From: Keld Jørn Simonsen
Date: Mon Oct 06 1997 - 17:24:21 EDT

Yves Savourel writes:

> Thanks for the various answers. Keld's paper was also very useful.
> Now two things seem to be clear for me:
> -- A "character set" has no code-points associated to each character.
> -- I should use the term "encoded character set" to name the
> implementation of a character set according to a specific "encoding
> scheme".
> With this in mind I can't help but still have questions:
> -- If UNICODE is an "encoded character set" what is the name of the
> "character set" it implements? (UNICODE as well?). In other words, what
> should I call the character repertoire that UNICODE and 10646 encode?

You can have both a 10646 encoding and a 10646 repertoire.
The canonical encoding of 10646 is UCS-4. That means if you are not
more specific than saying "10646 coded character set" then you mean

The trouble is that the "repertoires" of Unicode and 10646 are different.
10646 is clear on what its repertoire is: it is the characters of all
its code points. Unicode instead speaks of "abstract characters":
you can make abstract characters by combining a number of characters,
such as a base letter and then one or more combining accents.
But the combinations are not defined or limited, so for Unicode
you have an unlimited repertoire of Unicode abstract characters.
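
As a rough illustration of that open-ended repertoire, here is a small
Python sketch. Python and its standard unicodedata module are assumptions
of this example, not something the discussion relies on. It shows one
combination that happens to have a precomposed code point and one that
does not.

```python
import unicodedata

# A base letter followed by one combining accent (U+0301 COMBINING ACUTE ACCENT).
e_acute = "e\u0301"
# This combination also exists as a single precomposed code point, U+00E9.
composed = unicodedata.normalize("NFC", e_acute)

# "q" plus the same accent has no precomposed code point in Python's
# bundled Unicode data, so normalization leaves it as two code points.
q_acute = "q\u0301"
q_normalized = unicodedata.normalize("NFC", q_acute)

# Nothing stops further accents from being stacked on the same base letter.
stacked = "e" + "\u0301" * 3

print(len(e_acute), len(composed))
print(len(q_acute), len(q_normalized))
print([unicodedata.name(c) for c in stacked])
```

The second and third cases are "abstract characters" only in the Unicode
sense: they correspond to no single code point.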

> -- In Ken's definitions the border between "encoding" and encoded
> character sets is not completely clear to me. I thought cp437 would be an
> encoded character set. It also doesn't seem to correspond to Keld's
> definition of "encoding" in his paper that says: "encoding: the relation
> from the binary representation via coded character sets to (abstract)
> characters. The encoding defines the meaning of a binary data stream. It
> can consist of more than one coded character set, and an encoding scheme
> can be applied to regulate how these coded character sets are encoded.
> Also symbolic characters can be encoded in the encoding." If the
> definition is correct and the cp437 is an encoding then what are the
> encoded character set and the encoding scheme?

I made the term "encoding" to cover, for example, the MIME "charset"
definition. "encoding" is also a handy term that just tells
everything about how the characters are encoded (maybe that's why
MIME made it that way). It also corresponds roughly to
what POSIX can define in a "charmap".

You can have null encoding schemes and then the coded character set
is the encoding also.

So cp437 is the coded character set and it is also the encoding.
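
A minimal Python sketch of the difference, assuming Python's built-in
"cp437", "utf-8" and "utf-32-be" codecs (the last one stands in for
big-endian UCS-4 here, which gives the same bytes for this character):

```python
# In cp437 the byte 0x82 is the character "é": one byte, one coded
# character, with no separate encoding scheme in between.
raw = bytes([0x82])
char = raw.decode("cp437")

# In 10646/Unicode the same character has the code point U+00E9 ...
code_point = ord(char)

# ... and the chosen encoding scheme decides which bytes represent it.
utf8_bytes = char.encode("utf-8")
ucs4_bytes = char.encode("utf-32-be")

print(hex(code_point), utf8_bytes, ucs4_bytes)
```

With cp437 the coded character set and the encoding coincide; with 10646
the same code point yields different byte sequences per encoding scheme.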

> Maybe a little table will illustrate better my puzzlement. It seems that
> we have to start with a character set, we apply to it an encoding scheme and
> get an encoded character set. (Maybe I'm too simplistic?)
> Therefore we have something like the following table. But, to me,
> there are missing pieces:
> [character set]   [encoding scheme]   [encoded character set]
> [encoding?]
> ? ?
> IBM 919           cp437(?)            ?(cp437?)
> ?                 UTF-8               ?
> ?                 UCS-2(?)            ?(UCS-2?)
> But beyond that (and more importantly), now here is my real question:
> I'm working with several other people from various localization/tools
> vendor companies to set up a standard format for translation memory
> exchange (TMX). We use an XML-compliant format for this. One of the
> problems we run into is naming one of the attributes of some of the
> elements.
> That attribute specifies what "encoded character set" the original text
> was in (the text in TMX being always in Unicode, using ISO646 and
> character references for code-points above 128). Two terms proposed
> would be CODESET and CHARSET.
> Note that CHARSET is used in HTML, and according to your various answers it
> should not be; note also that the IANA page where the names of the
> "charsets/codesets" are listed (see
> names
> happily everything "character set" (including Unicode, UTF-8, UCS-2,
> Shift-JIS, etc.)

My proposal is "encoding" for the whole lot, or respectively "charset"
meaning the MIME concept.
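
For the TMX convention mentioned in the quoted text (ISO 646 text with
character references for everything else), a rough Python sketch might
look like this. The function name is hypothetical, and the sketch uses
Python's standard "xmlcharrefreplace" error handler; it does not escape
"&" or "<", which a real XML writer would also have to do.

```python
def to_ascii_with_refs(text: str) -> str:
    """Return text restricted to ASCII, using decimal character
    references for every character outside that range."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "Gar\u00e7on \u65e5\u672c"
print(to_ascii_with_refs(sample))
```

The attribute under discussion would then record which encoding the
original text was in before it was converted to this form.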

> The values for that attribute will be Unicode (UCS-2), UTF-8, cp850,
> cp1252, Shift-JIS, EUC-JA, MacRoman, HPRoman8, etc. basically any (and
> more) of the "codesets/charsets" listed in the IANA page.

Yes, why not just use the IANA names and their definitions? I am trying to
get IETF and ISO to align on terminology and definitions. I was involved
in writing MIME; and I am liaison officer from SC2 to IETF.
(I am not speaking with that hat on here, tho.)
Currently the concepts are well aligned, but the terminology is not.
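
To show what reusing registered names can look like in practice, here is
a hedged Python sketch. It consults Python's own codec registry, which
accepts many IANA-style names and aliases but is not the IANA registry
itself; the function name is an assumption made for this example.

```python
import codecs
from typing import Optional


def canonical_codec_name(label: str) -> Optional[str]:
    """Map a charset label to the name Python's codec registry uses,
    or return None when the registry does not know the label."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


for label in ("IBM437", "windows-1252", "UTF-8", "no-such-charset"):
    print(label, "->", canonical_codec_name(label))
```

A format like TMX could validate its attribute values against such a
list instead of defining its own names.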

> What attribute name should we use?
> CHARSET looks incorrect according to your various answers (and I agree).
> CODESET seems not to be much in favor.
> ENCODING then? but some are "encoding schemes" (Keld makes a clear
> distinction between encoding and encoding scheme).

"encoding" is the currently proposed term for handling this kind of
thing in ISO in CD 14652 (where I am the editor) and also in
ISO WD 15435 (both available from the ISO i18n WG page). These specs define
standard APIs for handling character data and also standard formats to define
coded character sets/encodings.

