Re: Charsets + encoding + codesets

From: Kenneth Whistler
Date: Mon Oct 06 1997 - 15:16:01 EDT

Yves asked some follow-up questions:

> with this in mind I can't help but have still questions:
> -- If UNICODE is an "encoded character set" what is the name of the
> "character set" it implements? (UNICODE as well?). In other words, how
> should I call the character repertoire that UNICODE and 10646 encode?

The character repertoire for Unicode and ISO/IEC 10646 is the
"Universal Character Set", or UCS for short. Unlike most other
encoded character sets, which in principle, at least, start with a limited
set of characters, the stated goal of Unicode/10646 is to serve as a
universal encoding of all characters required for information technology.

The content of the UCS (the enumeration of the members of the set, to
use Keld's more mathematical approach) is a moving target. Each
amendment to 10646 which encodes more characters also expands the
repertoire officially covered by the standard. Each publication of
a new version of the Unicode Standard likewise expands the repertoire,
since many new characters are added to the overall encoded set.

> -- In Ken's definitions the border between "encoding" and encoded
> character sets are not completely clear to me. I though cp47 would be an
> encoded character set. It also doesn't seems to correspond to Keld's
> definition of "encoding" in his paper that says: "encoding: the relation
> from the binary representation via coded character sets to (abstract)
> characters. The encoding defines the meaning of a binary data stream. It
> can consist of more than one coded character set, and an encoding scheme
> can be applied to regulate how these coded character sets are encoded.
> Also symbolic characters can be encoded in the encoding." If the
> definition is correct and the cp437 is an encoding then what are the
> encoded character set and the encoding scheme?

cp437 is an encoded character set.

"Encoding" actually has several meanings.

1. The process of assigning numerical values to characters. This is what
   character standards committees do.

     "The job of WG2 is encoding the universal character set."

2. The numerical value associated with an encoded character. This is a
   synonym for "encoded value", "code value", "codepoint", etc.

     "The encoding for the yen sign is U+00A5."

3. An entire encoded character set. This is a synonym for "encoded character
   set", or "code page".

     "The encoding we used for that data was cp437."

4. The mathematical relation (a unique and symmetric mapping function) between a
   character repertoire and coded representations. This is synonymous with the
   term "coded character set" as defined in 10646: "A set of unambiguous
   rules that establishes a character set and the relationship between
   the characters of the set and their coded representations." [The important
   thing here is that each character is associated with a number, and each
   numerical value is unambiguously related to a character.]

      "The Unicode Standard uses a 16-bit encoding for characters."

5. The mathematical relation (non-unique and asymmetric) between bit values
   used in character data representation for information interchange and
   the characters that data represents. This is synonymous with the term
   "character encoding scheme" as used by the Internet Architecture Board.
   It also seems to be what Keld is defining above. [The important
   distinction is that for some encoding schemes, such as ISO 2022, the
   relation between any particular sequence of bits and characters may
   be non-unique in both directions. The "encoding" in this sense defines
   how to get from the bits to the characters, but not necessarily the
   reverse.]

       "Each different encoding requires registration of a different
        MIME charset."
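
To make the difference between senses 4 and 5 concrete, here is a minimal
Python sketch. It assumes Python's built-in `iso2022_jp` codec as a stand-in
for an ISO 2022-based encoding scheme; the sample bytes are chosen purely for
illustration.

```python
# Sense 4: a coded character set pairs each character with exactly one number.
yen = "\u00a5"
assert ord(yen) == 0x00A5          # character -> code value
assert chr(0x00A5) == yen          # code value -> the same character

# Sense 5: in an ISO 2022-based scheme the same two bytes (0x30 0x21) can
# stand for different characters, depending on the escape sequence seen
# earlier in the stream.
ascii_state = b"0!".decode("iso2022_jp")
kanji_state = b"\x1b$B0!\x1b(B".decode("iso2022_jp")

# One character can also be reached from different byte sequences.
plain = b"A".decode("iso2022_jp")
redundant = b"\x1b(BA".decode("iso2022_jp")  # explicit switch to ASCII first

print(repr(ascii_state), repr(kanji_state), plain == redundant)
```

The bytes alone do not identify the characters; the decoder also has to track
the shift state, which is why the relation is not a simple one-to-one mapping.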

> Maybe a little table will illustrate better my puzzlement. It seems that
> we have to start a character set, we apply to it an encoding scheme and
> get a encoded character set. (Maybe I'm too simplistic?)

Actually, the way I conceive of this is as follows:

A. Define a repertoire (= character set) to be encoded.
B. Map the repertoire to numbers. (I.e. "encode" the character set.)
C. Define how the numbers are to appear and be used in actual streams of
   bytes in data interchange (or bit streams, if you prefer). This is
   the application of an encoding scheme to the set of numbers that
   were defined in step B.

For many people, the distinction between step B and C is confusing, since
for so long the association between "character" and (8-bit) "byte" has
been so rigid that assignment of a coded value for a character seemed
as if it also specified the data representation. But if you think about
Asian "DBCS" encoded character sets, this has never been the case--and
the distinction between the encoding and the encoding scheme(s) for
JIS, for example, is much clearer.
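
Steps A through C can be sketched in Python for a single JIS character. This
is only an illustration: it assumes Python's `iso2022_jp`, `euc_jp` and
`shift_jis` codecs, and it recovers the JIS X 0208 code value from the EUC-JP
bytes, which store each JIS byte with the high bit set.

```python
ch = "\u4e9c"  # step A: one member of the repertoire (a kanji)

# Step B: the number assigned to the character in JIS X 0208.
# EUC-JP carries each JIS byte with the high bit set, so clear that bit.
euc = ch.encode("euc_jp")
jis_code = ((euc[0] & 0x7F) << 8) | (euc[1] & 0x7F)

# Step C: three encoding schemes carry that same code value differently.
schemes = {
    name: ch.encode(name).hex(" ")
    for name in ("iso2022_jp", "euc_jp", "shift_jis")
}

print(hex(jis_code))
for name, octets in schemes.items():
    print(f"{name:12} {octets}")
```

The code value from step B stays the same throughout; only the byte
representation chosen in step C changes.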

> Therefore we have something like this the following table. But, to me,
> there are missing pieces:
> [character set] [encoding scheme] [encoded character set]
> [encoding?]
> ? ?
> IBM 919 cp437(?) ?(cp437?)
> ? UTF-8 ?
> ? UCS-2(?) ?(UCS-2?)

Maybe the easiest way to clarify this is to quote from some mail
I sent out privately a few months ago regarding a better way to specify
a character set registry.

<start quote>

A form I would like to see a consistent registry expressed in would
include the following information:

Standard(s)/PAS         Repertoire       CCS     CES     Short Tag  Etc.

Where the first 5 fields are *all* obligatory for an encoded character
set entry to be complete. e.g.:

ISO 8859-1:1987         Latin-1          8859-1  8bit-I  iso8859-1
ISO 10646-1:1993        UCS              10646   UCS-4   ucs4
ISO 10646-1:1993, USV2  UCS              10646   UTF-8   utf8
ISO 10646-1:1993, USV2  UCS              10646   UTF-16  utf16
ISO 10646-1:1993        BMP              10646   UCS-2   ucs2
CDRA (CCSID 00437)      CS01212          CP437   8bit    cp437
CDRA (CCSID 00850)      CS01106          CP850   8bit    cp850
CDRA (CCSID 00037)      CS00697          CP037   8bit-E  cp037
CDRA (CCSID 00938)      CS00103+         CP904+  DBCS-M  cp938
                        CS00935          CP927
CDRA (CCSID 00937)      CS01175+         CP037+  SISO    cp937
                        CS00935          CP835
Mac OS Cyrillic         Mac Cyrillic     MacCyr  8bit    mac-cyr
Microsoft tables        JIS X-0201+      CP932   DBCS-M  cp932ms
                        JIS X-0208+
                        IBM extensions+
                        MS extensions

and so on. The Etc. columns would contain all the other useful
information about the coded character set (its usage, and all the
various crossmappings to vendor, ISO, and Internet IDs, etc.)

Note that unlike the IANA registry, which seems to treat the CES as
a sometime thing, I consider it an essential attribute of *every*
encoded character set. The terms I use above include:

   8bit     single character mapped to single byte, no C0/C1 restrictions
   8bit-I   single character mapped to single byte, ISO C0/C1 restrictions
   8bit-E   single character mapped to single byte, EBCDIC structure
   DBCS-M   shift-based DBCS mapping
   SISO     SI/SO run-encoded DBCS switching for EBCDIC hosts

And of course there are the 7bit CES's and the various other
EUC and 2022-based schemes, etc.

<end quote>

Some other terminology clarifications:
PAS: Publicly Available Specification
USV2: The Unicode Standard, Version 2.0
BMP: The Basic Multilingual Plane
CDRA: Character Data Representation Architecture (IBM's standard)
CCSID: Coded Character Set Identifier
CCS: Coded Character Set
CES: Character Encoding Scheme
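
One way such a registry entry might be modelled is sketched below in Python.
The class and field names (`RegistryEntry`, `REGISTRY`, and so on) are
hypothetical; only the sample values are taken from the table quoted above.
The check in `__post_init__` is one possible reading of "all five fields are
obligatory".

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RegistryEntry:
    standard: str    # Standard(s)/PAS
    repertoire: str
    ccs: str         # coded character set
    ces: str         # character encoding scheme
    short_tag: str   # unique identifier for the entry

    def __post_init__(self):
        # All five fields are obligatory, so reject empty values.
        for f in fields(self):
            if not getattr(self, f.name):
                raise ValueError(f"missing field: {f.name}")


ROWS = [
    ("ISO 8859-1:1987", "Latin-1", "8859-1", "8bit-I", "iso8859-1"),
    ("ISO 10646-1:1993, USV2", "UCS", "10646", "UTF-8", "utf8"),
    ("ISO 10646-1:1993", "BMP", "10646", "UCS-2", "ucs2"),
    ("CDRA (CCSID 00437)", "CS01212", "CP437", "8bit", "cp437"),
]

# Index the entries by short tag, which is meant to be unique.
REGISTRY = {row[-1]: RegistryEntry(*row) for row in ROWS}

entry = REGISTRY["utf8"]
print(entry.ccs, entry.ces)
```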

Note that some encoded character sets have complicated definitions of
their repertoires and/or their encodings. For example, the IBM PC
and Host Japanese encodings both contain a two-part repertoire and
a two-part encoding (mixed single-byte and double-byte code pages).
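
The mixed single-byte and double-byte behaviour can be seen in a short Python
sketch. It assumes Python's `cp932` codec is a reasonable stand-in for the
Microsoft code page 932 tables; the three sample characters are arbitrary.

```python
text = "A\uff71\u4e9c"  # Latin letter, half-width katakana, kanji

# Encode one character at a time to see how many bytes each one takes.
widths = [(f"U+{ord(c):04X}", len(c.encode("cp932"))) for c in text]

print(widths)
print(text.encode("cp932").hex(" "))
```

The first two characters come from the single-byte part and the kanji from
the double-byte part, yet all three share one byte stream.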

> But beyond that (and more importantly), now here is my real question:
> I'm working with several other people from various localization/tools
> vendors companies to set up a standard format for translation memory
> exchange (TMX). We use an XML-compliant format for this. One of the
> problem we run into is naming one of the attribute of some of the
> elements.
> That attribute specifies what "encoded character set" the original text
> was in (the text in TMX being always in Unicode, using ISO646 and
> character references for code-points above 128). Two terms proposed
> would be CODESET and CHARSET.
> Note that CHARSET is used in HTML, and according your various answer it
> should not, note also that the IANA page where the name of the
> "charsets/codesets" are listed (see
> names
> happily everything "character set" (including Unicode, UTF-8, UCS-2,
> Shift-JIS, etc.)

It absolutely is a horrible hodge-podge. The maintainers of the
IANA registry are trying to tighten it up, but as Martin pointed
out, there already are all kinds of useless entities registered
as "character sets".

> The values for that attributes will be Unicode (UCS-2), UTF-8, cp850,
> cp1252, Shift-JIS, EUC-JA, MacRoman, HPRoman8, etc. basically any (and
> more) of the "codesets/charsets" listed in the IANA page.

Actually, the values for the attributes should be conceived of as
an entry in the table I have shown above. The Short Tag should be
a unique identifier for that entry. Once you have matched an entry
in the table as defined above, both the character encoding and the encoding
scheme are unambiguously identified. This enables conversion
from data representation (the byte stream) to characters, which
is what the processing code needs in order to interpret the data.
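
A minimal sketch of that lookup in Python follows. The `TAG_TO_CODEC` table
and the `decode_tagged` helper are hypothetical, and the pairing of short tags
with Python codec names is an assumption made for illustration, not part of
any registry.

```python
# Hypothetical mapping from registry short tags to Python codec names.
TAG_TO_CODEC = {
    "iso8859-1": "latin-1",
    "utf8": "utf-8",
    "cp437": "cp437",
    "cp850": "cp850",
}


def decode_tagged(data: bytes, short_tag: str) -> str:
    """Interpret a byte stream using the entry named by its short tag."""
    try:
        codec = TAG_TO_CODEC[short_tag]
    except KeyError:
        raise LookupError(f"unregistered short tag: {short_tag}") from None
    return data.decode(codec)


# The same byte means different characters under different entries.
samples = {tag: decode_tagged(b"\x9d", tag) for tag in ("cp437", "cp850")}
print(samples)
```

Without the tag, the byte 0x9D could be the yen sign (cp437) or a capital O
with stroke (cp850); with it, the interpretation is fixed.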

> What attribute name should we use?
> CHARSET looks incorrect according your various answers (and I agree).
> CODESET seems to be not very in favor.
> ENCODING then? but some are "encoding schemes" (Keld makes a clear
> distinction between encoding and encoding scheme).

Actually, "CHARSET" is probably your best choice, following HTML, I would
think. In any case, you should do whatever is best practice in XML.

Any XML experts care to comment?


> Any suggestions would be immensely appreciated.
> Thanks.
> --Yves
