Re: Charsets + encoding + codesets

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Mon Oct 06 1997 - 17:24:21 EDT


Yves Savourel writes:

> Thanks for the various answers. Keld's paper was also very useful.
> Now two things seem to be clear for me:
>
> -- A "character set" has no code-points associated to each character.
> -- I should use the term "encoded character set" to name the
> implementation of a character set according to a specific "encoding
> scheme".
>
> With this in mind I can't help but still have questions:
>
> -- If UNICODE is an "encoded character set" what is the name of the
> "character set" it implements? (UNICODE as well?). In other words, what
> should I call the character repertoire that UNICODE and 10646 encode?

You can have both a 10646 encoding and a 10646 repertoire.
The canonical encoding of 10646 is UCS-4. That means if you are not
more specific than saying "10646 coded character set" then you mean
UCS-4.
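
To illustrate the UCS-4 form, here is a minimal sketch in Python. It assumes
that Python's "utf-32-be" codec can stand in for big-endian UCS-4, which holds
for the code points shown here; the helper name ucs4_bytes is made up for this
example.

```python
def ucs4_bytes(ch: str) -> bytes:
    """Encode one character as four big-endian octets (UCS-4 style)."""
    # Assumption: Python's UTF-32BE codec is used as a stand-in for UCS-4.
    return ch.encode("utf-32-be")


for ch in ("A", "\u00e9", "\u3042"):
    encoded = ucs4_bytes(ch)
    # The four octets, read as one integer, are the code point itself.
    assert int.from_bytes(encoded, "big") == ord(ch)
    print(f"U+{ord(ch):04X} -> {encoded.hex()}")
```

Every character takes exactly four octets, and the octets are the code point
value itself, with no further transformation.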

The trouble is that the "repertoires" of Unicode and 10646 are different.
10646 is clear on what its repertoire is: the characters at all of
its code points. Unicode instead speaks of "abstract characters", and
an abstract character can be made by combining a number of characters,
such as a base letter followed by one or more combining accents.
The combinations are not defined or limited, so Unicode
has an unlimited repertoire of abstract characters.
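
The open-ended nature of combining sequences can be sketched with Python's
standard unicodedata module. The helper name nfc_length is made up for this
example, and NFC normalization is used only as a convenient way to see whether
a combination has a single code point of its own.

```python
import unicodedata


def nfc_length(s: str) -> int:
    """Number of code points left after NFC (composing) normalization."""
    return len(unicodedata.normalize("NFC", s))


e_acute = "e\u0301"  # base letter e + COMBINING ACUTE ACCENT
q_acute = "q\u0301"  # base letter q + the same combining accent

# e + acute has a precomposed code point (U+00E9), so it composes to one.
assert unicodedata.normalize("NFC", e_acute) == "\u00e9"
assert nfc_length(e_acute) == 1

# q + acute has no code point of its own and stays a two-character sequence.
assert nfc_length(q_acute) == 2
```

Both sequences are valid text, but only the first corresponds to a single
code point in the coded character set.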

> -- In Ken's definitions the border between "encoding" and "encoded
> character set" is not completely clear to me. I thought cp437 would be an
> encoded character set. It also doesn't seem to correspond to Keld's
> definition of "encoding" in his paper that says: "encoding: the relation
> from the binary representation via coded character sets to (abstract)
> characters. The encoding defines the meaning of a binary data stream. It
> can consist of more than one coded character set, and an encoding scheme
> can be applied to regulate how these coded character sets are encoded.
> Also symbolic characters can be encoded in the encoding." If the
> definition is correct and cp437 is an encoding, then what are the
> encoded character set and the encoding scheme?

I made the term "encoding" to cover, for example, the MIME "charset"
definition. "Encoding" is also a handy term that simply tells
everything about how the characters are encoded (maybe that is why
MIME did it that way). It also corresponds roughly to
what POSIX can define in a "charmap".

An encoding scheme can be null, and then the coded character set
is also the encoding.

So cp437 is the coded character set and it is also the encoding.
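
The difference between a null encoding scheme and a non-trivial one can be
sketched in Python. The example assumes Python's built-in "cp437" and "euc_jp"
codecs as stand-ins for the encodings discussed; EUC is chosen here only as a
familiar case of an encoding scheme that combines several coded character sets.

```python
# cp437: one octet maps directly to one character, with no scheme in between.
assert bytes([0x82]).decode("cp437") == "\u00e9"        # 0x82 is e-acute
assert len(bytes(range(256)).decode("cp437")) == 256    # 256 octets, 256 chars

# EUC-JP: an encoding scheme that multiplexes several coded character sets.
samples = {
    "A": b"A",                # ASCII range, a single octet
    "\u3042": b"\xa4\xa2",    # JIS X 0208 hiragana A, two octets
    "\uff71": b"\x8e\xb1",    # JIS X 0201 half-width katakana A, after SS2 (0x8E)
}
for ch, expected in samples.items():
    assert ch.encode("euc_jp") == expected
    print(f"U+{ord(ch):04X} -> {expected.hex()}")
```

In the cp437 case the table of code points is all there is to know. In the
EUC-JP case the octet stream only makes sense once the scheme has said which
coded character set each octet sequence belongs to.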
>
> Maybe a little table will better illustrate my puzzlement. It seems that
> we have to start with a character set, we apply to it an encoding scheme and
> get an encoded character set. (Maybe I'm too simplistic?)
> Therefore we have something like the following table. But, to me,
> there are missing pieces:
>
> [character set]  [encoding scheme]  [encoded character set]  [encoding?]
> JIS-xxx          EUC                EUC-JA
> ?                ?                  UNICODE
> IBM 919          cp437(?)           ?(cp437?)
> ?                UTF-8              ?
> ?                UCS-2(?)           ?(UCS-2?)
>
>
> But beyond that (and more importantly), now here is my real question:
>
> I'm working with several other people from various localization and tools
> vendors to set up a standard format for translation memory
> exchange (TMX). We use an XML-compliant format for this. One of the
> problems we have run into is naming one of the attributes of some of the
> elements.
>
> That attribute specifies what "encoded character set" the original text
> was in (the text in TMX being always in Unicode, using ISO646 and
> character references for code-points above 128). Two terms proposed
> would be CODESET and CHARSET.
>
> Note that CHARSET is used in HTML, and according to your various answers it
> should not be. Note also that the IANA page where the names of the
> "charsets/codesets" are listed (see
> ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets) names
> happily everything "character set" (including Unicode, UTF-8, UCS-2,
> Shift-JIS, etc.)

My proposal is "encoding" for the whole lot, or alternatively "charset",
meaning the MIME concept.
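
As a rough sketch of the convention described in the quoted question (text
stored in ISO 646 with character references for everything else, plus a record
of the original encoding), the following Python uses only the standard
library. The function name, the dictionary layout and the key "encoding" are
assumptions made for illustration and are not taken from any TMX draft.

```python
from xml.sax.saxutils import escape


def to_ascii_with_charrefs(raw: bytes, original_encoding: str) -> str:
    """Decode text from its original encoding and re-express it in ASCII,
    replacing every non-ASCII character with a numeric character reference."""
    text = raw.decode(original_encoding)
    # Escape XML markup characters first, then replace non-ASCII characters.
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


record = {
    "encoding": "cp437",  # hypothetical attribute name and value
    "text": to_ascii_with_charrefs(b"caf\x82 & th\x82", "cp437"),
}
print(record["text"])  # caf&#233; &amp; th&#233;
```

The stored text no longer depends on the original encoding, so the attribute
is purely informational about where the text came from.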

> The values for that attribute will be Unicode (UCS-2), UTF-8, cp850,
> cp1252, Shift-JIS, EUC-JA, MacRoman, HPRoman8, etc. basically any (and
> more) of the "codesets/charsets" listed in the IANA page.

Yes, why not just use the IANA names and their definitions? I am trying to
get IETF and ISO to align on terminology and definitions. I was involved
in writing MIME, and I am liaison officer from SC2 to IETF.
(I am not speaking with that hat on here, though.)
Currently the concepts are well aligned, but the terminology is not.
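
One practical benefit of reusing registered names is that existing software
already recognizes many of them. The sketch below assumes Python's codec
registry, which is not the IANA registry itself but accepts many IANA names and
aliases; the helper name resolve is made up for this example.

```python
import codecs


def resolve(label: str) -> str:
    """Return Python's canonical codec name for a charset label.

    Raises LookupError if the label is not known to Python's codec registry.
    """
    return codecs.lookup(label).name


for label in ("UTF-8", "Shift_JIS", "windows-1252", "ISO-8859-1", "cp850"):
    print(label, "->", resolve(label))

# An unregistered label is rejected instead of being guessed at.
try:
    resolve("no-such-charset")
except LookupError:
    print("no-such-charset -> unknown")
```

A format that stores such labels can then validate attribute values on input
with a lookup of this kind, whatever the attribute ends up being called.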

> What attribute name should we use?
> CHARSET looks incorrect according to your various answers (and I agree).
> CODESET seems not to be much in favor.
> ENCODING then? but some are "encoding schemes" (Keld makes a clear
> distinction between encoding and encoding scheme).

"encoding" is the currently proposed term for handling this kind of
thing in ISO, in CD 14652 (where I am the editor) and also in
ISO WD 15435 (both available from the ISO i18n WG page
http://www.dkuug.dk/jtc1/sc22/wg20/). These specs define standard
APIs for handling character data and also standard formats for defining
coded character sets/encodings.

Keld


