Re: Charsets + encoding + codesets

From: Martin J. Dürst (mduerst@ifi.unizh.ch)
Date: Wed Oct 08 1997 - 08:37:16 EDT


On Mon, 6 Oct 1997, Kenneth Whistler wrote:

> Yves asked some follow-up questions:

> 2. The numerical value associated with an encoded character. This is a
> synonym for "encoded value", "code value", "codepoint", etc.
>
> "The encoding for the yen sign is U+00A5."

In my understanding, 'codepoint' refers to both the number and the
character 'encoded' at that number.
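
To make the two senses concrete, here is a minimal sketch using the yen sign from the quoted example. Python and its standard `unicodedata` module are my choice for illustration and do not come from the thread.

```python
import unicodedata

yen = "\u00a5"                  # the character encoded at U+00A5

number = ord(yen)               # the numerical value (0xA5)
name = unicodedata.name(yen)    # the character found at that number

print(f"U+{number:04X} {name}")
```

The number and the character are two views of the same table entry, which is why 'codepoint' is often used loosely for both.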

> Maybe the easiest way to clarify this is to quote from some mail
> I sent out privately a few months ago regarding a better way to specify
> a character set registry.
>
> <start quote>
>
> A form I would like to see a consistent registry expressed in would
> include the following information:
>
> Standard(s)/PAS         Repertoire       CCS     CES     Short Tag  Etc.
>
> Where the first 5 fields are *all* obligatory for an encoded character
> set entry to be complete, e.g.:
>
> ISO 8859-1:1987         Latin-1          8859-1  8bit-I  iso8859-1
> ISO 10646-1:1993        UCS              10646   UCS-4   ucs4
> ISO 10646-1:1993, USV2  UCS              10646   UTF-8   utf8
> ISO 10646-1:1993, USV2  UCS              10646   UTF-16  utf16
> ISO 10646-1:1993        BMP              10646   UCS-2   ucs2
> CDRA (CCSID 00437)      CS01212          CP437   8bit    cp437
> CDRA (CCSID 00850)      CS01106          CP850   8bit    cp850
> CDRA (CCSID 00037)      CS00697          CP037   8bit-E  cp037
> CDRA (CCSID 00938)      CS00103+         CP904+  DBCS-M  cp938
>                         CS00935          CP927
> CDRA (CCSID 00937)      CS01175+         CP037+  SISO    cp937
>                         CS00935          CP835
> Mac OS Cyrillic         Mac Cyrillic     MacCyr  8bit    mac-cyr
> Microsoft tables        JIS X-0201+      CP932   DBCS-M  cp932ms
>                         JIS X-0208+
>                         IBM extensions+
>                         MS extensions
>
> and so on. The etc. columns would contain all the other useful
> information about the coded character set (its usage, and all the
> various crossmappings to vendor and ISO and InterNet id's, etc.)

Where did you get your short tags from? The largest and most widely
used collection of tags in this area is the IANA "charset" registry.
At least three of your short tags are wrong in this respect: they
should be iso-8859-1, utf-8, and utf-16.
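
A registry consumer could reconcile such private short tags with the registered spellings through a small lookup table. The sketch below is only an illustration: the table `SHORT_TAG_TO_IANA` and the function `to_iana_tag` are hypothetical names, and the table holds just the three tags named above.

```python
# Hypothetical lookup table: short tags from the quoted registry table on
# the left, the spellings given in this reply on the right.
SHORT_TAG_TO_IANA = {
    "iso8859-1": "iso-8859-1",
    "utf8": "utf-8",
    "utf16": "utf-16",
}


def to_iana_tag(tag: str) -> str:
    """Return the registered spelling for a known short tag.

    Tags are compared case-insensitively. Unknown tags are returned
    lowercased and otherwise unchanged, rather than guessed at.
    """
    key = tag.strip().lower()
    return SHORT_TAG_TO_IANA.get(key, key)


print(to_iana_tag("UTF8"))
```

Passing unknown tags through unchanged is a deliberate choice here: guessing a spelling risks mislabeling data, which is worse than reporting a tag nobody recognizes.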

> Note that unlike the IANA registry, which seems to treat the CES as
> a sometime thing, I consider it an essential attribute of *every*
> encoded character set. The terms I use above include:
>
> 8bit      single character mapped to single byte, no C0/C1 restrictions
> 8bit-I    single character mapped to single byte, ISO C0/C1 restrictions
> 8bit-E    single character mapped to single byte, EBCDIC structure
> DBCS-M    shift-based DBCS mapping
> SISO      SI/SO run-encoded DBCS switching for EBCDIC hosts
>
> And of course there are the 7bit CES's and the various other
> EUC and 2022-based schemes, etc.

The IANA registry doesn't make any explicit mention of CES; it just
looks at the overall result. This makes a lot of sense in practice.
Some of the MIME standards, which are the basis for the IANA
registry, historically ignore CES (that's why they called an
encoding a "character set"). While this terminology has
unfortunately been kept in the MIME standards, it is not
general IETF terminology.
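
What "the overall result" means can be sketched as follows: a single label is enough to get from bytes to characters and back, whether or not one thinks of it as a CCS plus a separate CES. The labels are Python codec names used as stand-ins for the tags in this thread. Treating "utf-32-be" as UCS-4 and "utf-16-be" as UCS-2 is an approximation I am assuming for illustration.

```python
yen = "\u00a5"  # YEN SIGN, U+00A5

# Python codec names, assumed here as stand-ins for the registry tags.
labels = ("latin-1", "utf-8", "utf-16-be", "utf-32-be")

# One character, one label per entry, and the resulting byte sequence.
encoded = {label: yen.encode(label) for label in labels}

for label, data in encoded.items():
    # The label alone is sufficient to recover the character.
    assert data.decode(label) == yen
    print(f"{label:10} {data.hex()}")
```

The same code point yields four different byte sequences, and the label is all a receiver needs to undo each of them.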

> > But beyond that (and more importantly), now here is my real question:
> >
> > I'm working with several other people from various localization and
> > tools vendors to set up a standard format for translation memory
> > exchange (TMX). We use an XML-compliant format for this. One of the
> > problems we have run into is naming one of the attributes of some of
> > the elements.

In XML, character encoding issues are very well defined, and more
or less follow the HTML model.

> > That attribute specifies what "encoded character set" the original text
> > was in (the text in TMX being always in Unicode, using ISO646 and
> > character references for code-points above 128). Two terms proposed
> > would be CODESET and CHARSET.

I don't know what TMX is. What you call "original text" is what comes
into the parser: what is on the wire or in a file. TMX may denote
the internal (processing) code. This is not required to be Unicode,
but it has to behave as if it were.
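
The representation described in the quoted question, ASCII on the wire with character references for everything else, can be sketched like this. The element name `seg` is hypothetical and chosen only for the example. The parser is Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

text = "Price: \u00a5100"

# Serialize as ASCII, replacing each non-ASCII character with a numeric
# character reference. Note that this handler does not escape "&" or "<";
# the sample text contains neither.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

# <seg> is a hypothetical element name used only for this example.
doc = b"<seg>" + ascii_bytes + b"</seg>"

# The parser resolves the reference, so the application sees the
# character itself and not the reference.
parsed = ET.fromstring(doc).text
```

What is in the file is pure ASCII, while what the application processes behaves as Unicode text, which is the distinction drawn above.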

> > Note that CHARSET is used in HTML, and according to your various
> > answers it should not be. Note also that the IANA page where the names
> > of the "charsets/codesets" are listed (see
> > ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets) happily
> > names everything "character set" (including Unicode, UTF-8, UCS-2,
> > Shift-JIS, etc.)
>
> It absolutely is a horrible hodge-podge. The maintainers of the
> IANA registry are trying to tighten it up, but as Martin pointed
> out, there already are all kinds of useless entities registered
> as "character sets".

This is true. But what is more important for actual practice is
that this is also the broadest, most open, and most widely used
collection of useful tags, even though those in broad use are only a
small subset of the overall collection. So HTML, XML, and many other
Internet technologies are well advised to use a carefully selected
subset of the IANA tags.

 
> > The values for that attribute will be Unicode (UCS-2), UTF-8, cp850,
> > cp1252, Shift-JIS, EUC-JA, MacRoman, HPRoman8, etc. basically any (and
> > more) of the "codesets/charsets" listed in the IANA page.
>
> Actually, the values for the attributes should be conceived of as
> an entry in the table I have shown above. The Short Tag should be
> a unique identifier for that entry. Once you have matched an entry
> in the table as defined above, both the character encoding and the encoding
> scheme are unambiguously identified. This enables conversion
> from data representation (the byte stream) to characters, which
> is what the processing code needs in order to interpret the data.

Some of the attribute values above are wrong: it should be EUC-JP
(not EUC-JA) and SHIFT_JIS (not Shift-JIS).

> > What attribute name should we use?
> > CHARSET looks incorrect according your various answers (and I agree).
> > CODESET seems to be not very in favor.
> > ENCODING then? but some are "encoding schemes" (Keld makes a clear
> > distinction between encoding and encoding scheme).
>
> Actually, "CHARSET" is probably your best choice, following HTML, I would
> think. In any case, you should do whatever is best practice in XML.

I agree. CHARSET is the best choice.

Regards, Martin.


