Re: Charsets + encoding + codesets

From: Martin J. Dürst
Date: Wed Oct 08 1997 - 08:50:24 EDT

On Tue, 7 Oct 1997, Keld Jørn Simonsen wrote:

> Kenneth Whistler writes:

> I would rather say that the character set of 10646 is the repertoire
> of 10646 which is the characters in the codepoints of 10646. This
> is a finite repertoire, although it may differ with each
> amendment. But you can always count the characters in there.
> For Unicode it is a different story. Unicode can represent an
> undefined number of "abstract characters" which is the Unicode
> equivalent term to the ISO term "character". (I even use that
> term to clarify the difference to a "coded character").
> Unicode's repertoire is thus infinite.

I agree with others:
Unicode has exactly the same codepoints/characters, and therefore
the same repertoire, as ISO 10646. It can represent exactly
the same (unlimited) "abstract characters"/diacritic
combinations, as well as the same words, sentences,...
as ISO 10646. It may use different terminology in some
cases, but that doesn't make it different. It includes
some more detailed specifications that ISO 10646 includes
only implicitly. But even ISO 10646 speaks about how to
combine base characters and combining marks to represent
diacritic combinations,...
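
To make the point about combining sequences concrete, here is a
minimal sketch, assuming Python and its standard unicodedata module
(my choice for illustration; the variable names are my own, and
nothing here is taken from the text of either standard). It shows
that a base character plus a combining mark can stand for the same
"abstract character" as a single precomposed codepoint, and that a
combination without any precomposed codepoint is still representable.

```python
import unicodedata

# One codepoint: LATIN SMALL LETTER E WITH ACUTE.
precomposed = "\u00e9"
# Two codepoints: "e" followed by COMBINING ACUTE ACCENT.
combining = "e\u0301"

# Both spellings denote the same abstract character; NFC normalization
# maps the combining sequence onto the precomposed codepoint.
same_after_nfc = unicodedata.normalize("NFC", combining) == precomposed

# "q" with an acute accent has no precomposed codepoint, so it can
# only be written as a sequence. It is representable all the same.
q_acute = unicodedata.normalize("NFC", "q\u0301")

print(len(precomposed), len(combining), same_after_nfc, len(q_acute))
```

The number of codepoints is fixed, but the number of such sequences
is not, and that holds for both standards alike.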

> > 4. The mathematical relation (a unique and symmetric mapping function) between a
> > character repertoire and coded representations. This is synonymous with the
> > term "coded character set" as defined in 10646: "A set of unambiguous
> > rules that establishes a character set and the relationship between
> > the characters of the set and their coded representations." [The important
> > thing here is that each character is associated with a number, and each
> > numerical value is unambiguously related to a character.]
> I think there are some subtle differences here. I believe that
> the coded character set does imply a binary representation.
> All coded character sets that I know of have a binary representation.

Can you tell me what the binary representation of JIS X 0208 is?
Is the eighth bit in each byte set or not? Okay, you may say that
it is undefined; I think you said something like that in an earlier mail.
But then, what about Shift-JIS (which is a normative annex to
JIS X 0208:1997)?
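
As an illustration of the question, here is a small sketch, assuming
Python's standard codecs "euc_jp", "shift_jis" and "iso2022_jp". The
example character (U+4E9C, row 16 / cell 1 of JIS X 0208) and the
variable names are my own choice. The same row/cell position comes
out as different bytes depending on the encoding.

```python
# Example character: U+4E9C, at row 16, cell 1 of JIS X 0208.
ch = "\u4e9c"
row, cell = 16, 1

# 7-bit form: row and cell offset by 0x20, eighth bit clear.
seven_bit = bytes([0x20 + row, 0x20 + cell])
# 8-bit (EUC-style) form: the same two bytes with the eighth bit set.
eight_bit = bytes([b | 0x80 for b in seven_bit])

# What three common encodings of the same coded character produce.
euc_jp = ch.encode("euc_jp")          # eighth bit set in both bytes
shift_jis = ch.encode("shift_jis")    # bytes rearranged by a formula
iso2022_jp = ch.encode("iso2022_jp")  # 7-bit bytes inside escape sequences

print(seven_bit.hex(), eight_bit.hex(), euc_jp.hex(), shift_jis.hex())
print(iso2022_jp)
```

One position in the coded character set, at least three byte
sequences: none of them is "the" binary representation.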

I do not oppose your "imply a binary representation" so much
from a technical viewpoint, although it has its technical
problems, too. The biggest problem is from a didactic
viewpoint. If you include the binary representation, it's
easy for people in the ASCII/8-bit world to think that there
is nothing else. And that makes them ignore some important
aspects of the whole thing, with bad consequences in all
kinds of places.

> Also the numbering is not done normally, and even if you say
> there is an implied numbering, a number of coded character sets
> have smaller or bigger holes in this numbering.

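Holes are not a problem for the numbering. The relation stays
unambiguous as long as each assigned number identifies exactly
one character; unassigned numbers simply identify none.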
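
A minimal sketch of this, assuming Python's standard "iso8859_3"
codec as an example of a coded character set with unassigned
positions (the helper name decode_table is hypothetical):

```python
def decode_table(codec):
    """Map each assigned 8-bit code to its character, skipping holes."""
    table = {}
    for code in range(256):
        try:
            table[code] = bytes([code]).decode(codec)
        except UnicodeDecodeError:
            pass  # unassigned position: a hole in the numbering
    return table

table = decode_table("iso8859_3")
holes = [code for code in range(256) if code not in table]

# Invert the table; if no two codes share a character, the inverse
# has as many entries as the table, so the mapping is one-to-one.
inverse = {ch: code for code, ch in table.items()}
one_to_one = len(inverse) == len(table)

print([hex(code) for code in holes], one_to_one)
```

The holes only shrink the domain of the mapping; every assigned
number still has exactly one character, and vice versa.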

Regards, Martin.
