Re: (TC304WG4.50) Charset vs. codeset

From: Martin J. Dürst (mduerst@ifi.unizh.ch)
Date: Sun Oct 05 1997 - 10:44:45 EDT


On Sat, 4 Oct 1997, Keld Jørn Simonsen wrote:

> I had a few comments to Kenneth Whistler's recent writing:

> > Unicode is an encoded character set.
>
> I am not so sure about that. It violates the general principle
> that an encoded character set encodes each (abstract) character
> in only one way.
>
> > ISO/IEC 10646 is an encoded character set.
>
> True.

You are probably referring to cases like A + combining ring above
vs. A with ring above (sorry, I don't remember the official names).

In that sense, Unicode and ISO/IEC 10646 are very much the same.
Both allow the use of combining marks. Unicode is a little more
explicit about them, but it doesn't allow anything more than
ISO/IEC 10646 does. ISO/IEC 10646 doesn't explicitly define
equivalences, and therefore, in theory, it's possible to say that
these are different abstract characters (or combinations of them).
But Unicode can say the same, namely that they are different
abstract characters/combinations. That the difference shouldn't be
visible to the user is patently obvious in both cases.
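
To make the two spellings concrete, here is a minimal Python sketch.
It assumes the normalization forms (NFC/NFD) exposed by Python's
standard unicodedata module, which are not something either standard
is claimed to define in this message; the variable names are only
illustrative.

```python
import unicodedata

# Two spellings of what a user sees as the same letter.
precomposed = "\u00C5"   # single code point: A with ring above
combining = "A\u030A"    # base letter A followed by a combining ring above

# As sequences of code points, the two spellings are different...
differ_raw = precomposed != combining

# ...but normalization maps each spelling onto the other.
equal_nfc = unicodedata.normalize("NFC", combining) == precomposed
equal_nfd = unicodedata.normalize("NFD", precomposed) == combining

# Official character names, looked up rather than remembered.
precomposed_name = unicodedata.name(precomposed)
combining_names = [unicodedata.name(c) for c in combining]
```

Comparing the raw strings reports a difference, while comparing after
normalization does not, which is one way of keeping the difference
invisible to the user.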

> > UCS-2, UCS-4, UTF-8, and UTF-16 are character encoding schemes (CES),
> > specifying the actual byte usage for a particular form of use
> > of the encoded characters from 10646.
>
> I would only say that for UTF-8 and UTF-16. UCS-2 and UCS-4 are
> coded character sets, not encoding schemes.

All of them are pretty much the same. UCS-2 and UCS-4, if they
are seen as mappings between characters and integers, are
coded character sets. If they are seen as mappings between
characters and bit combinations, or even more so as bit/byte
streams representing characters, they also include a character
encoding scheme. Because of the BOM/endianness issue, that
CES is in fact not even entirely trivial.
None of the above is a generic character encoding scheme that
can be applied to any coded character set whatsoever (although
in theory each could be); in every case the coded character set
is clearly defined.
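
The two views can be sketched in Python as follows. This is only an
illustration under an assumption: Python has no UCS-2 or UCS-4
codecs, so the UTF-16 and UTF-32 codecs stand in for them, which
gives the same bytes for a character in the Basic Multilingual Plane
such as the one used here.

```python
import codecs

ch = "\u00C5"  # A with ring above

# Coded character set view: character -> integer.
code_point = ord(ch)

# Encoding scheme view: character -> bytes, where byte order matters.
two_byte_be = ch.encode("utf-16-be")    # big-endian, no BOM
two_byte_le = ch.encode("utf-16-le")    # little-endian, no BOM
four_byte_be = ch.encode("utf-32-be")   # big-endian, no BOM

# The plain "utf-16" codec prepends a BOM and uses the platform's
# native byte order, so the exact bytes depend on the machine.
with_bom = ch.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
round_trip = with_bom.decode("utf-16")
```

The integer is the same in every case; only the byte stream changes,
and without a BOM or an out-of-band agreement the receiver cannot
tell the two 16-bit byte orders apart.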

> > A MIME "charset" is a mapping from byte (~octet) values to characters.
> > Because different character encoding schemes change the actual byte
> > values that are associated with characters, there has to be one
> > MIME charset for UTF-8 and another MIME charset for UTF-16, even
> > though both refer to the same encoded character set, namely Unicode.
>
> Well, technically MIME definitions refer to 10646.

When and where? The MIME definition for UTF-8, as far as I remember,
refers to both Unicode and 10646.
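
Whichever document the definition cites, the quoted point about a
charset being a mapping from octets to characters can be sketched in
Python. The codec names below are Python's own labels and are assumed
here to correspond to the MIME charset labels; the variable names are
illustrative.

```python
# The same octets mean different characters under different labels.
octets = b"\xc3\x85"
as_utf8 = octets.decode("utf-8")         # one character
as_latin1 = octets.decode("iso-8859-1")  # two characters

# The same character needs different octets under different labels,
# even when both labels refer to the same coded character set.
utf8_octets = "\u00C5".encode("utf-8")
utf16_octets = "\u00C5".encode("utf-16-be")
same_character = utf8_octets.decode("utf-8") == utf16_octets.decode("utf-16-be")
```

This is why UTF-8 and UTF-16 need separate labels: a receiver that
applies the wrong mapping to the octets gets the wrong characters.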

Regards, Martin.


