Re: (TC304WG4.50) Charset vs. codeset

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Tue Oct 07 1997 - 15:43:35 EDT


Martin J. Dürst writes:

> On Sat, 4 Oct 1997, Keld Jørn Simonsen wrote:
>
> > I had a few comments to Kenneth Whistler's recent writing:
>
> > > Unicode is an encoded character set.
> >
> > I am not so sure about that. It violates the general principle
> > that an encoded character set encodes each (abstract) character
> > in only one way.
> >
> > > ISO/IEC 10646 is an encoded character set.
> >
> > True.
>
> You are probably referring to cases like A + combining ring above
> vs. A with ring above (sorry I don't remember the official names).
>
> In that sense, both Unicode and ISO/IEC 10646 are very much the
> same. Both include the possibilities to use combining marks.
> Unicode is a little bit more explicit about them. But it doesn't
> allow more things than ISO/IEC 10646. ISO/IEC 10646 doesn't explicitly
> define equivalences, and therefore in theory, it's possible to
> say that these are different abstract characters (or combinations
> of them). But Unicode can say the same, namely that they are
> different abstract characters/combinations. That the difference
> shouldn't be visible to the user is patently obvious in both cases.
>
Yes, Unicode and 10646 are very much the same, but there are subtle
differences because of their different definitions and terminology.
My understanding from earlier postings (some months ago) from Ken
Whistler is that the precomposed character and the base letter +
combining accent sequence represent the same abstract character in
Unicode. This is not the case in 10646, as clearly indicated e.g. in
4.13: "a composite sequence is not a character and therefore is not a
member of the repertoire of ISO/IEC 10646."
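
To make the two representations concrete, here is a minimal Python
sketch. It is only an illustration: the variable names are mine, and
the normalization forms (NFC/NFD) from Python's standard unicodedata
module are an assumption brought in for the demonstration, not
something this thread relies on.

```python
import unicodedata

# Two ways to write "A with ring above".
precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
composite = "A\u030A"    # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

# As sequences of coded characters the two strings differ ...
same_code_points = precomposed == composite

# ... but normalization maps each form onto the other.
same_after_nfc = unicodedata.normalize("NFC", composite) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == composite

print(same_code_points, same_after_nfc, same_after_nfd)
```

The first comparison corresponds to treating the composite sequence as
something other than the single coded character; the other two
correspond to treating both forms as the same abstract character.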

> > > UCS-2, UCS-4, UTF-8, and UTF-16 are character encoding schemes (CES),
> > > specifying the actual byte usage for a particular form of use
> > > of the encoded characters from 10646.
> >
> > I would only say that for UTF-8 and UTF-16. UCS-2 and UCS-4 are
> > coded character sets, not encoding schemes.
>
> All of them are pretty much the same. UCS-2 and UCS-4, if they
> are seen as mappings between characters and integers, are
> coded character sets. If they are seen as mappings between
> characters and bit combinations, or even more as bit/byte
> streams representing characters, they also include a character
> encoding scheme. Because of the BOM/endianness issue, that
> CES is in fact not even exactly trivial.

I agree that you could call both UCS-2 and UCS-4 coded character
sets and also character encodings. I don't think they are
character encoding schemes, in the sense of describing a
function to combine or encode a coded character set. They
both specify a repertoire and a way to encode it.
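
As a rough illustration of the integer view versus the byte view, the
following Python sketch takes one code value and serializes it several
ways. The variable names are hypothetical, and using struct to stand
in for big- and little-endian UCS-2/UCS-4 serialization is my own
simplification, not text from either standard.

```python
import struct

code_value = 0x00C5  # A WITH RING ABOVE, seen as an integer

# UCS-2 and UCS-4 as fixed-width integers; byte order must be chosen.
ucs2_be = struct.pack(">H", code_value)   # 2 octets, big-endian
ucs2_le = struct.pack("<H", code_value)   # 2 octets, little-endian
ucs4_be = struct.pack(">I", code_value)   # 4 octets, big-endian

# UTF-8 transforms the same value into a variable-length octet sequence.
utf8 = chr(code_value).encode("utf-8")

# The byte order mark U+FEFF lets a reader detect which order was used.
bom_be = "\ufeff".encode("utf-16-be")
bom_le = "\ufeff".encode("utf-16-le")

print(ucs2_be.hex(), ucs2_le.hex(), ucs4_be.hex(), utf8.hex())
```

The integer 0x00C5 is the same in every case; only the octets on the
wire change, which is where the BOM/endianness question comes in.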

In the terminology that I am promoting, a coded character set always
implies an encoding. This is consistent with ISO terminology for
the term "coded character set".

> For all of the above, they are not character encoding schemes
> that can be applied to any coded character set whatsoever
> (although in theory they could), but for which the coded
> character set is clearly defined.
>
Agreed.

> > > A MIME "charset" is a mapping from byte (~octet) values to characters.
> > > Because different character encoding schemes change the actual byte
> > > values that are associated with characters, there has to be one
> > > MIME charset for UTF-8 and another MIME charset for UTF-16, even
> > > though both refer to the same encoded character set, namely Unicode.
> >
> > Well, technically MIME definitions refer to 10646.
>
> When and where? The MIME definition for UTF-8, as far as I remember,
> refers to both Unicode and 10646.

Yes, it refers to them both, but 10646 UCS-4 is the one that it
is defined on. See RFC 2044.
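
To show what "a mapping from byte values to characters" means in
practice, here is a small Python sketch using the standard email
package. The encoded-word below is just a sample header value in the
RFC 2047 form, and the variable names are my own; treat it as an
illustration under those assumptions.

```python
from email.header import decode_header

# A sample RFC 2047 encoded-word labelled with the charset iso-8859-1.
encoded_word = "=?iso-8859-1?Q?Martin_J=2E_D=FCrst?="

# decode_header undoes the Q encoding and reports the charset label.
raw_bytes, charset = decode_header(encoded_word)[0]

# The charset label tells us how to map the octets to characters.
name = raw_bytes.decode(charset)

# The same octets under another charset label do not give the same text:
# the lone octet 0xFC is not a valid UTF-8 sequence.
try:
    raw_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

print(charset, name, valid_as_utf8)
```

The octets are identical in both attempts; only the charset label, and
with it the mapping, differs.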

Keld


