Re: (TC304WG4.50) Charset vs. codeset

From: Martin J. Dürst (mduerst@ifi.unizh.ch)
Date: Wed Oct 08 1997 - 09:33:47 EDT


On Tue, 7 Oct 1997, Keld Jørn Simonsen wrote:

> Yes, Unicode and 10646 are very much the same, but there are subtle
> differences, because of their different definitions and terminology.

There are subtle, and sometimes not so subtle, differences in
terminology, but these don't cause differences in actual facts.

> My understanding from earlier postings (some months ago) from Ken
> Whistler is that the precomposed and the base-letter + combining accent
> represent the same abstract character in Unicode.

They are equivalences, yes. But this is about the same as
saying that "A" and "a" represent the same (case-folded) character.
If you have a term "case-folded character", or you define "character"
to mean "case-folded character", then in that environment, the
above statement is a correct factual statement, without being
in any factual conflict whatsoever with other standards that
use the word "character" in a different sense.
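To make the analogy concrete, here is a minimal sketch in Python. It
assumes the standard unicodedata module and its normalize() function as
one way of testing canonical equivalence; the helper names
canonically_equal and case_folded_equal are made up for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
composite = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT


def canonically_equal(a, b):
    # Compare after canonical decomposition (NFD), so that a precomposed
    # letter and its base + combining sequence come out the same.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def case_folded_equal(a, b):
    # Compare after case folding, so that "A" and "a" come out the same.
    return a.casefold() == b.casefold()


# As sequences of coded characters, the two spellings differ.
print(precomposed == composite)                   # False
print(len(precomposed), len(composite))           # 1 2

# Under the equivalence relation, they are "the same".
print(canonically_equal(precomposed, composite))  # True

# Same pattern for case: different characters, equal when case-folded.
print("A" == "a")                                 # False
print(case_folded_equal("A", "a"))                # True
```

In both cases the answer to "same character?" depends only on which
comparison is meant, not on any disagreement about the data.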

> This is not
> the case in 10646, as clearly indicated eg in 4.13. "a composite
> sequence is not a character and therefore is not a member of the
> repertoire of ISO/IEC 10646."

No problem with that. The above is a correct statement (by the
fact that it is normative) for ISO/IEC 10646 in the terms of
ISO/IEC 10646. There is nothing in Unicode that contradicts this,
i.e. nothing that says "a composite sequence (in the sense of
ISO/IEC 10646) is a character (in the sense of ISO/IEC 10646) and
therefore is a member of the repertoire (in the sense of
ISO/IEC 10646) of ISO/IEC 10646 (or Unicode)."

> > > > UCS-2, UCS-4, UTF-8, and UTF-16 are character encoding schemes (CES),
> > > > specifying the actual byte usage for a particular form of use
> > > > of the encoded characters from 10646.
> > >
> > > I would only say that for UTF-8 and UTF-16. UCS-2 and UCS-4 are
> > > coded character sets, not encoding schemes.
> >
> > All of them are pretty much the same. UCS-2 and UCS-4, if they
> > are seen as mappings between characters and integers, are
> > coded character sets. If they are seen as mappings between
> > characters and bit combinations, or even more as bit/byte
> > streams representing characters, they also include a character
> > encoding scheme. Because of the BOM/endianness issue, that
> > CES is in fact not even exactly trivial.
>
> I agree that you could call both UCS-2 and UCS-4 coded character
> sets and also character encodings. I don't think they are
> character encoding schemes, in the sense that they describe a
> function to combine or encode a coded character set.

Well, if you take the BOM (which is not part of the actual
text if it appears at the beginning), you could apply that to
any sequence of (half-)words streamed into bytes.
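
A minimal sketch of that point in Python, assuming 16-bit units and
only the standard codecs module for the BOM constants. The function
names serialize_ucs2 and deserialize_ucs2 are hypothetical. The mapping
from characters to integers is the coded character set; turning those
integers into bytes, with a BOM to signal byte order, is the extra step.

```python
import codecs


def serialize_ucs2(text, byteorder):
    """Stream 16-bit code values into bytes, preceded by a BOM (U+FEFF)."""
    out = bytearray()
    for unit in [0xFEFF] + [ord(ch) for ch in text]:
        if unit > 0xFFFF:
            raise ValueError("code value does not fit in 16 bits")
        out += unit.to_bytes(2, byteorder)
    return bytes(out)


def deserialize_ucs2(data):
    """Read the byte order from the leading BOM, then drop the BOM."""
    if data[:2] == codecs.BOM_UTF16_BE:
        byteorder = "big"
    elif data[:2] == codecs.BOM_UTF16_LE:
        byteorder = "little"
    else:
        raise ValueError("no BOM at the beginning")
    # The BOM is not part of the actual text, so decoding starts at byte 2.
    return "".join(
        chr(int.from_bytes(data[i:i + 2], byteorder))
        for i in range(2, len(data), 2)
    )


text = "A\u00e9"
print([hex(ord(ch)) for ch in text])    # ['0x41', '0xe9'] (characters -> integers)
print(serialize_ucs2(text, "big"))      # b'\xfe\xff\x00A\x00\xe9'
print(serialize_ucs2(text, "little"))   # b'\xff\xfeA\x00\xe9\x00'
```

The same integers give two different byte streams, and the reader needs
the BOM (or outside knowledge) to tell them apart.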

> In the terminology that I am promoting, a coded character set always
> implies an encoding. This is consistent with ISO terminology for
> the term "coded character set".

But it is easier and more straightforward to do it otherwise,
and better suits the actual facts if you look at many cases.

Regards, Martin.


