Re: Repertoire, encoding, and representation (Was: Charsets + encoding + codesets)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Oct 07 1997 - 17:42:12 EDT


>
> C'mon Ken, I know that Unicode is in love with 16 bits, but
> you must admit that 10646 is canonically defined as a 32 (31) bit
> coded character set.

Canonically defined as a sequence of four octets in a specified
order, with restrictions on the high bit set for the G-octet. So
yes, a 31-bit coded character set. Nobody is claiming otherwise, as
far as I know.
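
As a rough illustration of the four-octet form, here is a minimal Python sketch. It assumes the octets are written most significant first (group, plane, row, cell) and that the only restriction is a clear high bit in the first octet; the function name ucs4_octets is made up for this example and does not come from either standard.

```python
import struct

def ucs4_octets(code_point: int) -> bytes:
    """Serialize a code point as four octets, most significant octet first.

    Assumption for this sketch: valid values are 0 .. 0x7FFFFFFF, i.e. the
    high bit of the first (group) octet is never set.
    """
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit code space")
    return struct.pack(">I", code_point)

# U+00E1 becomes group 00, plane 00, row 00, cell E1.
octets = ucs4_octets(0x00E1)
```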

>
> I did not say anything about what is the default encoding in 10646.

The implication of your statement was that 10646 is to be considered
UCS-4 unless specified otherwise. That sounds like a default to me,
and has other potential implications for implementers.

>
> > The Unicode Standard can be considered a profile of 10646 that
> > designates UTF-16 as the preferred encoding scheme. In that sense it
> > clearly *does* designate a default encoding scheme, unlike 10646.
>
> That must be 2.0. Was it not different in 1.0?

Of course it is Unicode 2.0. Are you engaging in this debate without
having bothered to look at Unicode 2.0?

> Why did you not choose UTF-8 - everybody else seems to go that way?

Because UTF-8 is a lousy processing code for processes that care
about character boundaries and character semantics.
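
One way to read the point about character boundaries: in UTF-8 the number of bytes per character varies, while for the characters sampled below UTF-16 uses one 16-bit unit each. The following Python sketch only measures encoded widths; the sample characters and the helper name encoded_widths are chosen for illustration.

```python
def encoded_widths(ch: str) -> tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

samples = {
    "U+0061": "\u0061",  # LATIN SMALL LETTER A
    "U+00E1": "\u00E1",  # LATIN SMALL LETTER A WITH ACUTE
    "U+092A": "\u092A",  # DEVANAGARI LETTER PA
}

# Map each label to its (UTF-8 width, UTF-16 width) in bytes.
widths = {label: encoded_widths(ch) for label, ch in samples.items()}
```

With these samples the UTF-8 widths differ from character to character while the UTF-16 widths do not, so a process indexing by code unit has less bookkeeping to do in the UTF-16 case.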

And the last time I checked, Microsoft, IBM, Apple, Justsystems,...
between them accounted for a significant share of the software market.
I guess your definition of "everybody" is different than mine, though.

> >
> > I'll state this one more time, because Keld keeps claiming it isn't
> > so:
> >
> > The repertoire of the Unicode Standard and of ISO/IEC 10646 are
> > *exactly* the same.
>
> That is possible, but then the definitions of "repertoire"
> are different for the two specifications. "I have 3 apples and
> you have 3 oranges. We have the same." :-)

Not true.

> And what about the
> "surrogates"? These are genuine characters in Unicode
> but not so in 10646.

Keld, this is another egregious piece of disinformation. The Unicode
"surrogate characters" are exactly the same as the "RC-elements"
specified in definition 4.30 of Amendment 1 to 10646 (UTF-16).
Surrogate characters have no independent interpretation as characters--
they are only interpretable as a pair of high-surrogate + low-surrogate
codes.
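
The pairing can be sketched in Python as follows. The ranges D800..DBFF (high) and DC00..DFFF (low) and the arithmetic are those of UTF-16; the function names are invented for this example.

```python
def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF

def combine_surrogates(high: int, low: int) -> int:
    """Interpret a high-surrogate + low-surrogate pair as one code value."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        # A lone or misordered surrogate has no interpretation of its own.
        raise ValueError("not a high-surrogate + low-surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The pair D834 DD1E is interpreted as the single value 0x1D11E.
value = combine_surrogates(0xD834, 0xDD1E)
```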

> >
> > Note that canonical equivalence does *not* mean duplicate encoding of
> > characters. It means two different representations of the same abstract
> > character--representations which under most circumstances should be
> > *interpreted* the same.
>
> Ken, you are doing tricks with words. Your "represented" term is
> what others would call "encoding" of the abstract character.

Only if they were deliberately misrepresenting the intent of the
terminological distinctions.

Sequences of characters are used to represent textual data.

  U+0077 + U+006F + U+0072 + U+0064 *represents* the word "word"

  U+092A + U+0942 *represents* the Devanagari syllable "puu"

  U+006F + U+0330 *represents* the vowel "o" with creaky voice (o with tilde
                     beneath it)

None of these is an instance of an *encoding* in the sense used for
associating a repertoire of characters with numbers in a coded
character set.
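
The same three sequences can be written out in Python to make the distinction concrete: each string below is a sequence of encoded characters that represents some piece of text, and the per-character names come from the interpreter's character database, not from this message.

```python
import unicodedata

# Sequences of encoded characters, as listed above.
word = "\u0077\u006F\u0072\u0064"   # represents the word "word"
puu = "\u092A\u0942"                # represents the Devanagari syllable "puu"
creaky_o = "\u006F\u0330"           # represents "o" with tilde beneath it

def character_names(text: str) -> list[str]:
    """Name each encoded character in the sequence."""
    return [unicodedata.name(ch) for ch in text]

puu_names = character_names(puu)
creaky_names = character_names(creaky_o)
```

Each name corresponds to one encoded character; the syllable and the accented vowel exist only as sequences of them.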

>
> > Note also that canonical equivalence also does not mean exact identity.
> > If your software process is allocating buffer space, it better not
> > treat U+00E1 the same as the sequence U+0061 + U+0301, or it will
> > overrun memory.
>
> But on the semantic level, abstract character level, I understand the
> two "representations" to be equivalent by definition in Unicode.
> Am I correct?

Canonically equivalent, as defined in the conformance clause, yes. (Note
that this use of the term "canonical" is distinct from that used above
in reference to 10646's encoding forms.)
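
A short Python sketch of equivalent-but-not-identical, using the U+00E1 example from the quoted text. It relies on the normalization forms (NFC, NFD) offered by Python's unicodedata module as a convenient way to test canonical equivalence; those forms are an assumption of this sketch and are not mentioned in this message.

```python
import unicodedata

precomposed = "\u00E1"        # a with acute as one encoded character
decomposed = "\u0061\u0301"   # a followed by combining acute accent

# The two representations are not identical as sequences...
identical = precomposed == decomposed

# ...but they normalize to the same thing, i.e. they are canonically equivalent.
equivalent = (
    unicodedata.normalize("NFD", precomposed)
    == unicodedata.normalize("NFD", decomposed)
)

# Storage differs, which is what matters when sizing a buffer.
size_precomposed = len(precomposed.encode("utf-16-be"))  # bytes
size_decomposed = len(decomposed.encode("utf-16-be"))    # bytes
```

A process that sizes its buffer for one representation and then writes the other is the overrun case described in the quoted text.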

>
> > Keld is, of course, correct that the repertoire of abstract characters
> > is open. I just gave an example of an abstract character that could have
> > meaningful use in the transcription of a language, but it has never (to
> > my knowledge) been brought up before or discussed as a candidate to
> > be *encoded* as a character in 10646. That is not because it has two
> > accents; there are already such characters encoded in 10646, e.g.
> > U+01DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON. But the nature
> > of the Latin script is that it allows relatively free application of
> > accent marks to letter baseforms, either as diacritics to create new
> > "letters" for a particular orthography, or as accents to modify in various
> > ways the sounds represented by letters.
>
> So Unicode has an open repertoire of abstract characters, while
> 10646 has a finite repertoire of (abstract) characters?

No, start over. The repertoire of abstract characters is open. That is
a class of things to be represented, either by encoded characters or
by sequences of encoded characters. That is the class of things that
standards committees argue over, and its membership is non-self-evident,
whether you believe in the use of combining marks to represent Latin
letters with accents or not.

I repeat in plain English:

Unicode's repertoire of characters is exactly the same as 10646's
repertoire of characters.

--Ken

>
> Keld
>


