Re: An Absurdly Brief Introduction to Unicode (was Re: Perception ...)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Feb 23 2001 - 18:06:03 EST


Peter expostulated:

> I think Mark is either temporarily off his game, or else he's obfuscating
> terminology. "Abstract character" is defined in definition D3 on p. 40 of
> TUS3.0. The relationship between abstract characters and codepoints is
> defined in UTR17: "An abstract character is defined to be in a coded
> character set if the coded character set maps from it to an integer. That
> integer is said to be the code point for the abstract character." UTR17
> doesn't make this clear, but the mapping between abstract characters and
> integers is a bijection, i.e. 1:1. Thus, it is impossible for multiple
> abstract characters (as here defined) to map to a single codepoint, or for
> a single abstract character to map to multiple codepoints.

As with everything in the Unicode Standard, simple things get complicated,
and terminology slips away from us.

The current glossary entry, it is true, defines "abstract character" as:

  "A unit of information used for the organization, control, or
   representation of textual data."

That is deliberately chosen to be identical to the SC2 definition of
"character", so people will know what we are talking about.

In that sense, there is a certain tautology involved. A character encoding
associates numbers with characters to encode them. The "encoding" is the
number associated with the character. The "encoded character" is the
character with its associated encoding. And the "character" is that which
was encoded.

So in this very limited sense of character (i.e., what the Unicode Standard
terms "abstract character"), there never could be other than a one-to-one
relationship, and each abstract character has exactly one encoding.

And the logical implication of this is that if I generated a character
encoding that encoded the Latin small letter a 16 different times at
different encoding points, there would be 16 different abstract characters
for the representation of the letter a, rather than 16 different encodings
for the same abstract character.

However, ...

This usage has always run counter to the sense that we all have that there
are entities "out there" to be encoded, and that if at all possible, for
usability of the standard, each one should only be encoded once. Encoding
"a" 16 times in a character encoding standard might literally create 16
abstract characters, but it doesn't twist reality with it to also install
16 letters "a" into the Latin alphabet. Call this concept, if you will,
"abstractable character", if that will help in distinguishing it from the
definition of "abstract character" currently in the glossary.

The Unicode Character Encoding Model has been using the term "abstract
character" in this latter sense, as an element of a repertoire, abstracted
prior to any concern for encoding per se.

If you look at it this way, it is clearly possible for one "abstractable
character" to end up being encoded twice, or even more times in the
standard. That is, in fact, what singleton canonical mappings are all
about. They are determinations by the Unicode Technical Committee that
a character represents a *duplicate* encoding, for whatever legacy
compatibility reasons, of the same "abstractable character". If done
from scratch, the standard would delete them all as duplications in the
encoding, but of course we cannot delete anything -- even encoded characters
determined to be duplicates.

This sense of "abstractable character", i.e., a member of the set of
entities in the abstract repertoire that is eligible for encoding as
a character, is what Mark had in mind. And, in fact, if you look at Figure
2-6, on page 19 of the standard, you will see exactly the kind of usage
that Mark was drawing on, using the very example of {a-with-ring} that
has drawn fire here.

Re Peter's earlier assertion:

> {a with ring above} is not an abstract character according to the
> definition used in the standard. It may be a grapheme in one or more
> writing systems; it may be any number of objects, but it is not an abstract
> character in the Unicode repertoire. LATIN CAPITAL LETTER A WITH RING ABOVE
> and ANGSTROM SIGN are abstract characters, and are different. They happen
> to be canonically equivalent, but that is beside the point and does not
> mean that they are not different abstract characters.

Actually, {a with ring above} is an abstract character in either of
the two senses I have talked about above.

{a with ring above} is an (abstract character)1, in that it is a
"unit of information used for the ... representation of textual data".
In the Unicode Standard, it has an encoding of 0x00C5 and a name of
"LATIN CAPITAL LETTER A WITH RING ABOVE". And the encoded character is:

    U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

{a with ring above} is an (abstract[able] character)2, in that it is
a member of the abstract repertoire of entities that are encoded as
characters in the Unicode Standard. And in the Unicode Standard, that
abstractable character has two encodings, i.e., is associated with
two encoded characters:

    U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
    U+212B ANGSTROM SIGN

That that double encoding is of the *same* abstract[able] character is
a determination by the UTC, and is recorded by the singleton canonical
mapping in the data table.
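
For anyone who wants to see that mapping rather than take it on faith, here
is a minimal sketch using Python's standard unicodedata module. It assumes
the interpreter's bundled character database, which is a later version of
the data than 3.0 but carries the same mapping for these two characters.
The variable names are mine.

```python
import unicodedata

A_RING = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212B"  # ANGSTROM SIGN

# The singleton canonical mapping: the decomposition field for U+212B is a
# single code point, with no <compat>-style tag in front of it.
singleton_mapping = unicodedata.decomposition(ANGSTROM)

# Canonical normalization folds the duplicate into U+00C5 (NFC), and both
# characters end up as the same decomposed sequence (NFD).
nfc_folds_to_a_ring = unicodedata.normalize("NFC", ANGSTROM) == A_RING
nfd_forms_match = (
    unicodedata.normalize("NFD", ANGSTROM)
    == unicodedata.normalize("NFD", A_RING)
)

print(singleton_mapping, nfc_folds_to_a_ring, nfd_forms_match)
```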

I guess this is just one more piece of the terminological mess regarding
character encoding that we will have to wrestle with when we work on
retrofitting the Character Encoding Model onto the text of the standard
for Unicode 4.0.

>
> This seems so obvious to me, and I'm very surprised to hear this coming
> from no less than Messrs. Davis and Whistler. It must mean I'm missing
> something, but I'm sure I'm not. What's in the water these days out in the
> Bay area?

Fluoride. That must be it.

Mike Brown said:

> From mbrown@webb.net Fri Feb 23 13:11 PST 2001
> To: "'Kenneth Whistler'" <kenw>
>
> Hmm. I was under the impression that LATIN CAPITAL LETTER A WITH RING ABOVE
> and ANGSTROM SIGN are two distinct characters with distinct semantics that
> happen to have canonical equivalence in Unicode because they are visually
> indistinguishable. I didn't think this interchangeability made them
> necessarily be the same single abstract character "a with ring above" as
> your example states. Am I mistaken?
>
> - Mike
> ____________________________________________________________________
> Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
> My XML/XSL resources: http://skew.org/xml/
>
>
> PS- I can't cc the list at the moment because my employer changed my address
> on me; my posts won't go through. Feel free to forward upon reply.

Visual indistinguishability is insufficient grounds. On that alone, we
would end up equating Latin O, Cyrillic O, and Greek O, but of course, they
have been long treated as distinct abstractable characters, and are
distinct (abstract characters)1 in the Unicode Standard, as well. No
canonical equivalence is made, nor should there be.
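
The contrast with the three O's is easy to check the same way. Again a
minimal sketch against Python's unicodedata module, assuming the
interpreter's bundled character database; the dictionary name is mine.

```python
import unicodedata

# Three visually near-identical capital letters from three scripts.
lookalike_os = {
    "LATIN CAPITAL LETTER O": "\u004F",
    "GREEK CAPITAL LETTER OMICRON": "\u039F",
    "CYRILLIC CAPITAL LETTER O": "\u041E",
}

# None of them has a decomposition mapping of any kind ...
decompositions = {
    name: unicodedata.decomposition(char)
    for name, char in lookalike_os.items()
}

# ... so canonical normalization leaves three distinct characters.
distinct_after_nfc = {
    unicodedata.normalize("NFC", char) for char in lookalike_os.values()
}

print(decompositions, len(distinct_after_nfc))
```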

But the ANGSTROM SIGN is simply a compatibility character pulled into the
standard because of a mistaken disunification of function in one of the
source Asian standards which was grandfathered into Unicode for legacy
convertibility. SI units are simply Latin letters. There is no separate
"A" for ampere, or "C" for coulomb, or "s" for second, or "k" for "kilo-"
and so on. The ANGSTROM SIGN was just a standardization mistake for "Å"
in this collection of SI units in an Asian standard, comparable to the
cruft in the 33XX block, but standing out merely because it is a single
letter rather than a square block of letters.

This is completely comparable to the situation for CJK Compatibility
characters. Look at the duplicates from the Korean Standard, e.g.,
U+F907, U+F908. Those are not "different" characters. They are the
*same* characters and are the same as the unified Han character for "turtle",
i.e. U+9F9C. Or trying to put it more precisely, there is one abstractable
Han character here, but it got encoded 3 times. And two of those instances
are then labelled with a canonical equivalence that marks them as
duplicates and points to the "real" encoded character.

(And if the Han quibblers get hung up by the glyphic variability that is
notorious for the "turtle" character, then by all means consider simpler
examples like U+F963 "north", U+F967 "not", or U+F981 "woman", where
Z-variation is not even an issue.)
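
The Han duplicates mentioned above can be listed with their targets in the
same fashion. A minimal sketch, assuming the character database bundled
with the Python interpreter; the list and dictionary names are mine.

```python
import unicodedata

# CJK compatibility ideographs cited above.
compat_duplicates = ["\uF907", "\uF908", "\uF963", "\uF967", "\uF981"]

# Canonical normalization replaces each duplicate with the unified Han
# character that its singleton canonical mapping points to.
unified_targets = {
    f"U+{ord(char):04X}": f"U+{ord(unicodedata.normalize('NFC', char)):04X}"
    for char in compat_duplicates
}

for duplicate, target in unified_targets.items():
    print(duplicate, "->", target)
```

Both U+F907 and U+F908 land on U+9F9C, which is the "encoded 3 times" case.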

--Ken


