Re: An Absurdly Brief Introduction to Unicode (was Re: Perception ...)

From: Mark Davis (markdavis34@home.com)
Date: Sat Feb 24 2001 - 16:02:21 EST


Ken has done a nice job of fleshing out the issues. I would add a bit to
that.

The glossary entry for "abstract character", as he points out, was inherited
from 10646.

"Abstract Character. A unit of information used for the organization,
control, or representation of textual data. (See Definition D3 in Section
3.3, Characters and Coded Representations.)"
[http://www.unicode.org/glossary/#abstract_character]

It is, of course, a *completely and utterly vacuous* definition.

According to those criteria, a 'bit' would also be an abstract character: It
is a unit of information. It is used for the representation of textual data
(without bits we wouldn't be able to represent it). Voila!

The problem is, there is no notion of what the "unit of information" is:
could be a bit, could be a magazine, could be a library, could be a law
regulating (controlling) encryption of (textual) data, could be an XML
element (controlling display of text), could be practically anything! One
cannot draw the conclusion from this definition that there is a 1:1
relationship between abstract characters and code points. As a matter of
fact, from this definition, one cannot draw many conclusions at all.

In the Unicode Standard and technical reports, we don't say that an
abstract character must have a 1:1 relationship with a code point: where we
discuss it, we say precisely the opposite. Look at Figure 2-6 and the
surrounding text. The commentary around the definitions D3-D5 also points
that out, as does much of the text in UTR 17, section 2.2. There may be some
points where the wording is imprecise, and could be misleading, but the
intent and direction are clear. If you find text that seems to imply a 1:1
relationship, then please report it to errata@unicode.org.
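
As a rough illustration of the point above, here is a minimal sketch using Python's standard unicodedata module (my choice of tool for the example, not anything the standard prescribes). It shows one abstract character, {A with ring above}, reachable through three different code point sequences that canonical normalization folds together. The variable names are hypothetical.

```python
import unicodedata

# Three encoded forms that canonical normalization treats as equivalent.
precomposed = "\u00C5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"         # ANGSTROM SIGN
combining = "\u0041\u030A"  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

forms = [precomposed, angstrom, combining]

# The raw code point sequences are all different...
distinct_sequences = len(set(forms))

# ...but NFC and NFD each collapse them to a single form.
nfc_forms = {unicodedata.normalize("NFC", s) for s in forms}
nfd_forms = {unicodedata.normalize("NFD", s) for s in forms}

print(distinct_sequences, len(nfc_forms), len(nfd_forms))
```

So the mapping from abstract character to code points is one-to-many here, which is the opposite of a 1:1 relationship.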

Interpreting "abstract characters" as "the things encoded in a character
set" is not supported by the definition. Moreover, as well as being
completely circular, it drastically violates our expectations. If you were
not thinking of character sets, would you say that the language tag
introducer is an abstract character? The right-to-left mark? Is the "flip the
character over and turn it inside out control" simply an abstract character
waiting to be encoded? If Unicode decides to assign a code point to represent
my car -- not the image of my car, or a textual representation of my car,
but simply to represent my car -- does that suddenly make my car an abstract
character?

I think not.

As Ken says, we have let sleeping dogs lie as far as deviating from the
formal definition used by 10646; however, if this is causing people to
misinterpret the standard we should work with WG2 to come up with a useful,
valid definition.

Mark

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: "Unicode List" <unicode@unicode.org>
Cc: <unicode@unicode.org>; <kenw@sybase.com>
Sent: Friday, February 23, 2001 14:44
Subject: Re: An Absurdly Brief Introduction to Unicode (was Re: Perception ...)

> Peter expostulated:
>
> > I think Mark is either temporarily off his game, or else he's obfuscating
> > terminology. "Abstract character" is defined in definition D3 on p. 40 of
> > TUS3.0. The relationship between abstract characters and codepoints is
> > defined in UTR17: "An abstract character is defined to be in a coded
> > character set if the coded character set maps from it to an integer. That
> > integer is said to be the code point for the abstract character." UTR17
> > doesn't make this clear, but the mapping between abstract characters and
> > integers is a bijection, i.e. 1:1. Thus, it is impossible for multiple
> > abstract characters (as here defined) to map to a single codepoint, or for
> > a single abstract character to map to multiple codepoints.
>
> As for everything in the Unicode Standard, simple things get complicated,
> and terminology slips away from us.
>
> The current glossary entry, it is true, defines "abstract character" as:
>
> "A unit of information used for the organization, control, or
> representation of textual data."
>
> That is deliberately chosen to be identical to the SC2 definition of
> "character", so people will know what we are talking about.
>
> In that sense, there is a certain tautology involved. A character encoding
> associates numbers with characters to encode them. The "encoding" is the
> number associated with the character. The "encoded character" is the
> character with its associated encoding. And the "character" is that which
> was encoded.
>
> So in this very limited sense of character (i.e., what the Unicode Standard
> terms "abstract character"), there never could be other than a one-to-one
> relationship, and each abstract character has exactly one encoding.
>
> And the logical implication of this is that if I generated a character
> encoding that encoded the Latin small letter a 16 different times at
> different encoding points, there would be 16 different abstract characters
> for the representation of the letter a, rather than 16 different encodings
> for the same abstract character.
>
> However, ...
>
> This usage has always run counter to the sense that we all have that there
> are entities "out there" to be encoded, and that if at all possible, for
> usability of the standard, each one should only be encoded once. Encoding
> "a" 16 times in a character encoding standard might literally create 16
> abstract characters, but it doesn't twist reality with it to also install
> 16 letters "a" into the Latin alphabet. Call this concept, if you will,
> "abstractable character", if that will help in distinguishing it from the
> definition of "abstract character" currently in the glossary.
>
> The Unicode Character Encoding Model has been using the term "abstract
> character" in this latter sense, as an element of a repertoire, abstracted
> prior to any concern for encoding per se.
>
> If you look at it this way, it is clearly possible for one "abstractable
> character" to end up being encoded twice, or even more times in the
> standard. That is, in fact, what singleton canonical mappings are all
> about. They are determinations by the Unicode Technical Committee that
> a character represents a *duplicate* encoding, for whatever legacy
> compatibility reasons, of the same "abstractable character". If done
> from scratch, the standard would delete them all as duplications in the
> encoding, but of course we cannot delete anything -- even encoded
> characters determined to be duplicates.
>
> This sense of "abstractable character", i.e., a member of the set of
> entities in the abstract repertoire that is eligible for encoding as
> a character, is what Mark had in mind. And, in fact, if you look at Figure
> 2-6, on page 19 of the standard, you will see exactly the kind of usage
> that Mark was drawing on, using the very example of {a-with-ring} that
> has drawn fire here.
>
> Re Peter's earlier assertion:
>
> > {a with ring above} is not an abstract character according to the
> > definition used in the standard. It may be a grapheme in one or more
> > writing systems; it may be any number of objects, but it is not an abstract
> > character in the Unicode repertoire. LATIN CAPITAL LETTER A WITH RING ABOVE
> > and ANGSTROM SIGN are abstract characters, and are different. They happen
> > to be canonically equivalent, but that is beside the point and does not
> > mean that they are not different abstract characters.
>
> Actually, {a with ring above} is an abstract character in either of
> the two senses I have talked about above.
>
> {a with ring above} is an (abstract character)1, in that it is a
> "unit of information used for the ... representation of textual data".
> In the Unicode Standard, it has an encoding of 0x00C5 and a name of
> "LATIN CAPITAL LETTER A WITH RING ABOVE". And the encoded character is:
>
> U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
>
> {a with ring above} is an (abstract[able] character)2, in that it is
> a member of the abstract repertoire of entities that are encoded as
> characters in the Unicode Standard. And in the Unicode Standard, that
> abstractable character has two encodings, i.e., is associated with
> two encoded characters:
>
> U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
> U+212B ANGSTROM SIGN
>
> That that double encoding is of the *same* abstract[able] character is
> a determination by the UTC, and is recorded by the singleton canonical
> mapping in the data table.
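
A small sketch of what such a data-table entry looks like from code, assuming Python's standard unicodedata module as a convenient view of the character database. The helper name singleton_canonical_target is made up for illustration. It treats a decomposition field holding exactly one code point and no compatibility tag as a singleton canonical mapping.

```python
import unicodedata

def singleton_canonical_target(ch):
    """Return the one character that ch canonically maps to, or None."""
    fields = unicodedata.decomposition(ch).split()
    # Compatibility mappings begin with a tag such as "<compat>";
    # canonical mappings list bare hexadecimal code points only.
    if len(fields) == 1 and not fields[0].startswith("<"):
        return chr(int(fields[0], 16))
    return None

# ANGSTROM SIGN carries a singleton mapping to U+00C5.
angstrom_target = singleton_canonical_target("\u212B")

# U+00C5 itself decomposes to two code points (A + combining ring),
# so it has no singleton mapping.
a_ring_target = singleton_canonical_target("\u00C5")

print(angstrom_target == "\u00C5", a_ring_target)
```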
>
> I guess this is just one more piece of the terminological mess regarding
> character encoding that we will have to wrestle with when we work on
> retrofitting the Character Encoding Model onto the text of the standard
> for Unicode 4.0.
>
> >
> > This seems so obvious to me, and I'm very surprised to hear this coming
> > from no less than Messrs. Davis and Whistler. It must mean I'm missing
> > something, but I'm sure I'm not. What's in the water these days out in the
> > Bay area?
>
> Fluoride. That must be it.
>
> Mike Brown said:
>
> > From mbrown@webb.net Fri Feb 23 13:11 PST 2001
> > To: "'Kenneth Whistler'" <kenw>
> >
> > Hmm. I was under the impression that LATIN CAPITAL LETTER A WITH RING ABOVE
> > and ANGSTROM SIGN are two distinct characters with distinct semantics that
> > happen to have canonical equivalence in Unicode because they are visually
> > indistinguishable. I didn't think this interchangeability made them
> > necessarily be the same single abstract character "a with ring above" as
> > your example states. Am I mistaken?
> >
> > - Mike
> > ____________________________________________________________________
> > Mike J. Brown, software engineer at My XML/XSL resources:
> > webb.net in Denver, Colorado, USA http://skew.org/xml/
> >
> >
> > PS- I can't cc the list at the moment because my employer changed my address
> > on me; my posts won't go through. Feel free to forward upon reply.
>
> Visual indistinguishability is insufficient grounds. On that alone, we
> would end up equating Latin O, Cyrillic O, and Greek O, but of course, they
> have long been treated as distinct abstractable characters, and are
> distinct (abstract characters)1 in the Unicode Standard, as well. No
> canonical equivalence is made, nor should there be.
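
To make the contrast concrete, here is a minimal sketch, again assuming Python's standard unicodedata module, that checks the three look-alike capital letters. The list name look_alikes is hypothetical. None of them carries a decomposition mapping, and canonical normalization leaves all three distinct.

```python
import unicodedata

# Latin O, Greek Omicron, Cyrillic O: visually similar capital letters.
look_alikes = ["\u004F", "\u039F", "\u041E"]

names = [unicodedata.name(ch) for ch in look_alikes]

# An empty decomposition field means no canonical or compatibility mapping.
has_mapping = [unicodedata.decomposition(ch) != "" for ch in look_alikes]

# Canonical normalization keeps the three characters apart.
nfd_forms = {unicodedata.normalize("NFD", ch) for ch in look_alikes}

for name in names:
    print(name)
print(has_mapping, len(nfd_forms))
```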
>
> But the ANGSTROM SIGN is simply a compatibility character pulled into the
> standard because of a mistaken disunification of function in one of the
> source Asian standards which was grandfathered into Unicode for legacy
> convertibility. SI units are simply Latin letters. There is no separate
> "A" for ampere, or "C" for coulomb, or "s" for second, or "k" for "kilo-"
> and so on. The ANGSTROM SIGN was just a standardization mistake for "Å"
> in this collection of SI units in an Asian standard, comparable to the
> cruft in the 33XX block, but standing out merely because it is a single
> letter rather than a square block of letters.
>
> This is completely comparable to the situation for CJK Compatibility
> characters. Look at the duplicates from the Korean Standard, e.g.,
> U+F907, U+F908. Those are not "different" characters. They are the
> *same* characters and are the same as the unified Han character for "turtle",
> i.e. U+9F9C. Or trying to put it more precisely, there is one abstractable
> Han character here, but it got encoded 3 times. And two of those instances
> are then labelled with a canonical equivalence that marks them as
> duplicates and points to the "real" encoded character.
>
> (And if the Han quibblers get hung up by the glyphic variability that is
> notorious for the "turtle" character, then by all means consider simpler
> examples like U+F963 "north", U+F967 "not", or U+F981 "woman", where
> Z-variation is not even an issue.)
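
The duplicate status of these compatibility ideographs can be observed with a short sketch, assuming Python's standard unicodedata module. The names compat_ideographs, to_hex and canonical are hypothetical. Each listed character is normalized to NFC, and the result is the unified ideograph it duplicates.

```python
import unicodedata

# Compatibility ideographs mentioned above (turtle twice, north, not, woman).
compat_ideographs = ["\uF907", "\uF908", "\uF963", "\uF967", "\uF981"]

def to_hex(s):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Map each compatibility ideograph to its NFC-normalized form.
canonical = {
    to_hex(ch): to_hex(unicodedata.normalize("NFC", ch))
    for ch in compat_ideographs
}

for source, target in canonical.items():
    print(source, "->", target)
```

Both "turtle" duplicates come out as U+9F9C, and none of the five characters survives normalization unchanged.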
>
> --Ken
