Re: An Aburdly Brief Introduction to Unicode (was Re: Perception ...)

From: Peter_Constable@sil.org
Date: Mon Feb 26 2001 - 11:09:22 EST


On 02/24/2001 02:36:26 PM "Mark Davis" wrote:

>The glossary entry for "abstract character", as he points out, was inherited
>from 10646.
>
>"Abstract Character. A unit of information used for the organization,
>control, or representation of textual data. (See Definition D3 in Section
>3.3, Characters and Coded Representations.)"
>[http://www.unicode.org/glossary/#abstract_character]
>
>It is, of course, a *completely and utterly vacuous* definition.

I don't see that that's the case. A definition isn't vacuous if it succeeds
in helping you and me understand what each other is talking about. Formal
language systems require formal and explicit definitions for terminology
and for constructed notions to be used within the system. This is all
internal to the system. For humans talking about the system, though, some
notions and objects that exist within the system represent something
salient to us that we'd like to be able to refer to. Then of course we
need a term for that, and we need to provide some definition so that other
humans encountering the term for the first time know what kinds of things
we mean by it. We might define it extensionally (in terms of the set of all
instances), ostensively (by pointing out an instance), or intensionally (by
giving a set of defining criteria). The kind of definition doesn't matter.
All that matters is that we know what the other means.

The notion of "the units of information used for the organization, control,
or representation of textual data that the Standard encodes" is a notion we
need to be able to make use of in our discussions, and so it is fit to have
a term to refer to it.

>According to those criteria, a 'bit' would also be an abstract character: It
>is a unit of information. It is used for the representation of textual data
>(without bits we wouldn't be able to represent them). Voila!

But a bit itself does not constitute a complete unit that is of the type
that we need to refer to in our discussions by anything other than "bit".
It is highly unlikely that anybody will confuse the above definition to
include bits. This doesn't demonstrate that the above definition is
vacuous. It merely cites a way in which more careful clarification may be
needed. (In this case, I don't think it is needed.)

>The problem is, there is no notion of what the "unit of information" is:
>could be a bit, could be a magazine, could be a library...

Sure, it *could* be any of these things and more. But we all know that it
is not.

>One
>cannot draw the conclusion from this definition that there is a 1:1
>relationship between abstract characters and code points. As a matter of
>fact, from this definition, one cannot draw many conclusions at all.

True, if we assume *no other knowledge*. (I can't believe we're discussing
exegesis in this context!) But there is an implicit interpretation that has
seemed clear, at least, to me when looking at the Standard as a whole. The
units in the definition are the atomic units that constitute the elements of
the character repertoire of the Standard. The point of the definition is to
give those a name that we can use to refer to them in discussion, and to
clarify that they include not only units that *represent* textual
information, but also units for *control* and *organization* of such
representation.

>In the Unicode Standard and technical reports, we don't say that an
>abstract character must have a 1:1 relationship with a code point: where we
>discuss it, we say precisely the opposite. Look at Figure 2-6 and the
>surrounding text.

Well, evidently the impression I got "from the Standard as a whole" was
incomplete. You have now convinced me that the above definition *is*
vacuous, not because the wording of the definition itself is somehow
deficient, but because the very notion intended by those that wrote the
definition appears to be vacuous. Clearly, we haven't understood what each
other has been talking about up to now. It is the lack of such common
understanding that leads people to say that others are wrong, are
temporarily insane, are "off their game", and other such things when the
two are merely talking about different things.

As I indicated above, I think that there is a non-vacuous notion that
merits a specific term for the purposes of discussion, and that that notion
is the one I have been assuming up to now.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>

>The commentary around the definitions D3-D5 also points
>that out, as does much of the text in UTR 17, section 2.2. There may be some
>points where the wording is imprecise, and could be misleading, but the
>intent and direction are clear. If you find text that seems to imply a 1:1
>relationship, then please report it to errata@unicode.org.
>
>Interpreting "abstract characters" as "the things encoded in a character
>set" is not supported by the definition. Moreover, as well as being
>completely circular, it drastically violates our expectations. If you were
>not thinking of character sets, would you say that the language tag
>introducer is an abstract character? The right-left mark? Is the "flip the
>character over and turn it inside out control" simply an abstract character
>waiting to be encoded? If Unicode decides to assign a codepoint to represent
>my car -- not the image of my car, or a textual representation of my car,
>but simply to represent my car -- does that suddenly make my car an abstract
>character?
>
>I think not.
>
>As Ken says, we have let sleeping dogs lie as far as deviating from the
>formal definition used by 10646; however, if this is causing people to
>misinterpret the standard we should work with WG2 to come up with a useful,
>valid definition.
>
>Mark
>
>----- Original Message -----
>From: "Kenneth Whistler" <kenw@sybase.com>
>To: "Unicode List" <unicode@unicode.org>
>Cc: <unicode@unicode.org>; <kenw@sybase.com>
>Sent: Friday, February 23, 2001 14:44
>Subject: Re: An Aburdly Brief Introduction to Unicode (was Re: Perception ...)
>
>
>> Peter expostulated:
>>
>> > I think Mark is either temporarily off his game, or else he's obfuscating
>> > terminology. "Abstract character" is defined in definition D3 on p. 40 of
>> > TUS3.0. The relationship between abstract characters and codepoints is
>> > defined in UTR17: "An abstract character is defined to be in a coded
>> > character set if the coded character set maps from it to an integer. That
>> > integer is said to be the code point for the abstract character." UTR17
>> > doesn't make this clear, but the mapping between abstract characters and
>> > integers is a bijection, i.e. 1:1. Thus, it is impossible for multiple
>> > abstract characters (as here defined) to map to a single codepoint, or for
>> > a single abstract character to map to multiple codepoints.
>>
>> As for everything in the Unicode Standard, simple things get complicated,
>> and terminology slips away from us.
>>
>> The current glossary entry, it is true, defines "abstract character" as:
>>
>> "A unit of information used for the organization, control, or
>> representation of textual data."
>>
>> That is deliberately chosen to be identical to the SC2 definition of
>> "character", so people will know what we are talking about.
>>
>> In that sense, there is a certain tautology involved. A character encoding
>> associates numbers with characters to encode them. The "encoding" is the
>> number associated with the character. The "encoded character" is the
>> character with its associated encoding. And the "character" is that which
>> was encoded.
>>
>> So in this very limited sense of character (i.e., what the Unicode Standard
>> terms "abstract character"), there never could be other than a one-to-one
>> relationship, and each abstract character has exactly one encoding.
>>
>> And the logical implication of this is that if I generated a character
>> encoding that encoded the Latin small letter a 16 different times at
>> different encoding points, there would be 16 different abstract characters
>> for the representation of the letter a, rather than 16 different encodings
>> for the same abstract character.
>>
>> However, ...
>>
>> This usage has always run counter to the sense that we all have that there
>> are entities "out there" to be encoded, and that if at all possible, for
>> usability of the standard, each one should only be encoded once. Encoding
>> "a" 16 times in a character encoding standard might literally create 16
>> abstract characters, but it doesn't twist reality with it to also install
>> 16 letters "a" into the Latin alphabet. Call this concept, if you will,
>> "abstractable character", if that will help in distinguishing it from the
>> definition of "abstract character" currently in the glossary.
>>
>> The Unicode Character Encoding Model has been using the term "abstract
>> character" in this latter sense, as an element of a repertoire, abstracted
>> prior to any concern for encoding per se.
>>
>> If you look at it this way, it is clearly possible for one "abstractable
>> character" to end up being encoded twice, or even more times in the
>> standard. That is, in fact, what singleton canonical mappings are all
>> about. They are determinations by the Unicode Technical Committee that
>> a character represents a *duplicate* encoding, for whatever legacy
>> compatibility reasons, of the same "abstractable character". If done
>> from scratch, the standard would delete them all as duplications in the
>> encoding, but of course we cannot delete anything -- even encoded characters
>> determined to be duplicates.
>>
>> This sense of "abstractable character", i.e., a member of the set of
>> entities in the abstract repertoire that is eligible for encoding as
>> a character, is what Mark had in mind. And, in fact, if you look at Figure
>> 2-6, on page 19 of the standard, you will see exactly the kind of usage
>> that Mark was drawing on, using the very example of {a-with-ring} that
>> has drawn fire here.
>>
>> Re Peter's earlier assertion:
>>
>> > {a with ring above} is not an abstract character according to the
>> > definition used in the standard. It may be a grapheme in one or more
>> > writing systems; it may be any number of objects, but it is not an abstract
>> > character in the Unicode repertoire. LATIN CAPITAL LETTER A WITH RING ABOVE
>> > and ANGSTROM SIGN are abstract characters, and are different. They happen
>> > to be canonically equivalent, but that is beside the point and does not
>> > mean that they are not different abstract characters.
>>
>> Actually, {a with ring above} is an abstract character in either of
>> the two senses I have talked about above.
>>
>> {a with ring above} is an (abstract character)1, in that it is a
>> "unit of information used for the ... representation of textual data".
>> In the Unicode Standard, it has an encoding of 0x00C5 and a name of
>> "LATIN CAPITAL LETTER A WITH RING ABOVE". And the encoded character is:
>>
>> U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
>>
>> {a with ring above} is an (abstract[able] character)2, in that it is
>> a member of the abstract repertoire of entities that are encoded as
>> characters in the Unicode Standard. And in the Unicode Standard, that
>> abstractable character has two encodings, i.e., is associated with
>> two encoded characters:
>>
>> U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
>> U+212B ANGSTROM SIGN
>>
>> That that double encoding is of the *same* abstract[able] character is
>> a determination by the UTC, and is recorded by the singleton canonical
>> mapping in the data table.
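
The singleton canonical mapping described above can be inspected in the character data that ships with Python's standard `unicodedata` module. The following is only a minimal sketch: the helper name `singleton_target` is hypothetical, and the results reflect whatever version of the Unicode Character Database the local Python build carries.

```python
import unicodedata

A_RING = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212B"  # ANGSTROM SIGN


def singleton_target(ch):
    """Return the one character that ch canonically maps to, or None.

    Hypothetical helper: a canonical decomposition has no <tag> prefix,
    and a singleton mapping consists of exactly one code point.
    """
    fields = unicodedata.decomposition(ch).split()
    if len(fields) == 1 and not fields[0].startswith("<"):
        return chr(int(fields[0], 16))
    return None


# ANGSTROM SIGN carries a singleton mapping to U+00C5.
print(unicodedata.name(ANGSTROM), "->", singleton_target(ANGSTROM))

# U+00C5 itself decomposes to two code points, so it is not a singleton.
print(unicodedata.name(A_RING), "->", singleton_target(A_RING))

# Normalization treats the two encoded characters as the same text.
print(unicodedata.normalize("NFC", ANGSTROM) == A_RING)
```

Comparing normalized forms, rather than raw code points, is the usual way to treat the two encodings as one character in practice.
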
>>
>> I guess this is just one more piece of the terminological mess regarding
>> character encoding that we will have to wrestle with when we work on
>> retrofitting the Character Encoding Model onto the text of the standard
>> for Unicode 4.0.
>>
>> >
>> > This seems so obvious to me, and I'm very surprised to hear this coming
>> > from no less than Messrs. Davis and Whistler. It must mean I'm missing
>> > something, but I'm sure I'm not. What's in the water these days out in the
>> > Bay area?
>>
>> Fluoride. That must be it.
>>
>> Mike Brown said:
>>
>> > From mbrown@webb.net Fri Feb 23 13:11 PST 2001
>> > To: "'Kenneth Whistler'" <kenw>
>> >
>> > Hmm. I was under the impression that LATIN CAPITAL LETTER A WITH RING ABOVE
>> > and ANGSTROM SIGN are two distinct characters with distinct semantics that
>> > happen to have canonical equivalence in Unicode because they are visually
>> > indistinguishable. I didn't think this interchangeability made them
>> > necessarily be the same single abstract character "a with ring above" as
>> > your example states. Am I mistaken?
>> >
>> > - Mike
>> > ____________________________________________________________________
>> > Mike J. Brown, software engineer at My XML/XSL resources:
>> > webb.net in Denver, Colorado, USA http://skew.org/xml/
>> >
>> >
>> > PS- I can't cc the list at the moment because my employer changed my address
>> > on me; my posts won't go through. Feel free to forward upon reply.
>>
>> Visual indistinguishability is insufficient grounds. On that alone, we
>> would end up equating Latin O, Cyrillic O, and Greek O, but of course, they
>> have been long treated as distinct abstractable characters, and are
>> distinct (abstract characters)1 in the Unicode Standard, as well. No
>> canonical equivalence is made, nor should there be.
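
By contrast with the ANGSTROM SIGN case, the three look-alike capital letters carry no canonical mapping at all. This is a small sketch using Python's standard `unicodedata` module; the helper `canonically_equivalent` is a hypothetical name, and it compares NFD forms as one reasonable way to test canonical equivalence.

```python
import unicodedata
from itertools import combinations

LOOKALIKES = {
    "Latin": "\u004F",     # LATIN CAPITAL LETTER O
    "Greek": "\u039F",     # GREEK CAPITAL LETTER OMICRON
    "Cyrillic": "\u041E",  # CYRILLIC CAPITAL LETTER O
}


def canonically_equivalent(a, b):
    """Hypothetical helper: two strings are canonically equivalent
    when their NFD normalizations are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


for script, ch in LOOKALIKES.items():
    # An empty decomposition field means no mapping is recorded.
    print(script, f"U+{ord(ch):04X}", unicodedata.name(ch),
          repr(unicodedata.decomposition(ch)))

for (s1, c1), (s2, c2) in combinations(LOOKALIKES.items(), 2):
    print(s1, "vs", s2, "equivalent:", canonically_equivalent(c1, c2))
```
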
>>
>> But the ANGSTROM SIGN is simply a compatibility character pulled into the
>> standard because of a mistaken disunification of function in one of the
>> source Asian standards which was grandfathered into Unicode for legacy
>> convertibility. SI units are simply Latin letters. There is no separate
>> "A" for ampere, or "C" for coulomb, or "s" for second, or "k" for "kilo-"
>> and so on. The ANGSTROM SIGN was just a standardization mistake for "Å"
>> in this collection of SI units in an Asian standard, comparable to the
>> cruft in the 33XX block, but standing out merely because it is a single
>> letter rather than a square block of letters.
>>
>> This is completely comparable to the situation for CJK Compatibility
>> characters. Look at the duplicates from the Korean Standard, e.g.,
>> U+F907, U+F908. Those are not "different" characters. They are the
>> *same* characters and are the same as the unified Han character for "turtle",
>> i.e. U+9F9C. Or trying to put it more precisely, there is one abstractable
>> Han character here, but it got encoded 3 times. And two of those instances
>> are then labelled with a canonical equivalence that marks them as
>> duplicates and points to the "real" encoded character.
>>
>> (And if the Han quibblers get hung up by the glyphic variability that is
>> notorious for the "turtle" character, then by all means consider simpler
>> examples like U+F963 "north", U+F967 "not", or U+F981 "woman", where
>> Z-variation is not even an issue.)
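
The same check can be run against the CJK compatibility ideographs mentioned above. Again this is only a sketch against Python's standard `unicodedata` module; `nfc_report` is a hypothetical helper, and the targets printed for U+F963, U+F967 and U+F981 come from the local Unicode data rather than from anything asserted here.

```python
import unicodedata

TURTLE = "\u9F9C"                        # unified Han character for "turtle"
DUPLICATES = ["\uF907", "\uF908"]        # duplicates from the Korean standard
OTHERS = ["\uF963", "\uF967", "\uF981"]  # "north", "not", "woman" per the text


def nfc_report(chars):
    """Hypothetical helper: map each character's U+XXXX label to the
    U+XXXX label of its NFC-normalized form."""
    report = {}
    for ch in chars:
        normalized = unicodedata.normalize("NFC", ch)
        report[f"U+{ord(ch):04X}"] = f"U+{ord(normalized):04X}"
    return report


# Both compatibility code points normalize to the one unified ideograph.
print(nfc_report(DUPLICATES))

# The simpler examples also normalize away from the compatibility block.
print(nfc_report(OTHERS))
```
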
>>
>> --Ken
>


