Re: Abstract character?

From: Barry Caplan (bcaplan@i18n.com)
Date: Mon Jul 22 2002 - 17:34:59 EDT


I usually define an abstract character in talks I give as "an element of a writing system that you care about, independent of glyphs, and certainly independent of encodings or specific code points".

If it could be described more precisely than that, it wouldn't be "abstract", would it? :)

This is usually brought up in a series of definitions leading from "character" (what we are referring to here as an "abstract character"), and then:

- "character list" - a list of "characters" one is interested in
- "character set" - a list of "character lists", which may or may not be ordered, but still has no codepoints
- "encoding scheme" - an algorithm for assigning code points to a "character set"
- "code point" - the representation of an "abstract character" in an "encoding scheme"
- "font" - a series of glyphs that are used to display the characters represented by code points, in their immediate context

All of this is filled with examples - building to an explanation of Unicode. For example, wrt "abstract character", I ask the audience to ponder if "upper case A" and "lower case a" are the same "abstract character". Also, I ask them to ponder if "lower case a" displayed in "Helvetica" is the same "character" as "lower case a" in "Times Roman". Finally, how about "lower case a in 9 point Helvetica" and "lower case a in 18 point Helvetica"?

And apropos a thread from last week, Unicode introduces new concepts such as "character properties", which means that the anticipation and intrigue I spend time building in the audience, that there is a neat solution to the historical morass I just spent 40 minutes describing, gets thoroughly dashed! Joy!

Implicit in this set of definitions is of course that a "character" may or may not be of interest to all "character lists", and therefore may or may not end up represented in more than one encoding. Also note that even when it does end up in more than one, this model in no way implies a round trip capability.

This leads nicely into a discussion about some very important aspects of internationalizing code and working with 3rd party components.
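
For anyone who wants to poke at the point about a "character" landing in some encodings but not others, here is a minimal Python sketch. It assumes Python 3 and its standard codecs; the helper name encoded_bytes and the choice of example characters are mine, purely for illustration.

```python
def encoded_bytes(ch, encoding):
    """Return the bytes for ch in the given encoding, or None if that
    encoding's repertoire does not include the character."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        return None


e_acute = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
euro = "\u20AC"     # EURO SIGN

# The same abstract character gets different values in different encodings.
latin1_e = encoded_bytes(e_acute, "latin-1")
cp437_e = encoded_bytes(e_acute, "cp437")

# A character may be in one repertoire and missing from another.
cp1252_euro = encoded_bytes(euro, "cp1252")
latin1_euro = encoded_bytes(euro, "latin-1")  # None: not in Latin-1

# Nothing guarantees a round trip: forcing the euro sign through Latin-1
# with a replacement policy loses it.
round_tripped = euro.encode("latin-1", errors="replace").decode("latin-1")

print(latin1_e, cp437_e, cp1252_euro, latin1_euro, round_tripped)
```

Whether a lossy conversion raises an error or silently substitutes is a policy choice of the converter, which is exactly the sort of thing to check in 3rd party components.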

Barry Caplan
www.i18n.com

At 01:38 PM 7/22/2002 -0700, Kenneth Whistler wrote:
>Lars Marius Garshol asked:
>
>> I'm trying to find out what an abstract character is. I've been
>> looking at chapter 3 of Unicode 3.0, without really achieving
>> enlightenment.
>>
>> The term Unicode scalar value (apparently synonymous with code point)
>> seems clear. It is the identifying number assigned to assigned
>> Unicode characters.
>
>Here is one of my attempts at a more rigorous term rectification:
>
>Abstract character
>
> that which is encoded; an element of the repertoire (existing
> independent of the character encoding standard, and often
> identifiable in other character encoding standards, as well
> as the Unicode Standard); the implicit basis of transcodings.
>
> Note that while in some sense abstract characters exist a
> priori by virtue of the nature of the units of various writing
> systems, their exact nature is only pinned down at the point
> that an actual encoding is done. They are not always obvious,
> and many new abstract characters may arise as the result of
> particular textual processing needs that can be addressed by
> characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
> etc., etc.)
>
>Code point
>
> A number from 0..10FFFF; a "point" in the codespace 0..10FFFF.
>
>Encoded character
>
> An *association* of an abstract character with a code point.
>
>Unicode scalar value
>
> A number from 0..D7FF, E000..10FFFF; the domain of the
> functions which define UTF's. The Unicode scalar value
> definitionally excludes D800..DFFF, which are only code unit
> values used in UTF-16, and which are not code points associated
> with any well-formed UTF code unit sequences.
>
>Assignment (of code points)
>
> Refers to the process of associating abstract characters with
> code points. Mathematically a code point is
> "assigned to" an abstract character and an abstract
> character is "mapped to" a code point.
>
> This is distinguished from the vaguer sense of "assigned"
> in general parlance as meaning "a code point given some
> designated function by the standard", which would include
> noncharacters and surrogates.
>
>>
>> So far, so good. Some questions:
>>
>> - are all assigned Unicode characters also abstract characters?
>
>Yes. Or rather: all encoded characters are assigned to abstract
>characters.
>
>(See above for my distinction between "assigned" and
>"designated", which would apply to noncharacters and surrogate
>code points -- neither of which classes of code points get
>assigned to abstract characters.)
>
>>
>> - it seems that not all abstract characters have code points (since
>> abstract characters can be formed using combining characters). Is
>> that correct?
>
>Yes. (Note above -- abstract characters are also a concept which
>applies to other character encodings besides the Unicode Standard,
>and not all encoded characters in other character encodings automatically
>make it into the Unicode Standard, for various architectural reasons.)
>
>>
>> - do <U+00C5> (Å) and <U+0041, U+030A> (A followed by combining ring
>> above) represent the same abstract character?
>
>Yes. That is the implicit claim behind a specification of canonical
>equivalence.
>
>--Ken
>
>>
>> Would be good if someone could clear this up.
>>
>> --
>> Lars Marius Garshol, Ontopian <URL: http://www.ontopia.net >
>> ISO SC34/WG3, OASIS GeoLang TC <URL: http://www.garshol.priv.no >
>>
>>
>>
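
A couple of the quoted definitions can be checked mechanically. The following is a rough Python sketch, assuming Python 3 and the standard unicodedata module; the function name is_scalar_value is hypothetical, and the ranges are taken directly from the definitions quoted above.

```python
import unicodedata

precomposed = "\u00C5"       # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "\u0041\u030A"  # A followed by COMBINING RING ABOVE

# Two different code point sequences...
sequences_differ = precomposed != decomposed

# ...that are canonically equivalent: each normalizes to the other.
nfc_matches = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_matches = unicodedata.normalize("NFD", precomposed) == decomposed


def is_scalar_value(cp):
    """True for 0..D7FF and E000..10FFFF, i.e. any code point in
    0..10FFFF that is not a surrogate (D800..DFFF)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# A surrogate code point has no well-formed UTF-8 encoding.
try:
    chr(0xD800).encode("utf-8")
    surrogate_encodable = True
except UnicodeEncodeError:
    surrogate_encodable = False

print(sequences_differ, nfc_matches, nfd_matches, surrogate_encodable)
```

Comparing strings only after normalizing both to the same form is one common way to treat the two spellings of Å as the same abstract character.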


