Re: Abstract character?

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 22 2002 - 16:38:50 EDT


Lars Marius Garshol asked:

> I'm trying to find out what an abstract character is. I've been
> looking at chapter 3 of Unicode 3.0, without really achieving
> enlightenment.
>
> The term Unicode scalar value (apparently synonymous with code point)
> seems clear. It is the identifying number assigned to assigned
> Unicode characters.

Here is one of my attempts at a more rigorous term rectification:

Abstract character

   that which is encoded; an element of the repertoire (existing
   independent of the character encoding standard, and often
   identifiable in other character encoding standards, as well
   as the Unicode Standard); the implicit basis of transcodings.

   Note that while in some sense abstract characters exist a
   priori by virtue of the nature of the units of various writing
   systems, their exact nature is only pinned down at the point
   that an actual encoding is done. They are not always obvious,
   and many new abstract characters may arise as the result of
   particular textual processing needs that can be addressed by
   characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
   etc., etc.)

Code point

   A number from 0..10FFFF; a "point" in the codespace 0..10FFFF.

Encoded character

   An *association* of an abstract character with a code point.

Unicode scalar value

   A number from 0..D7FF, E000..10FFFF; the domain of the
   functions which define the UTFs. The Unicode scalar value
   definitionally excludes D800..DFFF, which are only code unit
   values used in UTF-16, and which are not code points associated
   with any well-formed UTF code unit sequences.
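
The difference between the two ranges can be sketched in a few lines of Python. The function names here are hypothetical, chosen only for illustration. The last helper relies on the fact that Python's built-in UTF-8 codec refuses to encode surrogate code points, which matches the point that they have no well-formed UTF code unit sequence.

```python
def is_code_point(n: int) -> bool:
    """True for any number in the codespace 0..10FFFF."""
    return 0x0000 <= n <= 0x10FFFF


def is_scalar_value(n: int) -> bool:
    """True for code points outside the surrogate range D800..DFFF."""
    return is_code_point(n) and not (0xD800 <= n <= 0xDFFF)


def utf8_encodable(n: int) -> bool:
    """True if Python's strict UTF-8 codec accepts the code point."""
    try:
        chr(n).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for n in (0x0041, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF):
        print(f"U+{n:04X}", is_code_point(n), is_scalar_value(n), utf8_encodable(n))
```

For every code point, `is_scalar_value` and `utf8_encodable` should agree: only the surrogate range is rejected by both.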

Assignment (of code points)

   Refers to the process of associating abstract characters with
   code points. Mathematically, a code point is
   "assigned to" an abstract character, and an abstract
   character is "mapped to" a code point.

   This is distinguished from the vaguer sense of "assigned"
   in general parlance as meaning "a code point given some
   designated function by the standard", which would include
   noncharacters and surrogates.
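
As a rough illustration of the two senses, the following Python sketch classifies a code point. The function names and the returned labels are hypothetical. The surrogate and noncharacter ranges are fixed by the standard, but the "assigned" versus "unassigned" answer comes from the General_Category in whatever version of the Unicode Character Database ships with the Python interpreter, so it can change between versions. Private-use code points are reported separately as an assumption of this sketch; the definitions above do not discuss them.

```python
import unicodedata


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def status(cp: int) -> str:
    """Classify a code point in the range 0..10FFFF."""
    if is_surrogate(cp):
        return "designated: surrogate"
    if is_noncharacter(cp):
        return "designated: noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cn":   # no abstract character associated (yet)
        return "unassigned"
    if category == "Co":   # private use; outside the scope of the definitions above
        return "private use"
    return "assigned"


if __name__ == "__main__":
    for cp in (0x0041, 0xD800, 0xFDD0, 0xFFFE, 0x10FFFF):
        print(f"U+{cp:04X}: {status(cp)}")
```

Only the code points reported as "assigned" are assigned in the strict sense; surrogates and noncharacters have a designated function but no abstract character.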

>
> So far, so good. Some questions:
>
> - are all assigned Unicode characters also abstract characters?

Yes. Or rather: all encoded characters are assigned to abstract
characters.

(See above for my distinction between "assigned" and
"designated". The latter applies to noncharacters and surrogate
code points -- neither of which classes of code points gets
assigned to abstract characters.)

>
> - it seems that not all abstract characters have code points (since
> abstract characters can be formed using combining characters). Is
> that correct?

Yes. (Note above -- abstract characters are also a concept which
applies to other character encodings besides the Unicode Standard,
and not all encoded characters in other character encodings automatically
make it into the Unicode Standard, for various architectural reasons.)
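
The combining-character case from the question can be checked with Python's `unicodedata` module. This is a minimal sketch, and the choice of example is an assumption: `q` followed by COMBINING TILDE is taken here as a sequence with no precomposed code point, so NFC normalization leaves it as two code points, whereas `a` followed by the same combining mark composes to the single code point U+00E3.

```python
import unicodedata


def nfc_code_points(text: str) -> list[str]:
    """Return the code points of the NFC form of text, as U+XXXX strings."""
    composed = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in composed]


if __name__ == "__main__":
    print(nfc_code_points("a\u0303"))  # has a precomposed form
    print(nfc_code_points("q\u0303"))  # stays a two-code-point sequence
```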

>
> - do <U+00C5> (Å) and <U+0041, U+030A> (A followed by combining ring
> above) represent the same abstract character?

Yes. That is the implicit claim behind a specification of canonical
equivalence.
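
One way to test that claim in practice is to compare normalized forms. The sketch below uses Python's `unicodedata.normalize`; the helper name `canonically_equivalent` is hypothetical. Two strings are treated as canonically equivalent when their NFD forms are identical.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings by their canonical decompositions (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
    sequence = "A\u030A"     # A followed by COMBINING RING ABOVE

    print(precomposed == sequence)                        # raw comparison
    print(canonically_equivalent(precomposed, sequence))  # after normalization
```

The raw comparison sees different code point sequences, while the normalized comparison treats them as the same text.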

--Ken

>
> Would be good if someone could clear this up.
>
> --
> Lars Marius Garshol, Ontopian <URL: http://www.ontopia.net >
> ISO SC34/WG3, OASIS GeoLang TC <URL: http://www.garshol.priv.no >
>
>
>


