Lars Marius Garshol asked:
> I'm trying to find out what an abstract character is. I've been
> looking at chapter 3 of Unicode 3.0, without really achieving
> The term Unicode scalar value (apparently synonymous with code point)
> seems clear. It is the identifying number assigned to assigned
> Unicode characters.
Here is one of my attempts at a more rigorous term rectification:
Abstract character
that which is encoded; an element of the repertoire (existing
independent of the character encoding standard, and often
identifiable in other character encoding standards, as well
as the Unicode Standard); the implicit basis of transcodings.
Note that while in some sense abstract characters exist a
priori by virtue of the nature of the units of various writing
systems, their exact nature is only pinned down at the point
that an actual encoding is done. They are not always obvious,
and many new abstract characters may arise as the result of
particular textual processing needs that can be addressed by
characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER.)
Code point
A number from 0..10FFFF; a "point" in the codespace 0..10FFFF.
Encoded character
An *association* of an abstract character with a code point.
Unicode scalar value
A number from 0..D7FF, E000..10FFFF; the domain of the
functions which define the UTFs. The Unicode scalar value
definitionally excludes D800..DFFF, which are only code unit
values used in UTF-16, and which are not code points associated
with any well-formed UTF code unit sequences.
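To make the two ranges above concrete, here is a minimal Python sketch.
The function names are my own, chosen for illustration, and are not part
of any standard API.

```python
def is_code_point(n: int) -> bool:
    """True if n lies in the Unicode codespace 0..10FFFF."""
    return 0x0000 <= n <= 0x10FFFF


def is_scalar_value(n: int) -> bool:
    """True if n is a code point outside the surrogate range D800..DFFF."""
    return is_code_point(n) and not (0xD800 <= n <= 0xDFFF)
```

Under these definitions D800 is a code point but not a Unicode scalar
value, while 110000 is neither.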
Assignment (of code points)
Refers to the process of associating abstract characters with
code points. Mathematically, a code point is
"assigned to" an abstract character and an abstract
character is "mapped to" a code point.
This is distinguished from the vaguer sense of "assigned"
in general parlance as meaning "a code point given some
designated function by the standard", which would include
noncharacters and surrogates.
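The "designated but not assigned" classes can be sketched the same way.
This is only an illustration: the helper names are hypothetical, and the
noncharacter ranges (FDD0..FDEF, plus the last two code points of each
plane) are stated from my understanding of the standard rather than
quoted from the text above.

```python
def is_surrogate(cp: int) -> bool:
    """Surrogate code points: D800..DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """Noncharacters: FDD0..FDEF, and xxFFFE / xxFFFF in every plane."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # Masking with FFFE keeps bits 1..15, so FFFE and FFFF both match.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_assignable(cp: int) -> bool:
    """Code points that could be assigned to an abstract character."""
    in_codespace = 0 <= cp <= 0x10FFFF
    return in_codespace and not is_surrogate(cp) and not is_noncharacter(cp)
```

A code point for which is_assignable() returns False may still be
"designated" by the standard in the vaguer sense, but it never gets
assigned to an abstract character.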
> So far, so good. Some questions:
> - are all assigned Unicode characters also abstract characters?
Yes. Or rather: all encoded characters are assigned to abstract
characters.
(See above for my distinction between "assigned" and
"designated", which would apply to noncharacters and surrogate
code points -- neither of which classes of code points get
assigned to abstract characters.)
> - it seems that not all abstract characters have code points (since
> abstract characters can be formed using combining characters). Is
> that correct?
Yes. (Note above -- abstract characters are also a concept which
applies to other character encodings besides the Unicode Standard,
and not all encoded characters in other character encodings automatically
make it into the Unicode Standard, for various architectural reasons.)
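For the combining-character case in the question, a rough way to see
that some sequences have no single code point is to check whether NFC
normalization collapses them. The sketch below uses Python's standard
unicodedata module. The helper name is hypothetical, and the choice of
q followed by COMBINING TILDE as an example without a precomposed form
is my own assumption, not something taken from the standard's text.

```python
import unicodedata


def composes_to_single_code_point(seq: str) -> bool:
    """True if NFC normalization reduces seq to exactly one code point."""
    return len(unicodedata.normalize("NFC", seq)) == 1


a_ring = "A\u030A"   # A + COMBINING RING ABOVE
q_tilde = "q\u0303"  # q + COMBINING TILDE
```

Here composes_to_single_code_point(a_ring) is True, while the same call
on q_tilde is False: the sequence stays two code points long. Note that
this is only a heuristic, since composition exclusions mean NFC does not
always compose to a precomposed character that exists.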
> - do <U+00C5> (Å) and <U+0041, U+030A> (A followed by combining ring
> above) represent the same abstract character?
Yes. That is the implicit claim behind a specification of canonical
equivalence.
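Canonical equivalence of these two particular sequences can be checked
with Python's standard unicodedata module. This is a small sketch, and
the function name is hypothetical. It compares the NFD forms of the two
strings, which is one common way to test canonical equivalence.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00C5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"  # A followed by COMBINING RING ABOVE
```

The two strings differ as code point sequences (precomposed !=
decomposed), yet canonically_equivalent(precomposed, decomposed) is
True, and NFC normalization of the decomposed form yields U+00C5.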
> Would be good if someone could clear this up.
> Lars Marius Garshol, Ontopian <URL: http://www.ontopia.net >
> ISO SC34/WG3, OASIS GeoLang TC <URL: http://www.garshol.priv.no >
This archive was generated by hypermail 2.1.2 : Mon Jul 22 2002 - 14:53:44 EDT