Definition of character

From: Asmus Freytag
Date: Tue, 12 Jul 2011 09:57:33 -0700


Reminding everyone of the definition of "technical term" as opposed to a
word in everyday language does not help address the underlying issue.
Everyone is familiar with this distinction.

You note that there's a bit of a truism that underlies the definition of
character and character encoding, but I would claim this is not limited
to Unicode, and has nothing to do with promoting that standard. The
truism goes like this: "A character is what character encodings encode".

As such, "character" also becomes the smallest unit on which algorithms
for processing textual data operate.

Historically, character encodings have also encoded, on otherwise equal
footing, units that are intended for device control. Over time, some of
the device control characters have been redefined as indicators of
logical division of text. (TAB and LF are the most prominent examples of
this evolution).

These historical developments have left us with this and other examples
of deep ambiguities in the definition of the members of those sets we
call "character encodings". These ambiguities are reflected in the
technical (as opposed to everyday) usage of the term "character". I
fully agree with Ken that you can't "fix" this situation by definitional

Let's look at the putative benefit of a better definition. I think such
a benefit has implicitly been claimed to exist, but I would ask for a
demonstration in this case.

One possible benefit of a solid definition of the members of a set is in
helping decide which additional entities should be made members of the
set. Can there be a definition of "character" that provides a solid
guidepost for evaluating future proposed character additions to the

Over twenty years of work on the Unicode Standard (and decades of work
on earlier standards) have clearly demonstrated that it is impossible to
devise an "algorithm" for deciding which candidates are worthy of being
encoded in Unicode (or any other character encoding).

The problem goes back to the incredible diversity of writing systems and
notations and their use. It is further complicated by the fact that
breaking down a writing system into elements (identifying the
characters) can quite often be done in more than one way. In many
instances it's not even obvious which method is the "best" in a given
circumstance. Attempts to base this process on mechanistic rules (driven
by definitions) are bound to fail.
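
One familiar illustration, offered here as an assumption about what such
alternative analyses can look like rather than as an example taken from
this discussion: Unicode can represent "é" either as one precomposed
character or as a base letter followed by a combining accent. The sketch
below uses Python's standard unicodedata module; the names describe,
precomposed and decomposed are purely illustrative.

```python
import unicodedata

# Two analyses of the same written element "é".
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def describe(text):
    """Return the Unicode character names that make up `text`."""
    return [unicodedata.name(ch) for ch in text]


# The two analyses yield different character sequences ...
assert precomposed != decomposed
assert len(precomposed) == 1 and len(decomposed) == 2

# ... which normalization treats as canonically equivalent.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

print(describe(precomposed))
print(describe(decomposed))
```

Neither breakdown is the "right" one in the abstract; which elements count
as characters follows from the analysis that was encoded.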

Hence, "characters" are the outcome of a creative (human) process of
analyzing writing systems. Once you have made a particular analysis,
usually ending in an encoding, the elements thus defined are "de facto"
the "characters".

If you were to accept that it is impossible to rigorously define
characters for purposes of making this analysis, the problem becomes
simpler. "Abstract" characters are then entities encoded in one (or
more) character encodings, and "character" is what character encodings
encode. Operationally, characters are "the smallest units operated on by
algorithms that process textual data".

"Operated on" would sidestep the distinctions between characters that
represent elements of a writing system like "A" and what Unicode calls
format controls like "RLM" (or the segmentation characters like "PS",
"LF", "TAB").
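
To make the diversity concrete, here is a minimal sketch in Python that
looks up the Unicode General_Category of the characters just named, using
the standard unicodedata module. The SAMPLES table and the function name
are illustrative assumptions.

```python
import unicodedata

# The characters mentioned above, keyed by their usual abbreviations.
SAMPLES = {
    "A": "\u0041",    # LATIN CAPITAL LETTER A
    "RLM": "\u200f",  # RIGHT-TO-LEFT MARK
    "PS": "\u2029",   # PARAGRAPH SEPARATOR
    "LF": "\u000a",   # LINE FEED
    "TAB": "\u0009",  # CHARACTER TABULATION
}


def general_categories(samples):
    """Map each label to the General_Category of its character."""
    return {label: unicodedata.category(ch) for label, ch in samples.items()}


categories = general_categories(SAMPLES)
for label, category in categories.items():
    print(f"{label:>3}: {category}")
```

All five are single units as far as the string type is concerned, even
though they fall into a letter category, a format category, a separator
category and the control category.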

A bit is not the smallest unit, because the algorithms (as logically
described) don't operate on bits, they are defined in terms of
characters (or sequences of characters).
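
A small sketch of this point, assuming a deliberately toy "algorithm"
(reversing a string) and hypothetical helper names: the operation is
specified per character, so it gives the same result whichever encoding
form the bits happened to be stored in, while applying it to the bytes
themselves produces something that is no longer text.

```python
def reverse_characters(text):
    """Toy text algorithm, defined per character rather than per bit or byte."""
    return text[::-1]


def is_valid_utf8(data):
    """Return True if `data` decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


original = "a\u00f1b"  # "a", LATIN SMALL LETTER N WITH TILDE, "b"

# Two different bit-level representations of the same characters.
utf8 = original.encode("utf-8")
utf16 = original.encode("utf-16-le")
assert utf8 != utf16

# The character-level algorithm does not care which one was used.
from_utf8 = reverse_characters(utf8.decode("utf-8"))
from_utf16 = reverse_characters(utf16.decode("utf-16-le"))
assert from_utf8 == from_utf16

# Reversing the raw bytes instead splits the two-byte sequence for U+00F1.
assert not is_valid_utf8(utf8[::-1])
```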

For a fuller definition you might need to make clear that "display" is
covered by "process", and you might need a way to cover the traditional
use of control characters. They could be described as the smallest units
operated on by algorithms that control devices displaying text, based on
data embedded in a text stream.

While there might be some improvement in rewording the glossary entries
in this way, doing so neither removes the inherent tautology nor
eliminates the fact that characters are very diverse in what they
represent.

But it might make clear that no definition of "character" will ever be
sufficient to serve as input to the process of deciding the question of
whether a proposed new entity is or isn't a character.

Received on Tue Jul 12 2011 - 12:03:18 CDT
