Re: Definition of character

From: Ken Whistler
Date: Wed, 13 Jul 2011 17:43:49 -0700

Since Jukka seemed to take issue with my responding to his proffered
definitions by instead bringing up an analogy between "life" and
"character", I'll try responding directly to the attempted clarifications.

On 7/13/2011 12:45 AM, Jukka K. Korpela wrote:
> That’s a completely different issue. The purpose of definitions and
> consistent use of terms is not to set guidelines for decisions. It
> must be possible to say that a particular text character is not a
> Unicode character without implying (as a naturalistic fallacy of a
> kind) that it should be.

UTC members say stuff like that all the time, without confusion or

   "The characters of the Tangut script are not yet encoded in the
Unicode Standard."

> The entire “definition” of the word “character” in the Unicode
> Glossary is highly confusing, and so is “abstract character.”

"Abstract character" is deliberately aligned with the longstanding SC2
normative definition
of "character". See 10646:

"character: member of a set of elements used for the organization,
control, or
representation of textual data"

That goes way back in the history of SC2. Back to the 8859 series before
10646, and then back
to ISO 2022 before that. There is little point in revising that, as it
would only introduce
a disconnect (and the potential for more confusion).

"character (2)" in the glossary is simply a synonym for "abstract
character". It is what
people talk about when they are talking character encoding theory, as
opposed to
what particular entities are encoded in a particular character encoding.

"character (1)" in the glossary is what you are defining below as "text
character" -- it is an
element of writing, considered independently of any considerations of
character encoding.

"character (3)" in the glossary is what you are defining below as
"Unicode character". Since
not all "abstract characters" are actually encoded in the Unicode
standard, nor are all
"text characters", we need some concept of "characters that are encoded
in the Unicode
Standard". And when the context of Unicode is already implied, that is
almost always
what "character" means, in the documentation or the discussion.

> They would perhaps best be replaced by the following:

Now, as to your particular suggestions:

> Unicode character. A Unicode code point classified to be a character
> code point. It may represent a text character, a component of a text
> character (such as an accent symbol), or a control code for text
> formatting.

"a character code point" is an undefined term here. We can talk about
assigning a code
point to a character (1). If we do so, then that character becomes
an "encoded character"
(q.v. in the glossary). If that assignation occurs in the Unicode
Standard, then it becomes
a "Unicode encoded character". "Unicode character" is our general
shorthand for
"Unicode encoded character", and we often shorten it just to "character
(3)", because most
of the time it is assumed we are talking about Unicode encoded characters.

"component of a text character" is another undefined term here. It begs
questions of
graphology: why this "component", and not that "component", and what is
a "component"?

"It may represent", rather than clarifying, actually muddies the
definitional context here.

Definitionally, a "Unicode encoded character" is an association between
a particular
(Unicode) code point and a particular abstract character. What that
abstract character
itself then represents is beside the point.
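
As a rough illustration of that association (a minimal sketch, not part of
the standard's text): Python's standard unicodedata module can list the code
point and character name for each encoded character in a string. The helper
name encoded_characters is hypothetical, introduced only for this example.
It also shows that one thing people think of as a character may correspond
to either a single encoded character or a sequence of them.

```python
import unicodedata


def encoded_characters(text):
    """Return (code point, character name) pairs, one per code point in text.

    Hypothetical helper for illustration: each pair stands for one
    "Unicode encoded character", i.e. a code point associated with an
    abstract character (identified here by its character name).
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in text
    ]


# The same user-perceived letter, written two ways.
precomposed = "\u00E9"   # one encoded character
decomposed = "e\u0301"   # a sequence of two encoded characters

print(encoded_characters(precomposed))
print(encoded_characters(decomposed))

# Canonical composition (NFC) maps the sequence to the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

What the letter "means" to a reader plays no part in the lookup; only the
pairing of code point and abstract character does.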

> Text character. An element of writing recognized as a basic unit of
> text, such as a letter, digit, punctuation mark, currency symbol, a
> syllable symbol in syllabic writing, or an ideograph. This is a
> non-technical definition, and there are differences in how people
> mentally divide text into text characters or recognize different
> graphic symbols as forms of a text character or as separate text
> characters. A text character is usually representable as a Unicode
> character or as a sequence of Unicode characters.

This definition has problems because it introduces a new term "text
character" that
ordinary people don't actually use, for what ostensibly is the ordinary
usage of the term "character". It is also itself potentially ambiguous between
the intended (but awkward) sense of "text[ual] {attributive} character" and
"character [in or of the] text".

A preferable approach, in my opinion, is to default to the
writing-system-specific terms
for units, when talking about these things: letters, syllables,
sinograms, aksaras, ligatures, etc.,
or the pieces: accent marks, strokes, radicals, components, jamos, etc.
If one wants
a technical cover term for such things, grapheme comes to mind, but if
trying to
explain things to the general public, "things that people think of as
characters" is
the workaround we usually apply.

> Character. A Unicode character or a text character. Normally the
> context makes it clear which one is meant. In the Unicode Standard,
> “character” normally means “Unicode character.”

Actually, I think this would contribute to the naturalistic fallacy you
cited above.

One of the biggest problems that the character encoding committees face is
the assumption by those new to the encoding process that once a
"character" has been identified by a proposal ("X is a character in my
writing system"), that inexorably implies that it should be encoded as
a "character" in Unicode. When of course, the identification of a "character"
in that sense (what the user or community thinks of as a character) is only
the first step in the analysis as to whether the entity in question is an
appropriate abstract character, and then further, as to whether that
character, once clearly identified, actually should be encoded (as a single
"Unicode encoded character").

> (I’m sure this would need clarifications and tuning. I presented it
> mainly to illustrate that clarity is possible.)

And what I've indicated are some of the reasons why I think fiddling
further with the
definition(s) of "character" is likely to lead to further problems,
rather than self-evidently
improve the situation.

Received on Wed Jul 13 2011 - 19:45:22 CDT
