Definition of character: Exegesis of SC2 nomenclature

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jul 09 2002 - 19:16:24 EDT


One possibly interesting thing derived from "the threads from hell"
is the notion that the definition of character offered in the
various ISO JTC1/SC2 character encoding standards and TRs such
as the Character-Glyph Model (TR 15285) may be leading people astray
about what is appropriate to encode as a character.

Here is an attempt at an exegesis.

The standard SC2 definition of a character is:

"A member of a set of elements used for the organization, control, or
representation of data."

[Quoted from ISO/IEC 10646, Clause 4 Terms and definitions, but you
can find the same definition in other SC2 standards, including each
part of ISO/IEC 8859, and in ISO/IEC 2022.]

The *reason* why SC2 chose such a strange and seemingly open-ended
definition was *not* to invite arbitrarily strange collections of
data control elements to be encoded as characters, but rather to
get the definition, in a procrustean way, to fit the reality.

In the ISO 2022 architectural framework for character encodings,
specific character set definitions are declared as consisting of
one or more sets of graphic characters (G0 and G1 sets) and one
or more sets of control functions (C0 and C1 sets), where the
graphic characters come from registered (graphic) character encodings
and where the control functions come from registered control function
sets. The graphic character encodings are the typical character
encodings we are familiar with, of which ISO/IEC 8859-1 ("Latin-1")
is a prototypical example -- a bunch of visible letters, digits,
punctuation, and symbols. The control function sets are small sets
of functions designed for the manipulation and control of characters
in various device contexts (mostly terminal hardware), and consist
of things like line advance, moving the cursor back and forwards,
indicating start and end of transmission context, marking string
delimitations, and the like. The best known of these control function
sets is defined in ISO 6429, and its C0 set is also grandfathered
into ASCII as the familiar ASCII "control codes" -- the same
codes that are listed in Unicode as aliases for U+0000..U+001F
(U+0000 "null", U+0001 "start of heading", ... U+0008 "backspace",
U+0009 "tab", and so on).
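
As a rough illustration of the layout just described, here is a minimal
Python sketch that sorts 8-bit code values into control and graphic
areas, using the arrangement found in ISO/IEC 8859-1 (G0 in the left
half, G1 in the right half). The function names code_area and
unicode_category are made up for this example, and the special status
of 0x20 SPACE and 0x7F DELETE is deliberately ignored, so treat it as
a simplification rather than a statement of what ISO 2022 specifies.

```python
import unicodedata


def code_area(byte: int) -> str:
    """Return the code table area an 8-bit value falls into.

    Simplifying assumption: 0x20 and 0x7F are lumped in with G0.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0x00..0xFF")
    if byte <= 0x1F:
        return "C0"  # control functions, e.g. the ASCII control codes
    if byte <= 0x7F:
        return "G0"  # graphic characters, left half
    if byte <= 0x9F:
        return "C1"  # second set of control functions
    return "G1"      # graphic characters, right half


def unicode_category(byte: int) -> str:
    """Unicode General_Category of the byte read as ISO/IEC 8859-1."""
    return unicodedata.category(bytes([byte]).decode("latin-1"))


if __name__ == "__main__":
    for b in (0x00, 0x08, 0x09, 0x41, 0x85, 0xE9):
        print(f"0x{b:02X}  {code_area(b)}  {unicode_category(b)}")
```

Every value that lands in C0 or C1 here carries the Unicode general
category Cc (control), which is how the old control function positions
show up in Unicode character data.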

Note that the control functions are not just any imaginable set of
functions -- they are functions designed by people interested in
controlling characters on existing classes of output display devices
(terminals and teletypes, primarily). And not all terminal control
functions were defined as control functions in these sets, either.
Large classes of such functions were left up to vendor implementation,
and made use of ESC(ape) sequences for their initiation.

In the context of SC2 character encoding standards, a cover term
for "character" was needed which was broad enough to deal with the
existing, "on the ground" implementation fact that systems included
graphic characters *and* control characters mixed in character data
streams. The graphic characters were conceived of as representing
the content of text, primarily. And the then-existing usage of
control characters was primarily to "organize" and "control" the
representation of such data, by establishing line breaks, page
breaks, string or other text unit delimitations, backspacing, and
the like. Hence the committee compromise definition of "character"
quoted above.

That definition should be understood in the context of this history,
however. It is not legal license for intentional or unintentional
misunderstandings of the appropriate scope of character encodings,
which should be focussed on textual content, together with the
minimal additional format control specification required for text
organization.

Modern text representational practice, in a world that has
mostly abandoned character terminal display to niche and legacy uses, and
which instead uses graphic displays and image models, combined with
rasterizing of outline fonts for textual display, has essentially
made most of the ISO 6429 control functions obsolete. The Unicode
Standard only specifies the few control functions that have survived
into modern plain text handling conventions: CR, LF, FF, and tabs,
among them. On the other hand, the Unicode plain text model has
necessitated the addition of new format control characters that
were not envisioned in the terminal control function sets, or which
were organized differently for them. A good case in point is the
set of Unicode bidi format control characters, which are used with
the bidirectional algorithm to override default implicit bidi
ordering for various edge cases. Those differ from the bidirectional
formatting control functions which were earlier designed for use
on designated character terminals, with fixed-size cells and fixed
line widths, for laying out visual order bidi text legibly via
control of cursor position and direction when fed a serial byte
stream to be laid out.
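
The split between the surviving control functions and the newer format
control characters is visible in Unicode character data. The sketch
below uses Python's standard unicodedata module to print the general
category and bidi class of a few of each. The selection of sample
characters and the helper name describe are choices made for this
example, not an exhaustive or official list.

```python
import unicodedata

# Control functions that survive in plain text handling (category Cc).
SURVIVING_CONTROLS = {
    "TAB": "\u0009",
    "LF": "\u000A",
    "FF": "\u000C",
    "CR": "\u000D",
}

# A sample of Unicode bidi format control characters (category Cf).
BIDI_FORMAT_CONTROLS = {
    "LRM": "\u200E",
    "RLM": "\u200F",
    "LRE": "\u202A",
    "RLE": "\u202B",
    "PDF": "\u202C",
    "LRO": "\u202D",
    "RLO": "\u202E",
}


def describe(ch: str) -> tuple[str, str]:
    """Return (General_Category, Bidi_Class) for one character."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


if __name__ == "__main__":
    for group in (SURVIVING_CONTROLS, BIDI_FORMAT_CONTROLS):
        for label, ch in group.items():
            category, bidi_class = describe(ch)
            print(f"U+{ord(ch):04X} {label:<4} {category} {bidi_class}")
```

The old controls report Cc and the bidi characters report Cf, which
matches the distinction drawn above between inherited control
functions and format controls added for the Unicode plain text model.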

Note that in any case, the old control functions (aimed at serial
output devices) and the new Unicode format control characters
(aimed at rendering of text in a graphic text layout + fonts
world) are focussed on *text* related data processing.

Now one could start from the SC2 definition, without understanding
all this context, and conclude that the *data* should cover
weather reports on Mars, and that the organization, control, and
representation of those weather reports should include "characters"
to reprogram specific barometric instruments on robotic landers
and to lay out 3-dimensional colorized diagrams of Martian
sandstorms and weather fronts. However, I can assure you that was
not and is not the intent of SC2 in the definition of character.

It is a misconstrual of the intent of the character encoding
committees to start advocating the encoding of such things as
color characters or:

U+F390 MULTIPLY THE CONTENTS OF THE ACCUMULATOR REGISTER BY THE BASE
         VALUE AND ADD DECIMAL 0

The latter is indeed an example of organizing or controlling data,
but is hopelessly out of scope for a *character* encoding, since
it takes a field of data control (computer machine instructions)
totally unrelated to textual content and attempts to catalog it
as a set of characters. It should not be surprising that a list
devoted to the discussion of character encoding, embedded in a
long historical context of what the scope for such encodings
appropriately entails, greets proposals such as the above with
derision rather than approbation.

--Ken


