# Re: Code Point -- What is the integer?

From: Hans Aberg (haberg@math.su.se)
Date: Fri Apr 29 2005 - 05:56:56 CST


At 13:28 +0300 2005/04/29, Jukka K. Korpela wrote:
>> The things I would have done somewhat differently, as a
>> mathematician, is to develop it around a group of separate concepts,
>> then linking them together, rather than throwing the different pieces
>> altogether in one lump.
>
>Having a mathematical background, I have somewhat similar thoughts on the
>character concept in Unicode. But we must remember that Unicode tries to
>cover issues of human behavior and understanding, which are
>("unfortunately", some people might add) not quite rigorously
>formalizable.

The reason for the current Unicode terminology is probably not that,
but that it has been developed empirically over the years, without
distinct formalization. The method of modern math is clear: First
make a clear logical definition, but then also supply intuitive user
concepts.

>> For example, I would not have used the word "character" everywhere, and
>> used the word "set" for a collection of something, rather than
>> different words like "repertoire".
>
>The word "set" was already in use for an ordered and often coded
>collection.

In math, that would be an ordered set or a sequence.

>> So, "abstract character set" seems
>> better than "abstract character repertoire"
>
>In some context maybe, but my main problem now, with the definitions, is
>the multitude of ways in which the word "character" and the expression
>"abstract character" are used.

This seems to be a problem: The word "character" is thrown in just
about everywhere, even in contexts where it is not needed.

>Moreover, does "abstract character
>repertoire" parse as repertoire of abstract characters or as a character
>repertoire that is abstract?

One should note that defining it as "abstract character sets" does
not exclude the use of the word "repertoire", either as an
informal, supportive notion, or perhaps as a formal definition: 'An
"abstract character repertoire" is an abstract character set
satisfying ...'

>Here's what the Unicode standard itself says in its glossary:
>It describes the term "character" in different meanings.
>The first one is: "The smallest component of written language that has
>semantic value; refers to the abstract meaning and/or shape, rather than a
>specific shape (see also glyph), though in code tables some form of visual
>representation is essential for the reader's understanding." The second
>meaning is that "character" is a synonym for "abstract character", which is
>defined as "a unit of information used for the organization, control, or
>representation of textual data".
>
>The most obvious difference between character and abstract character seems
>to be that an abstract character could be a control function (say, newline
>or ESC), whereas a character is what many people call a graphic character
>or a printable character. But I don't think such a distinction is drawn
>systematically.

An abstract character, as opposed to a character, is a formal concept
within the Unicode standard. This is in fact mentioned in
http://www.unicode.org/reports/tr17/
"The word abstract means that these objects are defined by convention."
Again, the problem seems to be that these definitions and concepts
are scattered across the many Unicode documents that have been
produced.

If one should attempt to define the concept of an abstract character, I
noticed here independently that it seems to be a linguistic semantic
unit that in some sense is atomic. Let's call that a semantic
abstract character. Then Unicode also supplies other abstract
characters, for example those that are used in rendering. Perhaps
these should be called rendering abstract characters. A rendering
abstract character need not be a glyph, for example if it is used only
to indicate layout. There are probably more types of abstract
characters. For example, when inputting minus, hyphen or a dash,
often a single "-" can be used. So there should perhaps be a notion
of input abstract characters. And so on. Unicode mixes all these
together in the notion of "abstract characters", without explicitly
separating them.
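
As a rough illustration of the distinction, here is a small sketch using
Python's standard unicodedata module. The group labels ("semantic",
"rendering", "control", "input") are hypothetical names taken from the
paragraph above, not Unicode terminology, and the choice of sample
characters is my own assumption. Only the general categories and names it
prints come from the Unicode data itself.

```python
import unicodedata

# Hypothetical grouping for illustration only; these labels are not
# defined anywhere in the Unicode standard.
SAMPLES = {
    "semantic": ["A", "\u00e9"],           # letters
    "rendering": ["\u200d", "\u00ad"],     # ZERO WIDTH JOINER, SOFT HYPHEN
    "control": ["\n", "\x1b"],             # newline, ESC
    # Code points that a single typed "-" might be meant to stand for.
    "input": ["\u002d", "\u2010", "\u2212", "\u2013"],
}


def describe(ch):
    """Return (code point, general category, name) for one character."""
    code_point = f"U+{ord(ch):04X}"
    category = unicodedata.category(ch)
    # Control characters have no formal name, so supply a placeholder.
    name = unicodedata.name(ch, "<unnamed>")
    return (code_point, category, name)


if __name__ == "__main__":
    for label, chars in SAMPLES.items():
        for ch in chars:
            print(label, *describe(ch))
```

The general category separates some of these cases (Cc for the controls,
Cf for the joiner and the soft hyphen, Sm for the minus sign as opposed to
Pd for the hyphens and dashes), but it is a property attached to each
code point, not a separate definition of different kinds of abstract
character.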

--
Hans Aberg
