Re: Code Point -- What is the integer?

From: Hans Aberg (
Date: Fri Apr 29 2005 - 05:56:56 CST

    At 13:28 +0300 2005/04/29, Jukka K. Korpela wrote:
    > > The thing I would have done somewhat differently, as a
    >> mathematician, is to develop it around a group of separate concepts,
    >> and then link them together, rather than throwing the different pieces
    >> all together in one lump.
    >Having a mathematical background, I have somewhat similar thoughts on the
    >character concept in Unicode. But we must remember that Unicode tries to
    >cover issues of human behavior and understanding, which are
    >("unfortunately", some people might add) not quite rigorously

    The reason for the current Unicode terminology is probably not that,
    but that it has been developed empirically over the years, without
    distinct formalization. The method of modern math is clear: First
    make a clear logical definition, but then also supply intuitive user

    > > For example, I would not have used the word "character" everywhere, and
    >> would have used the word "set" for a collection of something, rather than
    >> different words like "repertoire".
    >The word "set" was already in use for an ordered and often coded

    In math, that would be an ordered set or a sequence.

    > > So, "abstract character set" seems
    >> better than "abstract character repertoire"
    >In some context maybe, but my main problem now, with the definitions, is
    >the multitude of ways in which the word "character" and the expression
    >"abstract character" are used.

    This seems to be a problem: The word "character" is thrown in just
    about everywhere, even in contexts where it is not needed.

    >Moreover, does "abstract character
    >repertoire" parse as repertoire of abstract characters or as a character
    >repertoire that is abstract?

    One should note that defining it as "abstract character sets" does
    not exclude the use of the word "repertoire", either as an
    informal, supportive notion, or perhaps as a formal definition: 'An
    "abstract character repertoire" is an abstract character set
    satisfying ...'

    >Here's what the Unicode standard itself says in its glossary:
    >It describes the term "character" in different meanings.
    >The first one is: "The smallest component of written language that has
    >semantic value; refers to the abstract meaning and/or shape, rather than a
    >specific shape (see also glyph), though in code tables some form of visual
    >representation is essential for the reader's understanding." The second
    >meaning is that "character" is a synonym for "abstract character", which is
    >defined as "a unit of information used for the organization, control, or
    >representation of textual data".
    >The most obvious difference between character and abstract character seems
    >to be that an abstract character could be a control function (say, newline
    >or ESC), whereas a character is what many people call a graphic character
    >or a printable character. But I don't think such a distinction is drawn

    An abstract character, as opposed to a character, is a formal concept
    within the Unicode standard. This is in fact mentioned in the
    "The word abstract means that these objects are defined by convention."
    Again, the problem seems to be that these definitions and concepts
    are spread a bit everywhere in the multitude of Unicode documents

    If one should attempt to define the concept of an abstract character, I
    noticed here independently that it seems to be a linguistic semantic
    unit that in some sense is atomic. Let's call that a semantic
    abstract character. Then Unicode also supplies other abstract
    characters, for example those that are used in rendering. Perhaps
    these should be called rendering abstract characters. A rendering
    abstract character need not be a glyph, for example if it is used only
    to indicate layout. There are probably more types of abstract
    characters. For example, when inputting minus, hyphen or a dash,
    often a single "-" can be used. So there should perhaps be a notion
    of input abstract characters. And so on. Unicode mixes all these
    together in the notion of "abstract characters", without explicitly
    separating them.
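
    To make the input example concrete, here is a minimal Python sketch
    using the standard unicodedata module. The list names and the
    describe() helper are made up for illustration, and the choice of
    sample characters is mine. It only shows that the one typed "-" can
    stand for several distinct code points, and that a layout-only
    character falls in a different general category than the dashes.

```python
import unicodedata

# Characters that a single typed "-" is often used to stand in for
# (an illustrative selection, not an exhaustive list).
DASH_LIKE = ["\u002D", "\u2010", "\u2212", "\u2013", "\u2014"]

# Characters that only affect layout or control and have no visible glyph.
LAYOUT_ONLY = ["\u200B", "\n"]


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    code_point = f"U+{ord(ch):04X}"
    # Control characters such as newline have no name in the database,
    # so a placeholder is supplied instead of letting name() raise.
    name = unicodedata.name(ch, "<unnamed>")
    return (code_point, name, unicodedata.category(ch))


for ch in DASH_LIKE + LAYOUT_ONLY:
    print(*describe(ch))
```

    The general category in the third column (Pd, Sm, Cf, Cc) is the
    nearest thing the character database has to the separation suggested
    above, but it is a property attached to each code point, not a
    division of the notion of "abstract character" itself.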

       Hans Aberg
