Re: Code Point -- What is the integer?

From: Hans Aberg (
Date: Fri Apr 29 2005 - 04:01:11 CST

    At 21:26 -0700 2005/04/28, Asmus Freytag wrote:
    >I think the encoding model used by Unicode is reasonably well
    >presented in Unicode Technical Report #17: "Character Encoding
    >Model" If you think that
    >presentation should be improved, I invite you to file a specific
    >suggestion using the online reporting form.

    This is essentially OK. You have a practical problem of bringing it
    out to the public, it seems. :-)

    The thing I would have done somewhat differently, as a
    mathematician, is to develop it around a group of separate concepts
    and then link them together, rather than throwing the different
    pieces all together in one lump.

    For example, I would not have used the word "character" everywhere,
    and I would have used the word "set" for a collection of something,
    rather than different words like "repertoire". So "abstract character
    set" seems better than "abstract character repertoire" in a
    technical definition, although the latter term might be used
    informally. Then, by Bourbaki "abuse of language", agreeing to drop
    the word "abstract" when the context is clear, what you call "Coded
    Character Set", I would have called "character set numbering". There
    is also a mathematical difference between
       a mapping from an abstract character repertoire to a set of nonnegative
       integers
    and
       a mapping from an abstract character repertoire to the set of nonnegative
       integers.
    In modern formal mathematical language, a function comes with both
    domain and codomain. Even though Unicode probably thinks of having
    this codomain fixed and finite, it suffices in this context to have
    it to be the set of non-negative integers (i.e., the set of natural
    numbers). Then you have
       Character Encoding Form
         a mapping from a set of nonnegative integers that are elements of
         a CCS to a set of sequences of particular code units of some
         specified width, such as 32-bit integers
       Character Encoding Scheme
         a reversible transformation from a set of sequences of code units (from one
         or more CEFs) to a serialized sequence of bytes
    Here I would have inserted the concept of an integer to binary
    transformation (function, map), which as such has nothing to do
    with characters. One gets a character encoding when combining
    the character numbering map with an integer to binary transformation.
    Also, the wording "[the] integers that are elements of a CCS" is
    formally incorrect, as they are part of the range (i.e., map image)
    of the CCS; so it should have been "[the] integers that are in the
    range of a CCS". From the definitions, it is hard to immediately see
    the difference between "Character Encoding Form" and "Character
    Encoding Scheme"; it appears that the former means that the codomain
    of the character number map has been fixed, whereas the latter means
    an integer to binary encoding with restricted domain. Also, does the
    word "reversible" indicate that the map is invertible or injective
    (one-to-one)? If the map is injective, then the inverse image of
    every singleton is a singleton, so that the character sequence can be
    extracted from the encoded text. Then, as there are many character
    maps involved, I would have given your "character map" notion a more
    descriptive name, "character [to binary] encoding", i.e., what you
    get when combining the two maps, the character numbering map, and the
    integer to binary map. You can insert notions of domain and codomain
    restrictions here, but the final map, the character encoding map,
    will of course be the same if the original character set is not cut
    off in the process.
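
    To make the composition concrete, here is a minimal Python sketch of
    the two maps and their combination. All function names are
    hypothetical, chosen only for illustration; ord() stands in for the
    character numbering map, and a fixed-width 4-byte big-endian layout
    is assumed for the integer to binary map. It is one possible reading
    of the argument above, not the wording of UTR #17.

```python
# Illustrative sketch only: every name below is an assumption made for this example.

def number(ch: str) -> int:
    """Character numbering map: abstract character -> non-negative integer.
    Python's ord() is used as a stand-in for the code point assignment."""
    return ord(ch)

def int_to_bytes(n: int) -> bytes:
    """Integer to binary map: knows nothing about characters.
    Assumed layout: fixed width, 4 bytes, big-endian."""
    return n.to_bytes(4, "big")

def bytes_to_int(b: bytes) -> int:
    """Inverse of int_to_bytes on its image."""
    return int.from_bytes(b, "big")

def encode(text: str) -> bytes:
    """Character encoding: the composition of the two maps above."""
    return b"".join(int_to_bytes(number(ch)) for ch in text)

def decode(data: bytes) -> str:
    """Recover the characters; possible because both maps are injective."""
    return "".join(
        chr(bytes_to_int(data[i:i + 4])) for i in range(0, len(data), 4)
    )

def is_injective(f, domain) -> bool:
    """True if f sends distinct elements of the finite domain to distinct images."""
    images = [f(x) for x in domain]
    return len(set(images)) == len(images)
```

    With this particular choice of integer to binary map, encode()
    happens to produce the same bytes as Python's "utf-32-be" codec for
    ordinary text. Swapping in a different int_to_bytes changes the
    encoding without touching the numbering, which is the separation of
    concepts argued for above.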

    In short, there is nothing wrong with the model itself, but there are
    some problems in focusing it logically, and in its definitions.

       Hans Aberg
