Re: Code Point -- What is the integer?

From: Hans Aberg (haberg@math.su.se)
Date: Thu Apr 28 2005 - 15:17:56 CST


    At 18:43 -1000 2005/04/27, Sivakatirswami wrote:
    >OK, so my questions are:
    >
    >1) is the decimal expression for the capital letter A as 65
    >exactly correspondent to its integer code point position in the
    >total unicode series expressed as a series of integers?
    >
    >2) How can one ascertain the integer number for a code point
    >outside-above base ANSI?

    Unicode lacks a clear underlying mathematical model, or at least a
    clear description of one. So here is what it should be, regardless
    of what it actually is :-):

    First there is an intuitive notion of an "abstract character". Unicode
    tries to collect such abstract characters; most often they are just
    called "characters". Then, given a specific Unicode character, there
    are essentially two ways to identify it. One is via its character
    name, which is a finite string of the metacharacters A-Z, 0-9, "-"
    (hyphen) and " " (space); here, "meta" indicates that these are not
    themselves to be viewed as Unicode abstract characters, but as
    symbols outside the Unicode character set. The
    second way is by a non-negative integer, which is called the "code
    point", but which I prefer to call a character number. Likewise, this
    number is "meta" because it is not only outside the Unicode character
    set, but also outside any actual computer representation of this
    number. It is purely abstract. Since a computer works with binary
    numbers, representing the abstract characters inside it by their
    character numbers requires an integer-to-binary translation scheme,
    which is called an "encoding". Here it gets tricky, because Unicode
    bundles the character number and various integer-to-binary
    translation schemes together into single logical entities called
    "character encodings", which go under the names UTF-8/16/32.
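
    As a small illustration of the two ways to identify a character, and
    of how one character number looks under different translation
    schemes, here is a sketch in Python. It assumes the standard
    unicodedata module; the helper name describe is made up for this
    example, and the big-endian codec variants are chosen only so that no
    byte order mark is emitted.

```python
import unicodedata

def describe(ch):
    """Return (character name, character number) for a one-character string."""
    return unicodedata.name(ch), ord(ch)

name, number = describe("A")
print(name, number)  # LATIN CAPITAL LETTER A 65

# Both identifiers lead back to the same abstract character.
assert unicodedata.lookup(name) == chr(number) == "A"

# The same character number under three integer-to-binary schemes,
# shown as hexadecimal digits of the resulting bytes.
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    print(codec, "A".encode(codec).hex())
```

    The character number stays 65 throughout; only the bytes produced by
    each encoding differ.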

    So now to your questions: The Unicode character "A" has the character
    name "LATIN CAPITAL LETTER A", and the character number (or code
    point) 65; the latter is just an integer, and you may represent it as
    you want. When one writes U+x_1...x_k, that is really a notation
    meaning "the Unicode character having character number x_1...x_k in
    hexadecimal notation". In your example, the hexadecimal number 41 is
    the same as the decimal number 65. So they represent the same
    character. Still, these are just abstract numbers. In order to get
    such a number into a computer, one must choose a binary
    representation. In UTF-8, 65 is represented as the single byte
    01000001. Such binary numbers are conveniently written in
    hexadecimal, in which case it is 41.
    The clever thing here is that the original ASCII characters have
    Unicode numbers chosen in such a way that in UTF-8, they get the same
    binary representation as in ASCII. For characters outside ASCII there
    is no such coincidence: UTF-8 represents each of them as a sequence
    of two to four bytes. In the encodings UTF-16 and UTF-32, one gets
    the same result for the ASCII characters if one forgets about the
    padding bytes with value 0.
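
    To make the second question concrete, here is a sketch in Python of
    how one might look up the integer for a character outside ASCII and
    compare its encodings. The helpers u_notation and strip_zero_padding
    are hypothetical names for this example, the euro sign is just an
    arbitrary non-ASCII character, and big-endian byte order is assumed
    so that the zero bytes come first.

```python
def u_notation(ch):
    """Format a character number in the U+XXXX style (at least four hex digits)."""
    return "U+%04X" % ord(ch)

def strip_zero_padding(data):
    """Drop leading zero bytes, as found in big-endian UTF-16/UTF-32 output."""
    return data.lstrip(b"\x00")

for ch in ("A", "\u20ac"):  # "A" and the euro sign
    print(u_notation(ch), ord(ch), ch.encode("utf-8").hex())
# U+0041 65 41
# U+20AC 8364 e282ac

# Hexadecimal 41 and decimal 65 denote the same integer.
assert int("41", 16) == 65

# For an ASCII character, UTF-8 gives exactly the ASCII byte, and
# UTF-16/UTF-32 differ from it only by zero padding.
assert "A".encode("utf-8") == "A".encode("ascii") == b"\x41"
assert strip_zero_padding("A".encode("utf-16-be")) == b"\x41"
assert strip_zero_padding("A".encode("utf-32-be")) == b"\x41"

# Outside ASCII, UTF-8 needs more than one byte.
assert "\u20ac".encode("utf-8") == b"\xe2\x82\xac"
```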

    -- 
       Hans Aberg
    

