**From:** Hans Aberg (*haberg@math.su.se*)

**Date:** Thu Apr 28 2005 - 15:17:56 CST

**In reply to:** Sivakatirswami: "Code Point -- What is the integer?"

At 18:43 -1000 2005/04/27, Sivakatirswami wrote:

>OK, so my questions are:
>
>1) is the decimal expression for the capital letter A as 65
>exactly correspondent to its integer code point position in the
>total unicode series expressed as a series of integers?
>
>2) How can one ascertain the integer number for a code point
>outside-above base ANSI?

Unicode lacks a clear underlying mathematical model, or at least a
clear description of one. So here is what it should be, regardless of
what it actually is :-):

First there is an intuitive notion of an "abstract character". Unicode
tries to collect such abstract characters; most often they are just
called "characters". Then, given a specific Unicode character, there
are essentially two ways to identify it. One is via its character
name, which is a finite string of the metacharacters A-Z, 0-9, "-"
(hyphen) and " " (space); here, "meta" indicates that these are to be
viewed as abstract characters that stand outside the Unicode
character set. The second way is by a non-negative integer, which is
called the "code point", but which I prefer to call a character
number. Likewise, this number is "meta" because it is not only outside
the Unicode character set, but also independent of any actual
computer representation of the number. It is purely abstract. In
order to represent the abstract characters inside a computer using
the character numbers, as the computer works with binary numbers, one
needs to introduce an integer-to-binary translation scheme, which is
called an "encoding". Here it gets tricky, because Unicode bundles the
character number and various integer-to-binary translation schemes
together into single logical entities called "character encodings",
which go under the names UTF-8/16/32.
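The two ways of identifying a character described above can be tried out directly. What follows is only a minimal sketch in Python, which is an assumption made here for illustration since the thread does not name a language; the helper name `identify` is likewise hypothetical. It relies on the standard `unicodedata` module, where `unicodedata.name` gives the character name, `ord` gives the character number, and `unicodedata.lookup` and `chr` go back from either identifier to the character.

```python
import unicodedata


def identify(ch):
    """Return (character name, character number) for a single character.

    Hypothetical helper for illustration; both values are abstract
    identifiers and say nothing about how the character is stored.
    """
    return unicodedata.name(ch), ord(ch)


name, number = identify("A")
# Show the number in decimal and in the usual U+ hexadecimal notation.
print(name, number, f"U+{number:04X}")

# Either identifier alone is enough to recover the character.
assert unicodedata.lookup("LATIN CAPITAL LETTER A") == "A"
assert chr(65) == "A"
```

Nothing in this sketch involves bytes yet; the name and the number are both independent of UTF-8, UTF-16 or UTF-32.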

So now to your questions: The Unicode character "A" has the character
name "LATIN CAPITAL LETTER A" and the character number (or code
point) 65; the latter is just an integer, and you may represent it as
you want. When one writes U+x_1...x_k, that is really a notation
meaning "the Unicode character having character number x_1...x_k in
hexadecimal notation". In your example, the hexadecimal number 41 is
the same as the decimal number 65, so they denote the same
character. Still, these are just abstract numbers. In order to get
the character into a computer, one must find a binary
representation. In UTF-8, 65 is represented as the single byte
01000001 in binary. Such binary numbers can easily be written using
hexadecimal numbers, in which case it is 41. The clever thing here is
that the original ASCII characters have Unicode numbers such that in
UTF-8 they get the same binary representation as in ASCII. For
characters outside ASCII there is no such single-byte representation;
UTF-8 encodes them as sequences of two to four bytes. In the
encodings UTF-16 and UTF-32, one gets the same result for "A" if one
disregards the extra bytes with value 0.
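The relationship between the character number and its byte representations can be checked with a short sketch. Python and the helper name `encoded_bytes` are assumptions made for illustration and are not part of the original message. The big-endian codecs `utf-16-be` and `utf-32-be` are used so that the zero bytes come first and no byte order mark is added.

```python
def encoded_bytes(ch):
    """Return the bytes of one character in each encoding form, as hex strings.

    Hypothetical helper for illustration only.
    """
    forms = ("utf-8", "utf-16-be", "utf-32-be")
    return {form: ch.encode(form).hex() for form in forms}


# Hexadecimal 41 and decimal 65 are the same integer.
assert int("41", 16) == 65

# "A" (ASCII), U+00E9 (outside ASCII) and U+1D11E (beyond 16 bits).
for ch in ("A", "\u00e9", "\U0001d11e"):
    print(f"U+{ord(ch):04X}", encoded_bytes(ch))
```

For "A" the three results differ only in zero padding, while the other two characters show that the bytes in UTF-8 and UTF-16 need not spell out the character number directly. The number itself is always available through `ord`, whichever encoding is used for storage.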

-- Hans Aberg

