Re: Code Point -- What is the integer?

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Apr 28 2005 - 12:48:58 CST


    > Namaskar and Aloha from the offices of Himalayan Academy Publications
    > in Hawaii...

    Welcome to the Unicode list!

    > 1. The code points described as a simple series of integers from
    >
    > 1 to 1,123,000 (or whatever that last integer is that is equivalent to:
    > U+10FFFF)

    The last decimal number is 1,114,111, FYI. (Easier to remember as 1114111.)
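
    A quick check of that figure, as a minimal Python sketch (the variable
    names are just illustrative). Note that the series starts at 0 (U+0000),
    not 1, so the count of code points is one more than the last value.

```python
# Convert the hexadecimal upper limit of the Unicode codespace to decimal.
last_code_point = int("10FFFF", 16)

# The series starts at 0 (U+0000), so the number of code points
# is one more than the last value.
total_code_points = last_code_point + 1

print(last_code_point)    # 1114111
print(total_code_points)  # 1114112
```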

     
    > "Unicode is just a long series from One to over One Million and
    > there is a character in each place and the whole list includes all the
    > characters of all the languages known to man, past and present."

    Well, the project isn't finished yet. Some characters are still
    not encoded, such as Egyptian hieroglyphs. But that is the essence
    of the project, yes.
     
    > 2. but then we move on to: " Unicode characters may be encoded at any
    > code point from U+0000 to U+10FFFF" and now we begin to slide into the
    > "nerd realm"

    Frank provided a nice summary of the justification for why engineers
    prefer hexadecimal representations.

    >
    > I understand "004F" to be the hexadecimal representation for four
    > separate, 4-bit sequences.

    Well, actually one 16-bit sequence: 0000000001001111

    But for readability, that is often broken up into a sequence of
    4-bit sequences called "nibbles": 0000 0000 0100 1111
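
    The same grouping can be reproduced in a few lines of Python. This is
    only a sketch; the helper name to_nibbles and the default width of
    16 bits are assumptions made for illustration.

```python
def to_nibbles(value, bits=16):
    """Return value as a zero-padded binary string split into 4-bit groups."""
    binary = format(value, "0{}b".format(bits))
    return " ".join(binary[i:i + 4] for i in range(0, len(binary), 4))

print(format(0x004F, "016b"))  # 0000000001001111
print(to_nibbles(0x004F))      # 0000 0000 0100 1111
```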

    >
    > for purposes of a diagram, I would like to translate any given such
    > code point designation like A = U+0041 to its integer position in the
    > series. (aside question: what do you call that kind of "label" for the
    > code point: "U+****"?)

    The Unicode Standard just calls it the "code point".

    The ISO/IEC 10646 international standard calls it the "short identifier
    for code positions".

    The two terms refer to the same thing.

    >
    > e.g. expressed verbally, if one were writing an article for "mom and
    > pop"
    >
    > The capital letter A is number "65" in the series... but computer
    > geeks like to express it in hexidecimal form like this, "U+0041" and if
                                  ^^^^^^^^^^^
                                  hexadecimal (often misspelled ;-) )
    > you really need to describe it to the computer then it is "0000 0000
    > 0100 0001"
    >
    > or in a diagram simply
    >
    > A --> 65 --> U+0041 --> 0000 0000 0100 0001
    >
    > And ditto for one Tamil Char and one Chinese character... but my
    > problem is ascertaining the second, simple integer, segment...
    >
    > OK, so my questions are:
    >
    > 1) is the decimal expression for the capital letter A as 65 exactly
    > correspondent to its integer code point position in the total unicode
    > series expressed as a series of integers?

    Yes.
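
    For the diagram, a rough Python sketch can produce each row. The
    function name diagram_row is hypothetical, and the 16-bit binary
    column assumes a code point no larger than U+FFFF. U+0BE6 is the
    Tamil position discussed in this thread; U+4E2D is just one Chinese
    character picked as an example.

```python
def diagram_row(ch):
    """Build 'char --> integer --> U+XXXX --> binary' for one character.

    Assumes the code point fits in 16 bits (at most U+FFFF).
    """
    n = ord(ch)                      # integer position in the series
    label = "U+{:04X}".format(n)     # hexadecimal label, at least 4 digits
    binary = format(n, "016b")
    nibbles = " ".join(binary[i:i + 4] for i in range(0, 16, 4))
    return "{} --> {} --> {} --> {}".format(ch, n, label, nibbles)

print(diagram_row("A"))   # A --> 65 --> U+0041 --> 0000 0000 0100 0001
print(ord("\u0BE6"))      # 3046  (a Tamil code point)
print(ord("\u4E2D"))      # 20013 (a Chinese character)
```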

    >
    > 2) How can one ascertain the integer number for a code point
    > outside-above base ANSI?

    The easiest way is to make use of the calculators that are
    available as desk accessories on almost any computer. (Windows,
    Mac, Solaris, Linux, etc., all have one.)

    On Windows: Programs > Accessories > Calculator

    Set it to "Scientific". Choose "Hex". Type in the hexadecimal
    number (e.g. "BE6"). Hit the "Dec" button, and presto, it
    changes to "3046", which is the decimal number equivalent of
    hexadecimal 0x0BE6.
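
    What the calculator does under the hood is plain positional
    arithmetic: each hex digit is worth 16 times the one to its right,
    so BE6 is 11*256 + 14*16 + 6. A small Python sketch of that (the
    function name hex_to_decimal is made up for illustration; the
    built-in int() does the same job):

```python
def hex_to_decimal(hex_digits):
    """Convert a string of hexadecimal digits to an integer, digit by digit."""
    digits = "0123456789ABCDEF"
    total = 0
    for d in hex_digits.upper():
        # Shift the running total one hex place left, then add this digit.
        total = total * 16 + digits.index(d)
    return total

print(hex_to_decimal("BE6"))  # 3046
print(int("BE6", 16))         # 3046, using the built-in conversion
print(format(3046, "X"))      # BE6, going back the other way
```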

     
    > So we want to be able to say, for the layman:
    >
    > "The entire Tamil alphabet is contained between characters 2560 and
    > 2843 in the unicode series" But one needs to

    The block for Tamil is U+0B80..U+0BFF. So if you convert those
    numbers to decimal, the range is: 2944..3071.
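
    The block boundaries can be checked the same way. In this sketch the
    constant names and the helper in_tamil_block are illustrative only.
    Keep in mind that a block is a reserved range, and not every
    position inside it necessarily has a character assigned.

```python
# Tamil block boundaries as given above: U+0B80..U+0BFF.
TAMIL_FIRST = 0x0B80
TAMIL_LAST = 0x0BFF

print(TAMIL_FIRST, TAMIL_LAST)       # 2944 3071
print(TAMIL_LAST - TAMIL_FIRST + 1)  # 128 positions in the block

def in_tamil_block(ch):
    """Return True if the character's code point lies in U+0B80..U+0BFF."""
    return TAMIL_FIRST <= ord(ch) <= TAMIL_LAST

print(in_tamil_block("\u0BE6"))  # True
print(in_tamil_block("A"))       # False
```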

    > a) be able find where those blocks are (where do you go to find the
    > blocks beginning and endings for different languages)

    Go to the chart pages that other respondents already pointed
    you to.

    > b) be able to translate "U+0BE6" (which is a position in the Tamil set)
    > back to a simple integer in the series. If I just "do the math" using
    > the same correlation for the Letter A ["0041" = "65", therefore 0BE6 must
    > equal **** ] ... will it be correct?

    Yes. And U+0BE6 --> decimal 3046.
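
    If you have many labels to convert, a short Python sketch can go
    from the "U+" notation to the integer and back. The function names
    label_to_int and int_to_label are hypothetical, chosen for this
    example.

```python
def label_to_int(label):
    """Convert a label such as 'U+0BE6' to its integer code point."""
    if not label.upper().startswith("U+"):
        raise ValueError("expected a label of the form U+XXXX")
    return int(label[2:], 16)

def int_to_label(n):
    """Convert an integer code point to 'U+' notation, at least 4 hex digits."""
    return "U+{:04X}".format(n)

print(label_to_int("U+0041"))    # 65
print(label_to_int("U+0BE6"))    # 3046
print(label_to_int("U+10FFFF"))  # 1114111
print(int_to_label(3046))        # U+0BE6
```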

    > I'm hoping I can go somewhere to find this info easily from some
    > tables....

    Just use the calculator accessories. It is easy.

    --Ken


