Code Point -- What is the integer?

From: Sivakatirswami (katir@hindu.org)
Date: Wed Apr 27 2005 - 22:43:45 CST


    Namaskar and Aloha from the offices of Himalayan Academy Publications
    in Hawaii...

    Where we are just slowly learning about Unicode in our publications
    work..

    I'm writing a short article on Unicode in a "public" magazine (Hinduism
    Today) about Mac OS X Tiger (10.4) support for Tamil Unicode...

    I need to get down to a very layman's level and only have a very small
    space allotment.

    Despite reading all the documents (I downloaded *all* the PDFs for
    the 4.0 standard book) I *still* have trouble getting my head around
    the difference between

    1. The code points described as a simple series of integers from

    1 to 1,123,000 (or whatever that last integer is that is equivalent to:
      U+10FFFF)

    This being the simplest way a layman can visualize it, albeit the
    latter number is big... it is still easy to describe and visualize
    (roughly of course) as in:

      "Unicode is just a long series from One to over One Million and
    there is a character in each place and the whole list includes all the
    characters of all the languages known to man, past and present."

    Which of course sounds at the very least "cool" for the glib-minded and
    incredibly groundbreaking for those who can see it for what it is...
    (if true, which it seems to be...)

    2. But then we move on to: "Unicode characters may be encoded at any
    code point from U+0000 to U+10FFFF" and now we begin to slide into the
    "nerd realm".

    I understand "004F" to be the hexadecimal representation for four
    separate, 4-bit sequences.
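
A minimal sketch of that reading, assuming Python 3 (the variable names are only illustrative). It splits "004F" into its four 4-bit groups and converts U+10FFFF to a plain integer. One thing it suggests: the series appears to start at zero (U+0000) rather than one, so the last position and the total count differ by one.

```python
# Each hexadecimal digit stands for one 4-bit group.
hex_digits = "004F"
nibbles = [format(int(digit, 16), "04b") for digit in hex_digits]
print(nibbles)  # ['0000', '0000', '0100', '1111']

# The same four digits read as one hexadecimal number.
value = int(hex_digits, 16)
print(value)  # 79

# The last code point, U+10FFFF, as a plain integer.
last_code_point = int("10FFFF", 16)
print(last_code_point)  # 1114111

# Counting from U+0000, the number of positions is one more than that.
total_code_points = last_code_point + 1
print(total_code_points)  # 1114112
```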

    For purposes of a diagram, I would like to translate any given such
    code point designation like A = U+0041 to its integer position in the
    series. (aside question: what do you call that kind of "label" for the
    code point: "U+****"?)

    e.g. expressed verbally, if one were writing an article for "mom and
    pop"

    The capital letter A is number "65" in the series... but computer
    geeks like to express it in hexadecimal form like this, "U+0041" and if
    you really need to describe it to the computer then it is "0000 0000
    0100 0001"

    or in a diagram simply

    A --> 65 --> U+0041 --> 0000 0000 0100 0001

    And ditto for one Tamil char and one Chinese character... but my
    problem is ascertaining the second, simple integer, segment...
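
The diagram row could be produced mechanically along these lines. This is a sketch assuming Python 3, where the built-in `ord()` returns a character's code point as a plain integer. The helper name `diagram_row` is hypothetical, and the two sample characters (TAMIL LETTER A, U+0B85, and the Chinese character U+4E2D) are just illustrative picks.

```python
def diagram_row(ch):
    """Build 'char --> integer --> U+XXXX --> binary' for one character."""
    n = ord(ch)  # the code point as a plain integer
    # Use at least 16 bits, rounded up to a whole number of 4-bit groups.
    width = max(16, ((n.bit_length() + 3) // 4) * 4)
    bits = format(n, "0{}b".format(width))
    grouped = " ".join(bits[i:i + 4] for i in range(0, width, 4))
    return "{} --> {} --> U+{:04X} --> {}".format(ch, n, n, grouped)


# One English, one Tamil and one Chinese character (illustrative choices).
for sample in ["A", "\u0B85", "\u4E2D"]:
    print(diagram_row(sample))
```

The row for "A" comes out as `A --> 65 --> U+0041 --> 0000 0000 0100 0001`. The other two rows carry the integers 2949 and 20013.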

    OK, so my questions are:

    1) Is the decimal expression for the capital letter A, 65, exactly
    the same as its integer code point position in the total Unicode
    series, expressed as a series of integers?

    2) How can one ascertain the integer number for a code point
    outside (above) basic ANSI?

    e.g. in the diagram I want to put an English char, a Tamil char and a
    Chinese character...

    So we want to be able to say, for the layman:

    "The entire Tamil alphabet is contained between characters 2560 and
    2843 in the Unicode series". But one needs to

    a) be able to find where those blocks are (where do you go to find the
    block beginnings and endings for different languages?)
    b) be able to translate "U+0BE6" (which is a position in the Tamil set)
      back to a simple integer in the series. If I just "do the math" using
    the same correlation as for the letter A ["0041" = "65", therefore 0BE6
    must equal **** ] ... will it be correct?
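
"Doing the math" here amounts to reading the four digits after "U+" as one hexadecimal number. A sketch assuming Python 3 follows; the helper names are hypothetical. The Tamil block limits U+0B80 and U+0BFF are taken from the Blocks.txt data file published with the Unicode Character Database, and are worth checking against the current version of that file.

```python
def label_to_int(label):
    """Turn a label such as 'U+0BE6' into its plain integer position."""
    text = label.strip()
    if text.upper().startswith("U+"):
        text = text[2:]  # drop the "U+" prefix
    return int(text, 16)  # read what remains as hexadecimal


# Assumed Tamil block limits (see Blocks.txt): U+0B80 through U+0BFF.
TAMIL_FIRST = label_to_int("U+0B80")
TAMIL_LAST = label_to_int("U+0BFF")


def in_tamil_block(label):
    """True if the labelled code point lies inside the assumed Tamil range."""
    return TAMIL_FIRST <= label_to_int(label) <= TAMIL_LAST


print(label_to_int("U+0041"))    # 65
print(label_to_int("U+0BE6"))    # 3046
print(TAMIL_FIRST, TAMIL_LAST)   # 2944 3071
print(in_tamil_block("U+0BE6"))  # True
```

On this reading the same arithmetic that gives 65 for "0041" gives 3046 for "0BE6", and the Tamil block would run from 2944 to 3071.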

    I'm hoping I can go somewhere to find this info easily from some
    tables....

    TIA!

    Sannyasin Sivakatirswami
    Himalayan Academy Publications
    at Kauai's Hindu Monastery
    katir@hindu.org

    www.HimalayanAcademy.com,
    www.HinduismToday.com
    www.Gurudeva.org
    www.Hindu.org


