Re: Code Point -- What is the integer?

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Thu Apr 28 2005 - 12:03:38 CST


    On Wed, 27 Apr 2005, Sivakatirswami wrote:

    > "Unicode is this just a long series from One to over One Million and
    > there is a character in each place and the whole list includes all the
    > characters of all the languages known to man, past and present."

    That is a useful "visualization" and a good starting point for an
    analysis, but it is not quite correct:

    Unicode is an evolving standard, and new characters are added to it.
    It contains almost all characters used in living languages and writing
    systems, but not all historic characters or characters used in
    special notations (mathematics etc.). Besides, not every character has a
    code point of its own; some characters with a diacritic mark can only be
    written in decomposed form, i.e. as a base character followed by one or
    more combining diacritic marks.
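
    As a rough illustration of the decomposed case, here is a small Python
    sketch using the standard unicodedata module. The choice of example
    characters is mine, not part of the original question: "e" with an acute
    accent has a precomposed code point, whereas capital "J" with a caron
    does not, so it stays a two-code-point sequence even after composition
    (NFC normalization).

```python
import unicodedata

# Base character followed by a combining mark (decomposed form).
e_acute = "e\u0301"   # e + COMBINING ACUTE ACCENT
j_caron = "J\u030C"   # J + COMBINING CARON

# NFC composes a sequence into a single code point where one exists.
e_composed = unicodedata.normalize("NFC", e_acute)
j_composed = unicodedata.normalize("NFC", j_caron)

print(len(e_composed))  # 1: a precomposed code point exists (U+00E9)
print(len(j_composed))  # 2: no precomposed code point, still base + mark

# Show the code points that make up the sequence that could not be composed.
for ch in j_composed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```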

    Not all places (code points) contain a character - most code points are
    currently unassigned, and some are explicitly defined as noncharacters.
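
    To get a feeling for the proportions, the following Python sketch counts
    code points over the whole range 0..10FFFF. It is only an approximation
    of the state of the standard: the counts depend on the Unicode version
    bundled with the Python interpreter, and the helper name is_noncharacter
    is my own. The noncharacters are U+FDD0..U+FDEF plus the last two code
    points of each of the 17 planes.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF
total = MAX_CODE_POINT + 1


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the code points ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


noncharacters = sum(is_noncharacter(cp) for cp in range(total))

# General category "Cn" covers unassigned code points and noncharacters alike.
not_assigned = sum(
    unicodedata.category(chr(cp)) == "Cn" for cp in range(total)
)

print(noncharacters)  # 66
print(f"{not_assigned} of {total} code points have no character assigned")
```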

    > I understand "004F" to be the hexadecimal representation for four
    > separate, 4-bit sequences.

    No, it is just a different (base 16) notation for an integer, and it
    postulates no particular implementation at bit level. It's simply a
    numeral. Unicode (and other character standards) mostly use hexadecimal
    notation for code points, partly due to the structure of the coding space.
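
    A short Python sketch may make this concrete. "004F" is parsed as a
    base 16 numeral, and the resulting integer is the same whichever notation
    is used to write it. The helper u_notation is a name I made up for the
    example; the shift by 16 bits merely illustrates how hexadecimal lines up
    with the division of the coding space into planes of 10000 (hex) code
    points.

```python
# Parse the numeral "004F" in base 16; the result is an ordinary integer.
code_point = int("004F", 16)
print(code_point)                              # 79
print(code_point == 0x4F == 0b1001111 == 79)   # True: same integer, other notations
print(chr(code_point))                         # O


def u_notation(ch: str) -> str:
    """Format a character's code point in the conventional U+XXXX style."""
    return f"U+{ord(ch):04X}"


print(u_notation("O"))            # U+004F
print(u_notation("\U0001F600"))   # U+1F600

# In hexadecimal the plane number is simply the part above the last four digits.
plane = 0x1F600 >> 16
print(plane)                      # 1
```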

    A word of warning: although characters are identified by their code
    points, which are numbers (unsigned integers), the _numeric_ (arithmetic)
    value is usually irrelevant. That is, we mostly don't operate on them as
    numbers, with arithmetic operations. For most purposes, the numbers are
    just indexes. For instance, if a character's code point is numerically
    smaller than another character's code point, this generally implies
    nothing about the relative order of the _characters_ in an alphabet or in
    sorting. (It is more or less a coincidence that _some_ characters have
    code points that correspond to their alphabetic order.)
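
    The point about ordering can be seen in a small Python sketch. Python
    compares strings code point by code point, so sorted() on plain strings
    gives code point order, not dictionary order. The word list is invented
    for the example.

```python
words = ["zebra", "Apple", "apple", "Zulu", "\u00e4rm"]

# Plain comparison goes by code point value, not by alphabetical order.
by_code_point = sorted(words)
print(by_code_point)

# All capital letters A-Z precede all small letters a-z in code point order,
# and the letter a with diaeresis (U+00E4) comes after z.
print(ord("Z") < ord("a"))        # True
print(ord("\u00e4") > ord("z"))   # True
```

    Here "Zulu" ends up before "apple", and the word starting with U+00E4
    ends up last, which no dictionary would do. Sorting for human readers
    needs language-specific collation rules instead.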

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

