Re: 'code unit' and 'code point' meaning check

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed May 14 2003 - 17:58:29 EDT


    Ben,

    > could someone confirm if i've got this correct, or not please?:
    >
    > a 'code unit' could be the same as a 'code point', but there again it
    > might not be. it's possible that several 'code units' are required to
    > make up a 'code point'? (so code units can be the same size or smaller
    > than a code point, but not the other way round)?

    Think of it this way.

    The code *point* is a number in the codespace, used to encode
    an abstract character. For Unicode, it is a number in the
    range 0x0000..0x10FFFF (or think of it as 0..1,114,111 expressed
    in decimal). These get expressed with the U+ notation in Unicode.
    Thus U+0041 is the code point for LATIN CAPITAL LETTER A.
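
    As a small illustration of the point above, the following Python
    sketch treats a code point as a plain integer and formats it in U+
    notation. The helper name u_plus is made up for this example; only
    the built-in ord() is assumed.

```python
# A code point is just an integer in the codespace 0..0x10FFFF.
# u_plus is a hypothetical helper name used only for this illustration.
def u_plus(cp: int) -> str:
    """Format a code point in U+ notation (at least four hex digits)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return f"U+{cp:04X}"


cp = ord("A")            # code point of LATIN CAPITAL LETTER A
print(cp, u_plus(cp))    # 65 U+0041
print(0x10FFFF)          # 1114111, the top of the codespace in decimal
```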

    The code *unit* is a fixed-width integral data type used in the
    context of a particular encoding form. The encoded character is
    represented in that encoding form by either a single code unit
    or a sequence of code units.

    In UTF-8, the code unit is always an 8-bit integer. (0x00..0xFF)
    In UTF-16, the code unit is always a 16-bit integer. (0x0000..0xFFFF)
    In UTF-32, the code unit is always a 32-bit integer.
        (0x00000000..0x0010FFFF)
        
    Code units don't "make up a code point".

    Rather, a sequence of one or more code units is used to
    represent a Unicode encoded character in a particular encoding form.
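
    To make that concrete, here is a minimal Python sketch that lists
    the code units representing one character in each encoding form.
    The function name code_units is an assumption for this example; it
    relies only on the standard utf-8, utf-16-be and utf-32-be codecs
    and the struct module, and the big-endian byte order is chosen
    just so the bytes can be regrouped into integers.

```python
import struct


# code_units is a hypothetical helper name used only for this illustration.
def code_units(ch: str, form: str) -> list:
    """Return the code units that represent ch in the given encoding form."""
    if form == "utf-8":
        # Each byte is one 8-bit code unit.
        return list(ch.encode("utf-8"))
    if form == "utf-16":
        data = ch.encode("utf-16-be")  # no byte order mark is added
        return list(struct.unpack(f">{len(data) // 2}H", data))
    if form == "utf-32":
        data = ch.encode("utf-32-be")  # no byte order mark is added
        return list(struct.unpack(f">{len(data) // 4}I", data))
    raise ValueError(f"unknown encoding form: {form}")


# One code point each, but a different number of code units per form.
for ch in ("A", "\u00E9", "\U00010400"):
    for form in ("utf-8", "utf-16", "utf-32"):
        units = [f"{u:X}" for u in code_units(ch, form)]
        print(f"U+{ord(ch):04X} {form}: {units}")
```

    For U+10400 this gives four 8-bit code units in UTF-8 (F0 90 90 80),
    two 16-bit code units in UTF-16 (D801 DC00), and one 32-bit code
    unit in UTF-32 (10400). The code point is the same number in every
    case; only its representation as code units changes.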

    --Ken


