RE: 'code unit' and 'code point' meaning check

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed May 14 2003 - 18:45:07 EDT

    Rick Cameron asked:

    > (I'm sure this is an FAQ - but why are the code points 0xd800-0xdfff not
    > considered noncharacters? Obviously no abstract character can be associated
    > with them! Is there a different term that describes code points like this?)

    The terminology for this is not in the online preview of Chapter 3,
    but rather in Chapter 2 of Unicode 4.0. (Incidentally, the editors are trying to get
    preview versions of Chapters 1 and 2 posted, as well, to help
    out with questions like this while we are waiting for the actual
    publication of the book from Addison-Wesley.)

    The answer is that in the new scheme for 4.0, the Unicode Technical Committee
    has decided on a nomenclature that divides code points into
    7 basic types (gc refers to General Category property values):

    1. Graphic (gc = [L, M, N, P, S, Zs])

    2. Format (gc = [Cf, Zl, Zp])

    3. Control (gc = Cc)

    4. Private-use (gc = Co)

    5. Surrogate (gc = Cs)

    6. Noncharacter (gc = Cn, in part)

    7. Reserved (gc = Cn, in part)
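
    As a rough illustration of the seven-way split above, here is a minimal
    Python sketch. The function names basic_type and is_noncharacter are
    made up for this example. It assumes the noncharacters are U+FDD0..U+FDEF
    plus the last two code points of each plane, and it relies on whatever
    Unicode version the local Python's unicodedata module ships, so the
    Reserved/assigned boundary will differ from Unicode 4.0.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def basic_type(cp: int) -> str:
    """Map a code point to one of the seven basic types via its General Category."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    gc = unicodedata.category(chr(cp))
    if gc[0] in "LMNPS" or gc == "Zs":
        return "Graphic"
    if gc in ("Cf", "Zl", "Zp"):
        return "Format"
    if gc == "Cc":
        return "Control"
    if gc == "Co":
        return "Private-use"
    if gc == "Cs":
        return "Surrogate"
    # Only gc == "Cn" is left; it is split between types 6 and 7.
    return "Noncharacter" if is_noncharacter(cp) else "Reserved"
```

    With this sketch, basic_type(0xD800) gives "Surrogate" and
    basic_type(0xFFFF) gives "Noncharacter", which is the distinction
    Rick was asking about.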

    Types 1-4 are considered *assigned* to abstract characters.
    Types 5-7 are considered *not assigned* to abstract characters.

    Types 1-6 are considered *designated* code points (which means
      that the standard specifies something normative about their
      usage).
    Type 7 are considered *undesignated* code points (which means they
      are reserved for future use, and in principle could be turned
      into any of types 1-4 or 6 by future changes).
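
    The two groupings above can be written down as simple set membership.
    This is only a sketch of the terminology; the names ASSIGNED,
    DESIGNATED, FUTURE_OF_RESERVED, is_assigned and is_designated are
    invented for the example and are not part of the standard.

```python
# The seven basic types, in the order listed above.
ALL_TYPES = ("Graphic", "Format", "Control", "Private-use",
             "Surrogate", "Noncharacter", "Reserved")

# Types 1-4: assigned to abstract characters.
ASSIGNED = frozenset(ALL_TYPES[:4])

# Types 1-6: designated (the standard says something normative about them).
DESIGNATED = frozenset(ALL_TYPES[:6])

# What a Reserved (type 7) code point could become: types 1-4 or 6,
# but never Surrogate.
FUTURE_OF_RESERVED = ASSIGNED | {"Noncharacter"}


def is_assigned(basic_type: str) -> bool:
    """True if code points of this type are assigned to abstract characters."""
    if basic_type not in ALL_TYPES:
        raise ValueError("unknown basic type: %r" % (basic_type,))
    return basic_type in ASSIGNED


def is_designated(basic_type: str) -> bool:
    """True if the standard specifies something normative about this type."""
    if basic_type not in ALL_TYPES:
        raise ValueError("unknown basic type: %r" % (basic_type,))
    return basic_type in DESIGNATED
```

    So a Surrogate code point is designated but not assigned, and it is a
    separate type from Noncharacter, even though neither has an abstract
    character associated with it.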
      
    Type 4, Private-use code points, are somewhat odd, in that they
    are considered assigned to abstract characters, but the abstract
    characters are *truly* abstract, i.e., essentially, private use
    character #1, private use character #2, ..., and the standard
    gives them no further semantic interpretation. But the convention
    was chosen because implementations are more robust if they treat
    all the private-use code points as if they had characters assigned
    to them, rather than as if they were just reserved.

    --Ken


