RE: 'code unit' and 'code point' meaning check

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed May 14 2003 - 18:45:07 EDT

Next message: Yael.Aharon@nokia.com: "Unicode conformant character encodings and us-ascii"

Previous message: Ben Dougall: "Re: 'code unit' and 'code point' meaning check"
Maybe in reply to: Ben Dougall: "'code unit' and 'code point' meaning check"
Next in thread: Philippe Verdy: "Fw: 'code unit' and 'code point' meaning check"

Rick Cameron asked:

> (I'm sure this is an FAQ - but why are the code points 0xd800-0xdfff not
> considered noncharacters? Obviously no abstract character can be associated
> with them! Is there a different term that describes code points like this?)

The relevant text is not in the online preview of Chapter 3, but rather in
Chapter 2 of Unicode 4.0. (Incidentally, the editors are trying to get
preview versions of Chapters 1 and 2 posted, as well, to help
out with questions like this while we are waiting for the actual
publication of the book from Addison-Wesley.)

The answer is that in the new scheme for 4.0, the Unicode Technical Committee
has decided on a nomenclature that divides code points into
7 basic types (gc refers to General Category property values):

1. Graphic (gc = [L, M, N, P, S, Zs])

2. Format (gc = [Cf, Zl, Zp])

3. Control (gc = Cc)

4. Private-use (gc = Co)

5. Surrogate (gc = Cs)

6. Noncharacter (gc = Cn, in part)

7. Reserved (gc = Cn, in part)

Types 1-4 are considered *assigned* to abstract characters.
Types 5-7 are considered *not assigned* to abstract characters.

Types 1-6 are considered *designated* code points (which means
  that the standard specifies something normative about their
  usage).
Type 7 code points are considered *undesignated* (which means they
  are reserved for future use, and in principle could be turned
  into any of types 1-4 or 6 by future changes).
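
As a rough illustration of the seven types and the two splits above, here is
a minimal Python sketch. It assumes that the General Category reported by
Python's unicodedata module is an acceptable stand-in for the standard's
data; the answer for Reserved code points therefore depends on the Unicode
version bundled with the interpreter. The function names (basic_type,
is_assigned, is_designated, is_noncharacter) are made up for this example
and do not come from the standard.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def basic_type(cp: int) -> str:
    """Map a code point to one of the seven basic types listed above."""
    gc = unicodedata.category(chr(cp))
    if gc[0] in "LMNPS" or gc == "Zs":
        return "Graphic"
    if gc in ("Cf", "Zl", "Zp"):
        return "Format"
    if gc == "Cc":
        return "Control"
    if gc == "Co":
        return "Private-use"
    if gc == "Cs":
        return "Surrogate"
    # Only gc == "Cn" is left; it covers both noncharacters and reserved.
    return "Noncharacter" if is_noncharacter(cp) else "Reserved"


def is_assigned(cp: int) -> bool:
    """Types 1-4: assigned to an abstract character."""
    return basic_type(cp) in {"Graphic", "Format", "Control", "Private-use"}


def is_designated(cp: int) -> bool:
    """Types 1-6: the standard says something normative about usage."""
    return basic_type(cp) != "Reserved"


if __name__ == "__main__":
    for cp in (0x41, 0x200E, 0x0A, 0xE000, 0xD800, 0xFFFE, 0x40000):
        print(f"U+{cp:04X}", basic_type(cp), is_assigned(cp), is_designated(cp))
```

Note that a surrogate such as U+D800 comes out as designated but not
assigned, which is the distinction Rick's question turns on.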

Type 4, Private-use code points, are somewhat odd, in that they
are considered assigned to abstract characters, but the abstract
characters are *truly* abstract, i.e., essentially, private use
character #1, private use character #2, ..., and the standard
gives them no further semantic interpretation. But the convention
was chosen because implementations are more robust if they treat
all the private-use code points as if they had characters assigned
to them, rather than as if they were just reserved.
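
The robustness point can be sketched in Python as follows. This is only an
illustration under assumptions: treat_as_character is a hypothetical policy
function, not anything defined by the standard, and it relies on the General
Category values reported by Python's unicodedata module. The three
private-use ranges listed are the BMP Private Use Area and planes 15 and 16,
minus the two noncharacters at the end of each plane.

```python
import unicodedata

# Private-use ranges: BMP Private Use Area, plane 15, plane 16.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]


def is_private_use(cp: int) -> bool:
    """True if cp falls in one of the private-use ranges."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def treat_as_character(cp: int) -> bool:
    """Hypothetical policy: process anything that is not Cn or Cs as a
    character. Private-use code points (gc = Co) pass, so private
    agreements between users keep working; reserved code points and
    noncharacters (gc = Cn) and surrogates (gc = Cs) do not."""
    return unicodedata.category(chr(cp)) not in ("Cn", "Cs")


if __name__ == "__main__":
    for cp in (0xE000, 0xF0000, 0x10FFFD, 0xFFFFE, 0x40000):
        print(f"U+{cp:04X}", is_private_use(cp), treat_as_character(cp))
```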

--Ken


This archive was generated by hypermail 2.1.5 : Wed May 14 2003 - 19:31:35 EDT