L2/01-020

From: Mark Davis [mark@macchiato.com]
Sent: Monday, January 08, 2001 11:31 AM

General Category for Noncharacters

In http://www.unicode.org/Public/UNIDATA/UnicodeData.html, we define the General Category Cn as follows:

    Cn  Other, Not Assigned (no characters in the file have this property)

In practice, the Unicode general category is a partition of the entire coding space from 0000 to 10FFFF. All known APIs that return the Unicode general category return Cn for FFFF (since they have to return *some* value). Because of this, Cn should also explicitly include the noncharacters such as FFFE and FFFF. This is especially true given the clarifications we have made recently, and the newly assigned noncharacters in Unicode 3.1.

[It would be cleaner to have a different value, Cx, for noncharacters, just to allow people to easily distinguish them, but we can't do that. For that, people have to rely on the newly refurbished PropList.]

The proposed changes to clarify this situation are:

1. Change the definition to:

    Cn  Other, Noncharacter Code Points, Not Assigned code points (all code points with this property are omitted from the UnicodeData file, but may occur in other files in the UCD)

2. Add text to clarify that the General Category is a partition of the *codespace* from 0000 to 10FFFF, rather than a partition of the encoded characters. In particular:

   - all the code points that have been assigned to characters get some value other than Cn or Cs,
   - Cs is applied to surrogate code points, and
   - Cn applies to all other code points (unassigned and noncharacter).
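
The partition described in item 2 can be sketched in a few lines of Python. This is an illustration only, not part of the proposed text. The helper names (is_surrogate, is_noncharacter, general_category) are hypothetical, and the noncharacter set is assumed to be FDD0..FDEF plus the last two code points of each plane. Python's unicodedata module reflects whatever Unicode version ships with the interpreter, which is later than 3.1, so the values for individual assigned characters may differ from the data files discussed here.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # top of the codespace


def is_surrogate(cp: int) -> bool:
    """True for surrogate code points D800..DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for FDD0..FDEF and for the last two code points of every plane
    (assumed noncharacter set, hard-coded for this sketch)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def general_category(cp: int) -> str:
    """Return a General Category for *any* code point in the codespace.

    Hypothetical helper: surrogates get Cs, noncharacters get Cn, and
    everything else is looked up in Python's bundled character database,
    which itself returns Cn for unassigned code points.
    """
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("code point outside 0000..10FFFF")
    if is_surrogate(cp):
        return "Cs"
    if is_noncharacter(cp):
        return "Cn"
    return unicodedata.category(chr(cp))


# Every noncharacter in the codespace, collected for inspection.
noncharacters = [cp for cp in range(MAX_CODE_POINT + 1) if is_noncharacter(cp)]

print(general_category(0xFFFF))  # noncharacter -> Cn
print(general_category(0xD800))  # surrogate    -> Cs
print(general_category(0x0041))  # LATIN CAPITAL LETTER A -> Lu
```

Because the function returns a value for every code point, the result is a partition of the codespace rather than of the encoded characters. Python's own unicodedata.category already returns Cn for FFFF, which matches the API behavior noted above.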