L2/01-009R

From: Mark Davis [mark@macchiato.com]
Sent: Friday, December 29, 2000 11:18 AM
Agenda Item: Cn - category definition

In http://www.unicode.org/Public/UNIDATA/UnicodeData.html, we define the General Category Cn as follows:

    Cn  Other, Not Assigned (no characters in the file have this property)

I believe this needs to be changed to:

    Cn  Other, Noncharacters, Not Assigned
        (all code points with this property are omitted from the UnicodeData file,
        but may occur in other files in the UCD)

Because the Unicode general category is a partition of the entire coding space from 0000 to 10FFFF, Cn must also include the non-characters **FFFE and **FFFF. This is especially true given the clarifications we have made recently, and the newly assigned noncharacters in Unicode 3.2. All known APIs for Unicode general categories return this value.

[Cleaner would be to have a different value Cx for noncharacters, just to allow people to easily distinguish them, but we can't do that. For that, people have to rely on the newly refurbished PropList.]

Comments from Ken Whistler:

From: Kenneth Whistler [kenw@sybase.com]
Sent: Tuesday, January 02, 2001 5:23 PM
Subject: Re: Agenda Item

Mark,

>
> It is for the UTC meeting. However, I wanted to have some discussion here
> first. If there is no discussion within a day or two,

My comments below.

> then could you put it
> on the meeting agenda?
>
> Mark
> ___
>
> Is this proposed for the editorial committee meeting or the UTC? Thanks...
>
> Lisa
>
>
>
> "Mark Davis" on 12/29/2000 08:17:30 AM
> cc:
> Subject: Agenda Item
>
>
>
> In http://www.unicode.org/Public/UNIDATA/UnicodeData.html, we define the
> General Category Cn as follows:
>
> Cn Other, Not Assigned (no characters in the file have this property)
>
> I believe this needs to be changed to:
>
> Cn Other, Noncharacters, Not Assigned
> (all code points with this property are omitted from the UnicodeData
> file,
> but may occur in other files in the UCD)

I am not opposed to making this reinterpretation, but I think it deserves a fuller explanation. It is important to clarify that the General Category is being treated as a partition of the *codespace*, rather than as a partition of the encoded characters. Given that understanding, then all the code points that have been assigned get some value other than Cn, and Cn applies to all other code points.

>
> Because the Unicode general category is a partition of the entire coding
> space from 0000 to 10FFFF, Cn must also include the non-characters **FFFE
> and **FFFF. This is especially true given the clarifications we have made
> recently, and the newly assigned noncharacters in Unicode 3.2.

Unicode 3.1, not Unicode 3.2.

> All known
> APIs for Unicode general categories return this value.

Not the Sybase API. ;-) Since I didn't believe in the General Category in the first place, I didn't build the Sybase character property API around it. I may have to *add* a getGeneralCategory() API in the future that mimics other API's, if clients feel they need it, but at the moment my library doesn't return a "Cn" value.

>
> [Cleaner would be to have a different value Cx for noncharacters, just to
> allow people to easily distinguish them, but we can't do that. For that,
> people have to rely on the newly refurbished PropList.]

I agree that addition of a new value Cx would be out of the question now.

--Ken
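
Illustrative note on the thread above: the claim that the General Category partitions the whole codespace, with noncharacters falling under Cn, can be checked against one existing API. What follows is a minimal sketch, assuming Python's standard `unicodedata` module as the general-category API (it is not one of the APIs named in this thread). The helper names `is_noncharacter` and `general_category` are hypothetical and exist only for this example. Because Cn alone cannot separate noncharacters from unassigned code points, the sketch tests the noncharacter ranges directly, which is the role the thread assigns to PropList.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # upper end of the codespace, 0000..10FFFF


def is_noncharacter(cp: int) -> bool:
    """Return True for the noncharacter code points.

    These are U+FDD0..U+FDEF plus the last two code points of every
    plane (the **FFFE and **FFFF values mentioned in the thread).
    """
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def general_category(cp: int) -> str:
    """Return the two-letter General Category for any code point.

    Assumption: unicodedata.category stands in for a general-category API.
    It returns a value for every code point, assigned or not.
    """
    return unicodedata.category(chr(cp))


if __name__ == "__main__":
    # An assigned letter, a surrogate, and three noncharacters.
    for cp in (0x0041, 0xD800, 0xFDD0, 0xFFFE, 0x10FFFF):
        print(f"U+{cp:04X} category={general_category(cp)} "
              f"noncharacter={is_noncharacter(cp)}")
```

Under these assumptions every code point gets exactly one category value, and the noncharacters all report Cn, so telling them apart from ordinary unassigned code points needs a separate check such as `is_noncharacter`.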