Re: Non-characters in Unicode data files

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Mon Dec 29 2003 - 14:59:08 EST

Next message: Peter Kirk: "Re: Ancient Northwest Semitic Script"

Previous message: Patrick Andries: "Re: UNICODE & OTHER STANDARDS"
In reply to: Philippe Verdy: "Non-characters in Unicode data files"
Next in thread: Philippe Verdy: "Re: Non-characters in Unicode data files"
Reply: Philippe Verdy: "Re: Non-characters in Unicode data files"

Philippe Verdy wrote:
> I note that the UCD contains lines for PUAs like this:
> ...
> E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
> F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
> ...
> But why aren't there lines for the _assigned_ Private Local-Use characters in

1. No one saw a need to include them?
2. The documentation file points out that Cn entries are not included:
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
3. See DerivedAge.txt which I point out below.
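
To illustrate point 2, here is a minimal sketch of how a reader of UnicodeData.txt might handle the First/Last range pairs and fall back to Cn for anything the file does not list. The sample lines and the function names (load_categories, general_category) are my own for illustration, not part of any Unicode tool:

```python
# Sample lines in the UnicodeData.txt layout; range pairs use
# "<..., First>" and "<..., Last>" in the name field.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
"""


def load_categories(text):
    """Return a list of (start, end, general_category) tuples."""
    ranges = []
    range_start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        cp, name, gc = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = cp              # remember where the range opens
        elif name.endswith(", Last>"):
            ranges.append((range_start, cp, gc))
            range_start = None
        else:
            ranges.append((cp, cp, gc))   # ordinary single-code-point line
    return ranges


def general_category(cp, ranges):
    """Look up a code point; anything not listed defaults to Cn."""
    for start, end, gc in ranges:
        if start <= cp <= end:
            return gc
    return "Cn"  # assumption: unlisted means unassigned or noncharacter


ranges = load_categories(SAMPLE)
print(general_category(0xE123, ranges))  # inside the Private Use range
print(general_category(0xFDD0, ranges))  # not listed in the file
```

With this approach U+FDD0 needs no line of its own: it simply falls through to the Cn default.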

> the Arabic compatibility block, like:
> ...
> FDD0;<Private Local-Use, First>;Cn;0;L;;;;;N;;;;;
> FDEF;<Private Local-Use, Last>;Cn;0;L;;;;;N;;;;;
> ...
> which seem related and used only for local processing of contextual forms,
> and not restricted to local rendering of Arabic ?

I think it is a legitimate question why the block boundaries were not adjusted to exclude this
non-character range from FB50..FDFF; Arabic Presentation Forms-A (see Blocks.txt).

However, the Unicode standard designates these only as generic noncharacters; it does not reserve
them for any particular purpose such as "local processing of contextual forms".

> For now, even if it's specified in the text of the standard, it does not
> clearly show that these characters are assigned but invalid in all versions
> of Unicode, unlike other missing code-points which may be assigned later and
> should not be considered as invalid.

Unicode 3.1 (http://www.unicode.org/reports/tr27/) clarified their usage. See "3.1 Conformance
Requirements (revision)" and then the heading "Noncharacters" a page or so below, including the
definition D7b Noncharacter. See the equivalent parts of Unicode 4.

Noncharacters are not "invalid", but they are "designated" and therefore cannot be reassigned:
http://www.unicode.org/alloc/CurrentAllocation.html
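
Since the set of noncharacters is fixed, a test for them does not need any data file. A minimal sketch, assuming the 66 code points designated as of Unicode 3.1 (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes); the function name is_noncharacter is my own:

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the designated noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # U+FDD0..U+FDEF, or U+nFFFE / U+nFFFF at the end of any plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Count them across the whole code space: 32 + 17 * 2.
count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(count)  # 66
```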

I personally find the chart under [91-C31] Consensus useful:
http://www.unicode.org/consortium/utc-minutes/UTC-091-200205.html

> Other non-characters are also absent from the file (which does not contain
> in fact any "Cn" characters), and I wonder why they are not listed:
> ...

See my pointer to UCD.html above.

> I think that, if these codepoints are effectively permanently assigned as
> invalid, these assignments should be listed.
>
> Another solution would be to list these non-characters in
> DerivedCoreProperties.txt

Well, they are listed in http://www.unicode.org/Public/UNIDATA/DerivedAge.txt
If you search for "noncharacter" there, you will find which ones were designated in which Unicode
version. (Only two were designated in Unicode 1.)
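
The search can also be scripted. A rough sketch follows; the sample lines are hand-written to imitate the DerivedAge.txt layout (they are not copied from the file, so check the real data), and the function name noncharacter_ages is only for illustration:

```python
import re

# Hand-written lines imitating the DerivedAge.txt layout:
#   start[..end] ; age # comment
SAMPLE = """\
FFFE..FFFF    ; 1.1 #   [2] <noncharacter-FFFE>..<noncharacter-FFFF>
1FFFE..1FFFF  ; 2.0 #   [2] <noncharacter-1FFFE>..<noncharacter-1FFFF>
FDD0..FDEF    ; 3.1 #  [32] <noncharacter-FDD0>..<noncharacter-FDEF>
E000..F8FF    ; 1.1 # [6400] <private-use-E000>..<private-use-F8FF>
"""

LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*([0-9.]+)\s*#(.*)$")


def noncharacter_ages(text):
    """Return (start, end, age) for each line whose comment mentions noncharacters."""
    result = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m and "noncharacter" in m.group(4):
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)  # single code point: end == start
            result.append((start, end, m.group(3)))
    return result


for start, end, age in noncharacter_ages(SAMPLE):
    print("%04X..%04X designated in Unicode %s" % (start, end, age))
```

To run it against the real file, read DerivedAge.txt into a string and pass that in place of SAMPLE.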

Best regards,
markus

-- 
Opinions expressed here may not reflect my company's positions unless otherwise noted.


This archive was generated by hypermail 2.1.5 : Mon Dec 29 2003 - 15:32:54 EST