Re: Non-characters in Unicode data files

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Mon Dec 29 2003 - 14:59:08 EST


    Philippe Verdy wrote:
    > I note that the UCD contains lines for PUAs like this:
    > ...
    > E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
    > F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
    > ...
    > But why isn't there lines for the _assigned_ Private Local-Use characters in

    1. No one saw a need to include them?
    2. The documentation file points out that Cn entries are not included:
       http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
    3. See DerivedAge.txt which I point out below.
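
    As a rough illustration of point 2, here is a minimal Python sketch of how a
    reader of UnicodeData.txt might treat the <..., First>/<..., Last> pairs quoted
    above and fall back to Cn for any code point the file does not list. The
    function names (load_categories, general_category) and the SAMPLE string are
    hypothetical; only the two PUA lines come from the quoted text.

```python
# Sketch only: helper names are made up for illustration.
# SAMPLE holds the two Private Use lines quoted in this thread.
SAMPLE = """\
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
"""


def load_categories(lines):
    """Parse UnicodeData-style lines into single entries and First/Last ranges."""
    singles = {}      # code point -> general category
    ranges = []       # (first, last, general category)
    range_start = None
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        cp, name, gc = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = cp
        elif name.endswith(", Last>") and range_start is not None:
            ranges.append((range_start, cp, gc))
            range_start = None
        else:
            singles[cp] = gc
    return singles, ranges


def general_category(cp, singles, ranges):
    """Look up a code point; anything not listed defaults to Cn."""
    if cp in singles:
        return singles[cp]
    for first, last, gc in ranges:
        if first <= cp <= last:
            return gc
    return "Cn"


singles, ranges = load_categories(SAMPLE.splitlines())
print(general_category(0xE123, singles, ranges))  # inside the PUA range
print(general_category(0xFDD0, singles, ranges))  # not listed in the file
```

    With this default in place, the noncharacters need no lines of their own in
    the file to come out as Cn.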

    > the Arabic compatibility block, like:
    > ...
    > FDD0;<Private Local-Use, First;Cn;0;L;;;;;N;;;;;
    > FDEF;<Private Local-Use, First;Cn;0;L;;;;;N;;;;;
    > ...
    > which seem related and used only for local processing of contextual forms,
    > and not restricted to local rendering of Arabic ?

    I think it is a legitimate question why the block boundaries were not adjusted to exclude this
    non-character range from FB50..FDFF; Arabic Presentation Forms-A (see Blocks.txt).

    However, the Unicode standard only points these out as generic non-characters, not for any
    particular purpose like "local processing of contextual forms".

    > For now, even if it's specified in the text of the standard, it does not
    > clearly shows that these characters are assigned but invalid in all versions
    > of Unicode, unlike other missing code-points which may be assigned later and
    > should not be considered as invalid.

    Unicode 3.1 (http://www.unicode.org/reports/tr27/) clarified their usage. See "3.1 Conformance
    Requirements (revision)" and then the heading "Noncharacters" a page or so below, including the
    definition D7b Noncharacter. See the equivalent parts of Unicode 4.

    Noncharacters are not "invalid", but they are "designated" and therefore cannot be reassigned:
    http://www.unicode.org/alloc/CurrentAllocation.html
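
    Because the designation is fixed, a test for noncharacters can be hard-coded
    rather than read from a data file. The sketch below assumes the usual
    description of the set: U+FDD0..U+FDEF plus the last two code points of each
    of the 17 planes, 66 code points in all. The name is_noncharacter is
    hypothetical.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a designated noncharacter (sketch, see assumptions)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # The contiguous range inside the Arabic Presentation Forms-A block.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE


# Enumerate the full set by brute force over the code space.
ALL_NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(ALL_NONCHARACTERS))
```

    Note that the check is a pure function of the code point value, which is what
    distinguishes these from ordinary unassigned (Cn) code points that may be
    assigned in a later version.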

    I personally find the chart under [91-C31] Consensus useful:
    http://www.unicode.org/consortium/utc-minutes/UTC-091-200205.html

    > Other non-characters are also absent from the file (which does not contain
    > in fact any "Cn" characters), and I wonder why they are not listed:
    > ...

    See my quote above from UCD.html

    > I think that, if these codepoints are effectively permanently assigned as
    > invalid, these assignments should be listed.
    >
    > Another solution would be to list these non-characters in
    > DerivedCoreProperties.txt

    Well, they are listed in http://www.unicode.org/Public/UNIDATA/DerivedAge.txt
    If you search for "noncharacter" there, you will find which ones were designated in which Unicode
    version. (Only two were designated in Unicode 1.)
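
    The search can also be scripted. The following is a minimal sketch that
    assumes DerivedAge.txt lines have the general shape
    "range ; version # comment" and that noncharacter ranges carry the word
    "noncharacter" in the comment. The SAMPLE lines are illustrative stand-ins
    written in that assumed shape, not copied from the file, and the function
    name noncharacter_ages is hypothetical.

```python
import re

# Illustrative lines in the assumed DerivedAge.txt shape (not real file content).
SAMPLE = """\
FFFE..FFFF    ; 1.1 #   [2] <noncharacter-FFFE>..<noncharacter-FFFF>
1FFFE..1FFFF  ; 2.0 #   [2] <noncharacter-1FFFE>..<noncharacter-1FFFF>
FDD0..FDEF    ; 3.1 #  [32] <noncharacter-FDD0>..<noncharacter-FDEF>
0041..005A    ; 1.1 #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
"""

# first[..last] ; version # comment
LINE_RE = re.compile(
    r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*([0-9.]+)\s*#(.*)$"
)


def noncharacter_ages(lines):
    """Return (first, last, version) for each line whose comment mentions noncharacters."""
    result = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or "noncharacter" not in m.group(4):
            continue  # comment lines, blank lines, and ordinary characters
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)  # single code point: last == first
        result.append((first, last, m.group(3)))
    return result


for first, last, version in noncharacter_ages(SAMPLE.splitlines()):
    print(f"{first:04X}..{last:04X} designated as of {version}")
```

    To run it against the real file, pass the lines of a downloaded copy of
    DerivedAge.txt instead of SAMPLE.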

    Best regards,
    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    

