L2/04-230
Source: Mark Davis
Sent: Thu, 2004 Apr 15 12:33
Subject: Other_Alphabetic and category Nl

> That actually made an interesting test case. I put the derivations into the
> data-driven test (that nobody responded on...sniff...), and here is the result.
>
> A. It turned up one doc error: the comment with the decomposition of Default
> Ignorable Code Point in C:\DATA\UCD\4.0.1-Update\DerivedCoreProperties-4.0.1.txt
> was missing Variation Selector. And it references Annotation_characters, but we
> have no property by that name: the characters should be explicitly listed.
>
> B. There are two cases where Other_xxx is not minimal. However, these are not
> requirements. We could change them if we want in a future version of the
> standard, or leave them as overlapping.
>
> Other_Alphabetic
> Other_Default_Ignorable_Code_Point
>
> Magda, you can respond that the Other_xxx properties are not guaranteed to be
> disjoint from the other properties used in the derivation of the xxx property.
>
> Mark
>
> =============
>
>
> # Invariance tests
> # Each line indicates an invariant set relationship to be tested,
> # and is of the form:
> #
> # line := set relation set
> #
> # relation := '=' // has identical contents to
> #          := ('>' | '⊃') // is proper superset of
> #          := ('≥' | '⊇') // is superset of
> #          := ('<' | '⊂') // is proper subset of
> #          := ('≤' | '⊆') // is subset of
> #          := '!' // has no intersection
> #          := '?' // none of the above (they overlap, and neither contains the other)
> #
> # A set is a standard UnicodeSet, but where $pv can be used to express properties
> #
> # pv := '$' '×'? prop (('=' | ':') value)?
> #
> # The × indicates that the property is the previous released version.
> # That is, if the version is 4.0.1, then the × version is 4.0.0
> # If the value is missing, it is defaulted to true
> # If the value is of the form «...», then the ... is interpreted as a regular expression
> # The property can be the short or long form as in the PropertyAliases.txt
> # The value (if enumerated) can be the short or long form as in PropertyValueAliases.txt
> #
> # A UnicodeSet is a boolean combination of properties and character ranges, as you would see in
> # Perl or other regular-expression languages. Examples:
> # [$General_Category:Unassigned-[a-zA-Z]]
> # For details, see http://oss.software.ibm.com/icu/userguide/unicodeSet.html
> #
> # WARNING: do not use \p{...} or [:...:] syntax, since those will be
> # ICU's current version of properties, not the current snapshot's.
> # Use the $ notation for properties (listed above) instead.
> #
> # When this file is parsed, an error message may contain <@>
> # to indicate the location of an error in the input line.
>
> # The following are not very interesting, but show examples of use
>
> #$GC:Zs ! $GC:Zp
> #$East_Asian_Width:Neutral ? $GC:Uppercase_Letter
> $GC:Zs ? $Name:«.*SPACE.*»
>
> # Examples of parsing errors
>
> # $LBA:Neutral = $GC:Zp # example of nonexistent property
> # $LB:foo = $GC:Zp # example of nonexistent value
> # $GC:Zs @ $GC:Zp # example of unknown relation
>
> # The following should be real invariants
> # For illustration, different alias styles are used
>
> $Line_Break:Unknown = [$General_Category:Unassigned $GeneralCategory:PrivateUse]
> $LB:OP = $GC:Ps
> $General_Category:Decimal_Number = $Numeric_Type:Decimal
>
> FALSE
> **** START Error Info ****
>
> In $Numeric_Type:Decimal, but not in $General_Category:Decimal_Number :
>
> # Total code points: 0
>
> Not in $Numeric_Type:Decimal, but in $General_Category:Decimal_Number :
> 1369..1371 # Nd [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
>
> # Total code points: 9
>
> In both $Numeric_Type:Decimal, and in $General_Category:Decimal_Number :
> 0030..0039 # Nd [10] DIGIT ZERO..DIGIT NINE
> ...
> 1D7CE..1D7FF # Nd [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL MONOSPACE DIGIT NINE
>
> # Total code points: 259
> **** END Error Info ****
>
> $Whitespace ⊃ [$GC:Zs $GC:Zp $GC:Zl]
>
> # Comparisons across versions
>
> $ID_Start ⊇ $×ID_Start
> $ID_Continue ⊇ $×ID_Continue
>
> #$age:4.0.1 = $age4.0.0
>
> # Derivations
>
> $Math = [$GC:Sm $Other_Math]
> $Alphabetic = [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl $Other_Alphabetic]
> $Lowercase = [$GC:Ll $Other_Lowercase]
> $Uppercase = [$GC:Lu $Other_Uppercase]
> $ID_Start = [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl $Other_ID_Start]
> $ID_Continue = [$ID_Start $GC:Mn $GC:Mc $GC:Nd $GC:Pc]
> $Default_Ignorable_Code_Point = [[$Other_Default_Ignorable_Code_Point $GC:Cf $GC:Cc $GC:Cs $Variation_Selector $Noncharacter_Code_Point] - [$White_Space\uFFF9-\uFFFB]]
> $Grapheme_Extend = [$GC:Me $GC:Mn $Other_Grapheme_Extend]
> $Grapheme_Base = [^$GC:Cc $GC:Cf $GC:Cs $GC:Co $GC:Cn $GC:Zl $GC:Zp $Grapheme_Extend]
>
> # "Minimal" Other_: NOT hard requirements; just if we want to be minimal
>
> $Other_Math = [$Math - $GC:Sm]
> $Other_Alphabetic = [$Alphabetic - [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl]]
>
> FALSE
> **** START Error Info ****
>
> In [$Alphabetic - [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl]], but not in $Other_Alphabetic :
>
> # Total code points: 0
>
> Not in [$Alphabetic - [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl]], but in $Other_Alphabetic :
> 16EE..16F0 # Nl [3] RUNIC ARLAUG SYMBOL..RUNIC BELGTHOR SYMBOL
> 2160..2183 # Nl [36] ROMAN NUMERAL ONE..ROMAN NUMERAL REVERSED ONE HUNDRED
> 1034A # Nl GOTHIC LETTER NINE HUNDRED
>
> # Total code points: 40
>
> In both [$Alphabetic - [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl]], and in $Other_Alphabetic :
> 0345 # Mn COMBINING GREEK YPOGEGRAMMENI
> ...
> FB1E # Mn HEBREW POINT JUDEO-SPANISH VARIKA
>
> # Total code points: 389
> **** END Error Info ****
>
> $Other_Lowercase = [$Lowercase - $GC:Ll]
> $Other_Uppercase = [$Uppercase - $GC:Lu]
> $Other_ID_Start = [$ID_Start - [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl]]
> $Other_Default_Ignorable_Code_Point = [$Default_Ignorable_Code_Point - [[$GC:Cf $GC:Cc $GC:Cs $Variation_Selector $Noncharacter_Code_Point] - [$White_Space\uFFF9-\uFFFB]]]
>
> FALSE
> **** START Error Info ****
>
> In [$Default_Ignorable_Code_Point - [[$GC:Cf $GC:Cc $GC:Cs $Variation_Selector $Noncharacter_Code_Point] - [$White_Space\uFFF9-\uFFFB]]], but not in $Other_Default_Ignorable_Code_Point :
>
> # Total code points: 0
>
> Not in [$Default_Ignorable_Code_Point - [[$GC:Cf $GC:Cc $GC:Cs $Variation_Selector $Noncharacter_Code_Point] - [$White_Space\uFFF9-\uFFFB]]], but in $Other_Default_Ignorable_Code_Point :
> 200B # Zs ZERO WIDTH SPACE
>
> # Total code points: 1
>
> In both [$Default_Ignorable_Code_Point - [[$GC:Cf $GC:Cc $GC:Cs $Variation_Selector $Noncharacter_Code_Point] - [$White_Space\uFFF9-\uFFFB]]], and in $Other_Default_Ignorable_Code_Point :
> 034F # Mn COMBINING GRAPHEME JOINER
> ...
> E01F0..E0FFF # Cn [3600] ..
>
> # Total code points: 3779
> **** END Error Info ****
>
> $Other_Grapheme_Extend = [$Grapheme_Extend - [$GC:Me $GC:Mn]]
>
> **** SUMMARY ****
>
> ParseErrorCount=0
> TestFailureCount=3
>
>
> Mark
>
> ----- Original Message -----
> From: "Magda Danish (Unicode)"
> To:
> Sent: Thu, 2004 Apr 15 09:16
> Subject: FW: Web Form: Subj: Other_Alphabetic and category Nl
>
>
> ________________________________
>
> Date/Time: Thu Apr 15 00:16:56 EDT 2004
> Contact: ernestcline@mindspring.com
> Report Type: Error Report
> Opt Subject: Other_Alphabetic and category Nl
>
> Unicode 4.0.1
>
> The Alphabetic property is defined in UCD.html and in DerivedCoreProperties.txt
> as being generated by:
> Other_Alphabetic + Lu + Ll + Lt + Lm + Lo + Nl
> Why then are characters of General Category Nl given the Other_Alphabetic
> property? It would seem that the only use made of the Other_Alphabetic property
> is to generate the Alphabetic property and that removing this property from the
> characters of General Category Nl would simplify things slightly as by the given
> definition, they should already be included.
>
> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> (End of Report)
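
Editorial illustration (not part of the original message): the relation symbols in the test file and the "minimal Other_" check can be sketched with ordinary Python sets. This is only a rough sketch under stated assumptions. The function names relation() and non_minimal() are hypothetical, the code does not parse UnicodeSet syntax, and the sets below are tiny hand-picked samples of code points, not the real property data.

```python
def relation(a: set, b: set) -> str:
    """Return the most specific relation symbol that holds between a and b."""
    if a == b:
        return "="   # identical contents
    if a > b:
        return "⊃"   # a is a proper superset of b
    if a < b:
        return "⊂"   # a is a proper subset of b
    if not (a & b):
        return "!"   # no intersection
    return "?"       # they overlap, and neither contains the other


def non_minimal(other: set, derived: set, base: set) -> set:
    """Members of Other_xxx that the base sets already contribute to xxx."""
    return other - (derived - base)


# Hand-picked sample code points (an assumption for illustration only).
letters = {0x0041}                    # stands in for Lu, Ll, Lt, Lm, Lo
nl = {0x16EE, 0x2160}                 # stands in for General Category Nl
other_alphabetic = {0x0345, 0x2160}   # overlaps Nl, as in the report

base = letters | nl
alphabetic = base | other_alphabetic  # the derivation of Alphabetic

# The "minimal" invariant compares Other_Alphabetic with Alphabetic minus base.
print(relation(other_alphabetic, alphabetic - base))
print(sorted(hex(cp) for cp in non_minimal(other_alphabetic, alphabetic, base)))
```

With these sample sets the first call reports a proper superset rather than equality, which mirrors the FALSE result for Other_Alphabetic above, and the second call lists the one sample Nl code point that makes the set non-minimal.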