L2/04-230

Source: Mark Davis
Sent: Thu, 2004 Apr 15 12:33
Subject: Other_Alphabetic and category Nl


> That actually made an interesting test case. I put the derivations into the
> data-driven test (that nobody responded on...sniff...), and here is the result.

>
> A. It turned up one doc error: the comment with the decomposition of Default
> Ignorable Code Point in
> C:\DATA\UCD\4.0.1-Update\DerivedCoreProperties-4.0.1.txt
> was missing Variation Selector. And it references Annotation_characters, but we
> have no property by that name: the characters should be explicitly listed.
>
> B. There are two cases where Other_xxx is not minimal. However, minimality is
> not a requirement. We could change them if we want in a future version of the
> standard, or leave them as overlapping.
>
> Other_Alphabetic
> Other_Default_Ignorable_Code_Point
>
> Magda, you can respond that the Other_xxx properties are not guaranteed to be
> disjoint from the other properties used in the derivation of the xxx property.
>
> Mark
>
> =============
>
>
> # Invariance tests
> # Each line indicates an invariant set relationship to be tested,
> # and is of the form:
> #
> #  line := set relation set
> #
> #   relation := '='             // has identical contents to
> #            := ('>' | '⊃')    // is proper superset of
> #            := ('≥' | '⊇')    // is superset of
> #            := ('<' | '⊂')    // is proper subset of
> #            := ('≤' | '⊆')    // is subset of
> #            := '!'             // has no intersection
> #            := '?'             // none of the above (they overlap, and
> #                               // neither contains the other)
> #
> # A set is a standard UnicodeSet, but where $pv can be used to express
> # properties
> #
> #  pv := '$' '×'? prop (('=' | ':') value)?
> #
> # The × indicates that the property is the previous released version.
> #  That is, if the version is 4.0.1, then the × version is 4.0.0
> # If the value is missing, it is defaulted to true
> # If the value is of the form «...», then the ... is interpreted as a regular
> # expression
> # The property can be the short or long form as in the PropertyAliases.txt
> # The value (if enumerated) can be the short or long form as in
> # PropertyValueAliases.txt
> #
> # A UnicodeSet is a Boolean combination of properties and character ranges,
> # as you would see in Perl or other regular-expression languages. Examples:
> # [$General_Category:Unassigned-[a-zA-Z]]
> # For details, see http://oss.software.ibm.com/icu/userguide/unicodeSet.html
> #
> # WARNING: do not use \p{...} or [:...:] syntax, since those will be
> # ICU's current version of properties, not the current snapshot's.
> # Use the $ notation for properties (listed above) instead.
> #
> # When this file is parsed, an error message may contain <@>
> #  to indicate the location of an error in the input line.
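The relation symbols described in the header above map directly onto ordinary set comparisons. The following is a minimal Python sketch, assuming each side of a test line has already been resolved to a set of code points. The names `RELATIONS` and `holds` are hypothetical and are not part of the actual test tool, and the sample sets are tiny stand-ins rather than real property data.

```python
import operator

# Hypothetical mapping from the relation symbols in the header comment to
# Python set comparisons. Both ASCII and mathematical spellings are accepted.
RELATIONS = {
    "=": operator.eq,
    ">": operator.gt, "⊃": operator.gt,   # proper superset
    "≥": operator.ge, "⊇": operator.ge,   # superset
    "<": operator.lt, "⊂": operator.lt,   # proper subset
    "≤": operator.le, "⊆": operator.le,   # subset
    "!": lambda a, b: a.isdisjoint(b),    # no intersection
    # '?': the sets overlap, and neither contains the other
    "?": lambda a, b: not (a <= b or a >= b or a.isdisjoint(b)),
}


def holds(left, relation, right):
    """Return True if `left relation right` is satisfied for two sets."""
    try:
        test = RELATIONS[relation]
    except KeyError:
        raise ValueError(f"unknown relation: {relation!r}") from None
    return test(frozenset(left), frozenset(right))


# Toy stand-ins (not real property data): two Nd digits, one of which lacks
# Numeric_Type=Decimal, as in the Ethiopic failure reported in this message.
gc_nd = {0x0030, 0x1369}
nt_decimal = {0x0030}

print(holds(gc_nd, "=", nt_decimal))   # False
print(holds(gc_nd, "⊃", nt_decimal))   # True
```

An unknown symbol such as `@` raises `ValueError` in this sketch, which loosely corresponds to the "unknown relation" parse error shown in the examples.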
>
> # The following are not very interesting, but show examples of use
>
> #$GC:Zs ! $GC:Zp
> #$East_Asian_Width:Neutral ? $GC:Uppercase_Letter
> $GC:Zs ? $Name:«.*SPACE.*»
>
> # Examples of parsing errors
>
> # $LBA:Neutral =  $GC:Zp # example of nonexistent property
> # $LB:foo =  $GC:Zp # example of nonexistent value
> # $GC:Zs @ $GC:Zp # example of unknown relation
>
> # The following should be real invariants
> # For illustration, different alias styles are used
>
> $Line_Break:Unknown = [$General_Category:Unassigned
> $GeneralCategory:PrivateUse]
> $LB:OP = $GC:Ps
> $General_Category:Decimal_Number = $Numeric_Type:Decimal
>
> FALSE
> **** START Error Info ****
>
> In $Numeric_Type:Decimal, but not in $General_Category:Decimal_Number :
>
> # Total code points: 0
>
> Not in $Numeric_Type:Decimal, but in $General_Category:Decimal_Number :
> 1369..1371     # Nd   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
>
> # Total code points: 9
>
> In both $Numeric_Type:Decimal, and in $General_Category:Decimal_Number :
> 0030..0039     # Nd  [10] DIGIT ZERO..DIGIT NINE
> ...
> 1D7CE..1D7FF   # Nd  [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL MONOSPACE DIGIT NINE
>
> # Total code points: 259
> **** END Error Info ****
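The failure report above splits the two sides of a failed test into three groups: code points only on the right-hand side, code points only on the left-hand side, and code points in both. The following is a rough Python sketch of that breakdown. The functions `breakdown` and `format_report` are hypothetical names, the output lists individual code points instead of collapsing them into annotated ranges as the real report does, and the sample data is a small stand-in for the real property sets.

```python
def breakdown(left, right):
    """Split two sets into the three groups shown in a failure report."""
    left, right = set(left), set(right)
    return {
        "right_only": sorted(right - left),
        "left_only": sorted(left - right),
        "both": sorted(left & right),
    }


def format_report(left_name, left, right_name, right):
    """Render the three groups in the same order as the report above."""
    parts = breakdown(left, right)
    sections = [
        (f"In {right_name}, but not in {left_name} :", parts["right_only"]),
        (f"Not in {right_name}, but in {left_name} :", parts["left_only"]),
        (f"In both {right_name}, and in {left_name} :", parts["both"]),
    ]
    lines = []
    for title, code_points in sections:
        lines.append(title)
        lines.extend(f"{cp:04X}" for cp in code_points)
        lines.append(f"# Total code points: {len(code_points)}")
    return "\n".join(lines)


# Toy stand-in: ASCII digits plus ETHIOPIC DIGIT ONE..NINE on the left,
# ASCII digits only on the right.
gc_decimal_number = set(range(0x0030, 0x003A)) | set(range(0x1369, 0x1372))
numeric_type_decimal = set(range(0x0030, 0x003A))

print(format_report("$General_Category:Decimal_Number", gc_decimal_number,
                    "$Numeric_Type:Decimal", numeric_type_decimal))
```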
>
> $Whitespace ⊃ [$GC:Zs $GC:Zp $GC:Zl]
>
> # Comparisons across versions
>
> $ID_Start ⊇ $×ID_Start
> $ID_Continue ⊇ $×ID_Continue
>
> #$age:4.0.1 = $age4.0.0
>
> # Derivations
>
> $Math = [$GC:Sm $Other_Math]
> $Alphabetic = [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl $Other_Alphabetic]
> $Lowercase = [$GC:Ll $Other_Lowercase]
> $Uppercase = [$GC:Lu $Other_Uppercase]
> $ID_Start = [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl $Other_ID_Start]
> $ID_Continue = [$ID_Start $GC:Mn $GC:Mc $GC:Nd $GC:Pc]
> $Default_Ignorable_Code_Point = [[$Other_Default_Ignorable_Code_Point $GC:Cf
> $GC:Cc $GC:Cs $Variation_Selector $Noncharacter_Code_Point] -
> [$White_Space\uFFF9-\uFFFB]]
> $Grapheme_Extend = [$GC:Me $GC:Mn $Other_Grapheme_Extend]
> $Grapheme_Base = [^$GC:Cc $GC:Cf $GC:Cs $GC:Co $GC:Cn $GC:Zl $GC:Zp
> $Grapheme_Extend]
>
> # "Minimal" Other_: NOT hard requirements; just if we want to be minimal
>
> $Other_Math = [$Math - $GC:Sm]
> $Other_Alphabetic = [$Alphabetic - [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo
> $GC:Nl]]
>
> FALSE
> **** START Error Info ****
>
> In [$Alphabetic - [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl]], but not in
> $Other_Alphabetic :
>
> # Total code points: 0
>
> Not in [$Alphabetic - [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl]], but in
> $Other_Alphabetic :
> 16EE..16F0     # Nl   [3] RUNIC ARLAUG SYMBOL..RUNIC BELGTHOR SYMBOL
> 2160..2183     # Nl  [36] ROMAN NUMERAL ONE..ROMAN NUMERAL REVERSED ONE HUNDRED
> 1034A          # Nl       GOTHIC LETTER NINE HUNDRED
>
> # Total code points: 40
>
> In both [$Alphabetic - [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl]], and in
> $Other_Alphabetic :
> 0345           # Mn       COMBINING GREEK YPOGEGRAMMENI
> ...
> FB1E           # Mn       HEBREW POINT JUDEO-SPANISH VARIKA
>
> # Total code points: 389
> **** END Error Info ****
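The Other_Alphabetic failure above shows what "not minimal" means: the derived property is unchanged, but Other_xxx also lists code points that the base categories already contribute. The following Python sketch illustrates this for a simple union-style derivation. The function names are hypothetical, and the data is a toy stand-in using one Lu letter, one Nl numeral, and one Mn mark taken from the ranges in the report.

```python
def derive(base, other):
    """Derived property = union of the base categories and Other_xxx."""
    return set(base) | set(other)


def minimal_other(derived, base):
    """Smallest Other_xxx that still yields `derived` from `base`."""
    return set(derived) - set(base)


def redundant(base, other):
    """Members of Other_xxx already covered by the base categories."""
    return set(base) & set(other)


# Toy data: U+0041 (Lu) and U+2160 (Nl) stand for the base categories;
# Other_Alphabetic holds U+0345 (Mn) plus the overlapping U+2160.
base = {0x0041, 0x2160}
other_alphabetic = {0x0345, 0x2160}

alphabetic = derive(base, other_alphabetic)
print(sorted(f"{cp:04X}" for cp in redundant(base, other_alphabetic)))  # ['2160']
print(derive(base, minimal_other(alphabetic, base)) == alphabetic)      # True
```

Dropping the redundant members leaves the derived set unchanged, which is why the overlap is harmless. The Default_Ignorable_Code_Point derivation also subtracts a set, so this simple union form does not cover it.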
>
> $Other_Lowercase = [$Lowercase - $GC:Ll]
> $Other_Uppercase = [$Uppercase - $GC:Lu]
> $Other_ID_Start = [$ID_Start - [$GC:Lu $GC:Ll $GC:Lt $GC:Lm $GC:Lo $GC:Nl]]
> $Other_Default_Ignorable_Code_Point = [$Default_Ignorable_Code_Point -
> [[$GC:Cf $GC:Cc $GC:Cs $Variation_Selector $Noncharacter_Code_Point] -
> [$White_Space\uFFF9-\uFFFB]]]
>
> FALSE
> **** START Error Info ****
>
> In [$Default_Ignorable_Code_Point - [[$GC:Cf $GC:Cc $GC:Cs $Variation_Selector
> $Noncharacter_Code_Point] - [$White_Space\uFFF9-\uFFFB]]], but not in
> $Other_Default_Ignorable_Code_Point :
>
> # Total code points: 0
>
> Not in [$Default_Ignorable_Code_Point - [[$GC:Cf $GC:Cc $GC:Cs
> $Variation_Selector $Noncharacter_Code_Point] - [$White_Space\uFFF9-\uFFFB]]],
> but in $Other_Default_Ignorable_Code_Point :
> 200B           # Zs       ZERO WIDTH SPACE
>
> # Total code points: 1
>
> In both [$Default_Ignorable_Code_Point - [[$GC:Cf $GC:Cc $GC:Cs
> $Variation_Selector $Noncharacter_Code_Point] - [$White_Space\uFFF9-\uFFFB]]],
> and in $Other_Default_Ignorable_Code_Point :
> 034F           # Mn       COMBINING GRAPHEME JOINER
> ...
> E01F0..E0FFF   # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>
>
> # Total code points: 3779
> **** END Error Info ****
>
> $Other_Grapheme_Extend = [$Grapheme_Extend - [$GC:Me $GC:Mn]]
>
> **** SUMMARY ****
>
> ParseErrorCount=0
> TestFailureCount=3
>
>
> Mark


>
> ----- Original Message ----- 
> From: "Magda Danish (Unicode)" <v-magdad@microsoft.com>
> To: <book@unicode.org>
> Sent: Thu, 2004 Apr 15 09:16
> Subject: FW: Web Form: Subj: Other_Alphabetic and category Nl
>
>
> ________________________________
>
> Date/Time:    Thu Apr 15 00:16:56 EDT 2004
> Contact:      ernestcline@mindspring.com
> Report Type:  Error Report
> Opt Subject:  Other_Alphabetic and category Nl
>
> Unicode 4.0.1
>
> The Alphabetic property is defined in UCD.html and in DerivedCoreProperties.txt
> as being generated by:
>   Other_Alphabetic + Lu + Ll + Lt + Lm + Lo + Nl
> Why then are characters of General Category Nl given the Other_Alphabetic
> property?  It would seem that the only use made of the Other_Alphabetic property
> is to generate the Alphabetic property and that removing this property from the
> characters of General Category Nl would simplify things slightly as by the given
> definition, they should already be included.
>
> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> (End of Report)