L2/06-386

Source: Mark Davis
Date: November 9, 2006
Subject: Properties not preserving canonical equivalence


In response to the issues raised by Kent, I wrote a test during the break in the meeting and ran it over a number of properties to determine when a character's property values change if the character is normalized to NFC. As a simplification, I looked only at whether the properties of the first character of the decomposition change. This catches all the common cases: singletons and base+accents. The results are below. In particular, it looks like we also need to address the Word_Break property for FA30..FA6A, and we might want to align some others, such as East_Asian_Width.

Properties tested:

ASCII_Hex_Digit, Alphabetic, Bidi_Class, Bidi_Control, Bidi_Mirrored, Canonical_Combining_Class, Case_Fold_Turkish_I, Dash, Default_Ignorable_Code_Point, Diacritic, East_Asian_Width, Extender, General_Category, Grapheme_Base, Grapheme_Cluster_Break, Grapheme_Extend, Grapheme_Link, Hangul_Syllable_Type, Hex_Digit, Hyphen, IDS_Binary_Operator, IDS_Trinary_Operator, ID_Continue, ID_Start, Ideographic, Join_Control, Joining_Group, Joining_Type, Line_Break, Logical_Order_Exception, Lowercase, Math, Non_Break, Noncharacter_Code_Point, Numeric_Type, Numeric_Value, Other_Alphabetic, Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend, Other_ID_Continue, Other_ID_Start, Other_Lowercase, Other_Math, Other_Uppercase, Pattern_Syntax, Pattern_White_Space, Quotation_Mark, Radical, STerm, Script, Sentence_Break, Soft_Dotted, Terminal_Punctuation, Uppercase, Variation_Selector, White_Space, Word_Break, XID_Continue, XID_Start
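
For illustration, a check of this kind can be sketched in Python. This is a minimal sketch, not the test that was actually run. It assumes Python's standard unicodedata module, which exposes only a small subset of the properties listed above, and it compares each character against the first character of its NFC form. The names PROPERTIES, changed_properties, and scan are hypothetical. Results depend on the Unicode version bundled with the Python build, so they will not necessarily match the listing below.

```python
import sys
import unicodedata

# Assumption: only the properties that Python's unicodedata module exposes
# are checked here; the original test covered many more.
PROPERTIES = {
    "General_Category": unicodedata.category,
    "Bidi_Class": unicodedata.bidirectional,
    "Canonical_Combining_Class": unicodedata.combining,
    "East_Asian_Width": unicodedata.east_asian_width,
    "Bidi_Mirrored": unicodedata.mirrored,
    "Numeric_Value": lambda ch: unicodedata.numeric(ch, None),
}


def changed_properties(ch):
    """Return the names of properties whose values differ between ch and
    the first character of NFC(ch). Empty if NFC leaves ch unchanged."""
    nfc = unicodedata.normalize("NFC", ch)
    if nfc == ch:
        return []
    first = nfc[0]  # simplification: look only at the first character
    return [name for name, get in PROPERTIES.items() if get(ch) != get(first)]


def scan(start=0, stop=sys.maxunicode + 1):
    """Group code points in [start, stop) by the set of properties that change."""
    groups = {}
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        names = changed_properties(chr(cp))
        if names:
            groups.setdefault(tuple(names), []).append(cp)
    return groups


if __name__ == "__main__":
    for names, cps in sorted(scan().items()):
        print("[" + ", ".join(names) + "]")
        for cp in cps:
            print(f"{cp:04X}")
        print(f"# Total code points: {len(cps)}")
        print()
```

Run as a script, this prints groups in roughly the same shape as the listing below, without the General_Category and name annotations.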

Cases where differences are found:

[Alphabetic, General_Category, ID_Continue, ID_Start, Script, Sentence_Break, Word_Break, XID_Continue, XID_Start]
0374           # Sk       GREEK NUMERAL SIGN

# Total code points: 1

[East_Asian_Width, Pattern_Syntax]
037E           # Po       GREEK QUESTION MARK

# Total code points: 1

[Diacritic, East_Asian_Width, Extender, Line_Break, Terminal_Punctuation, Word_Break, XID_Continue]
0387           # Po       GREEK ANO TELEIA

# Total code points: 1

[Canonical_Combining_Class]
0F73           # Mn       TIBETAN VOWEL SIGN II
0F75           # Mn       TIBETAN VOWEL SIGN UU
0F81           # Mn       TIBETAN VOWEL SIGN REVERSED II

# Total code points: 3

[East_Asian_Width]
1FBE           # L&       GREEK PROSGEGRAMMENI
212A           # L&       KELVIN SIGN

# Total code points: 2

[East_Asian_Width, Pattern_Syntax, Script]
1FEF           # Sk       GREEK VARIA

# Total code points: 1

[East_Asian_Width, Line_Break, Script]
1FFD           # Sk       GREEK OXIA

# Total code points: 1

[East_Asian_Width, Line_Break]
212B           # L&       ANGSTROM SIGN

# Total code points: 1

[Bidi_Mirrored]
2ADC           # Sm       FORKING

# Total code points: 1

[Numeric_Type, Numeric_Value]
F96B           # Lo       CJK COMPATIBILITY IDEOGRAPH-F96B
F973           # Lo       CJK COMPATIBILITY IDEOGRAPH-F973
F978           # Lo       CJK COMPATIBILITY IDEOGRAPH-F978
F9B2           # Lo       CJK COMPATIBILITY IDEOGRAPH-F9B2
F9D1           # Lo       CJK COMPATIBILITY IDEOGRAPH-F9D1
F9D3           # Lo       CJK COMPATIBILITY IDEOGRAPH-F9D3
F9FD           # Lo       CJK COMPATIBILITY IDEOGRAPH-F9FD
2F890          # Lo       CJK COMPATIBILITY IDEOGRAPH-2F890

# Total code points: 8

[Ideographic, Word_Break]
FA30..FA6A     # Lo  [59] CJK COMPATIBILITY IDEOGRAPH-FA30..CJK COMPATIBILITY IDEOGRAPH-FA6A

# Total code points: 59