L2/04-012
Subject: Ignoring Hyphens
Source: Mark Davis
Date: Jan 9, 2004

We say that when comparing property values one should ignore case, whitespace, underbars, and hyphens. There are some exceptions to this for backwards compatibility, which are documented in the following list:

=====
a. U+0F68 TIBETAN LETTER A and U+0F60 TIBETAN LETTER -A
b. U+0FB8 TIBETAN SUBJOINED LETTER A and U+0FB0 TIBETAN SUBJOINED LETTER -A
c. U+116C HANGUL JUNGSEONG OE and U+1180 HANGUL JUNGSEONG O-E
=====

Asmus has pointed out that there are some cases of new character names promoted by WG2 that by analogy should also follow the pattern of -X, and would have to be added to this exception list. He has suggested that we try to capture the exceptions as a rule rather than have a fixed list which we would have to maintain. Here is a proposed rule to do this:

R1. Ignore case, whitespace, underbar, and all medial hyphens except the hyphen in U+1180.

This adds 11 Tibetan characters where the hyphen is not ignored, but arguably ones where the hyphen is somehow distinctive. It now has only one exceptional case (and we don't anticipate adding any similar Hangul in the future).

=====

Here is some data behind that:

A. As it turns out, there are 72,871 Unicode 4.0 characters containing at least one hyphen. I will save your mailers and not list them!!

B.
Only the following thirteen characters contain non-medial hyphens:

property name: "name"; property value: "(?i)(.*\s)?-.*"

0F02..0F03 # So [2] U+0F02 TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA..U+0F03 TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA
0F13 # So [1] U+0F13 TIBETAN MARK CARET -DZUD RTAGS ME LONG CAN
0F17 # So [1] U+0F17 TIBETAN ASTROLOGICAL SIGN SGRA GCAN -CHAR RTAGS
0F18 # Mn [1] U+0F18 TIBETAN ASTROLOGICAL SIGN -KHYUD PA
0F36 # So [1] U+0F36 TIBETAN MARK CARET -DZUD RTAGS BZHI MIG CAN
0F39 # Mn [1] U+0F39 TIBETAN MARK TSA -PHRU
0F60 # Lo [1] U+0F60 TIBETAN LETTER -A
0FB0 # Mn [1] U+0FB0 TIBETAN SUBJOINED LETTER -A
0FC3 # So [1] U+0FC3 TIBETAN CANTILLATION SIGN SBUB -CHAL
0FCA..0FCC # So [3] U+0FCA TIBETAN SYMBOL NOR BU NYIS -KHYIL..U+0FCC TIBETAN SYMBOL NOR BU BZHI -KHYIL
# Total: 13

property name: "name"; property value: "(?i).*-(\s.*)?"

# Total: 0

Only 2 of these collide if hyphens are ignored (currently, as discussed above).

C. The following 2 characters contain a terminal non-medial hyphen followed by a single character:

property name: "name"; property value: "(?i).*[^a-z0-9]-."

0F60 # Lo [1] U+0F60 TIBETAN LETTER -A
0FB0 # Mn [1] U+0FB0 TIBETAN SUBJOINED LETTER -A
# Total: 2

D. The following 4 characters contain "O-E" (only one of which collides if the hyphen is ignored):

property name: "name"; property value: "(?i).*o-e.*"

117C # Lo [1] U+117C HANGUL JUNGSEONG EO-EU
117F..1180 # Lo [2] U+117F HANGUL JUNGSEONG O-EO..U+1180 HANGUL JUNGSEONG O-E
118B # Lo [1] U+118B HANGUL JUNGSEONG U-EO-EU
# Total: 4

E. The following end with "-E":

property name: "name"; property value: "(?i).*-e"

1180 # Lo [1] U+1180 HANGUL JUNGSEONG O-E
1190 # Lo [1] U+1190 HANGUL JUNGSEONG YU-E
# Total: 2

Mark
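
=====

For illustration only, rule R1 could be sketched roughly as follows in Python. This is not a reference implementation. The function name loose_key is hypothetical, and the sketch assumes that a "medial" hyphen means one with an ASCII letter or digit directly on both sides, which matches the regular expressions used in B above but is not spelled out in R1 itself. The U+1180 exception is handled by checking for a name that ends in "O-E" and otherwise reduces to HANGUL JUNGSEONG OE.

```python
import re

# Assumption: "medial" = hyphen with an ASCII letter or digit on both sides.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])")
# Whitespace and underbars are always ignored.
_IGNORABLE = re.compile(r"[\s_]+")


def loose_key(name: str) -> str:
    """Return a comparison key for a character name under proposed rule R1.

    Hypothetical helper: two names match loosely when their keys are equal.
    """
    upper = name.upper().strip()
    # Remove medial hyphens first, while spaces and underbars still mark
    # which hyphens are non-medial (for example the one in "LETTER -A").
    key = _IGNORABLE.sub("", _MEDIAL_HYPHEN.sub("", upper))
    # Single exception: keep the hyphen of U+1180 HANGUL JUNGSEONG O-E so
    # that it stays distinct from U+116C HANGUL JUNGSEONG OE.
    if key == "HANGULJUNGSEONGOE" and upper.endswith("O-E"):
        return "HANGULJUNGSEONGO-E"
    return key


if __name__ == "__main__":
    for sample in (
        "TIBETAN LETTER A",
        "tibetan_letter_-a",
        "HANGUL JUNGSEONG OE",
        "Hangul Jungseong O-E",
        "HANGUL JUNGSEONG U-EO-EU",
    ):
        print(sample, "->", loose_key(sample))
```

With this sketch the three pairs in the exception list (a, b, c) get distinct keys, while names that differ only in case, spacing, underbars, or medial hyphens get the same key.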