L2/04-012

Subject: Ignoring Hyphens
Source: Mark Davis
Date: Jan 9, 2004

We say that when comparing property values one should ignore case, whitespace,
underbars, and hyphens. There are some exceptions to this for backwards
compatibility, which are documented in the following list:

=====

  a. U+0F68 TIBETAN LETTER A and
     U+0F60 TIBETAN LETTER -A

  b. U+0FB8 TIBETAN SUBJOINED LETTER A and
     U+0FB0 TIBETAN SUBJOINED LETTER -A

  c. U+116C HANGUL JUNGSEONG OE and
     U+1180 HANGUL JUNGSEONG O-E

=====

Asmus has pointed out that there are some cases of new character names promoted
by WG2 that by analogy should also follow the pattern of -X, and would have to
be added to this exception list. He has suggested that we try to capture the
exceptions as a rule rather than have a fixed list which we would have to
maintain.

Here is a proposed rule to do this:

R1. Ignore case, whitespace, underbar, and all medial hyphens except the hyphen
in U+1180 HANGUL JUNGSEONG O-E.

This adds 11 Tibetan characters in which the hyphen is no longer ignored, but
arguably the hyphen is distinctive in those names. The rule then has only one
exceptional case (and we don't anticipate adding any similar Hangul characters
in the future).
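
As an illustration only, rule R1 could be sketched as a function that maps a
name to a comparison key. The names loose_key, MEDIAL_HYPHEN, SEPARATORS and
EXCEPTION_KEY are hypothetical, and the sketch assumes that a "medial" hyphen
is one with a character other than whitespace, underbar, or hyphen on both
sides.

```python
import re

# Assumption: a medial hyphen has a non-separator, non-hyphen character
# immediately before and after it.
MEDIAL_HYPHEN = re.compile(r"(?<=[^\s_-])-(?=[^\s_-])")
SEPARATORS = re.compile(r"[\s_]+")

# The single exception in R1: U+1180 HANGUL JUNGSEONG O-E keeps its hyphen.
EXCEPTION_KEY = "hanguljungseongo-e"


def loose_key(name):
    """Return a comparison key for a character name under rule R1."""
    lowered = name.lower()
    # Check the exception first, before any hyphen is dropped.
    if SEPARATORS.sub("", lowered) == EXCEPTION_KEY:
        return EXCEPTION_KEY
    # Drop medial hyphens while whitespace is still present, so that a
    # hyphen following a space (as in "LETTER -A") is recognized as
    # non-medial and kept.
    without_medial = MEDIAL_HYPHEN.sub("", lowered)
    return SEPARATORS.sub("", without_medial)
```

With this key, TIBETAN LETTER A and TIBETAN LETTER -A stay distinct, as do
HANGUL JUNGSEONG OE and HANGUL JUNGSEONG O-E, while a name such as
HANGUL JUNGSEONG O-EO still compares equal with or without its hyphen.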

=====

Here is some data behind that:

A. As it turns out, there are 72,871 Unicode 4.0 characters whose names contain
at least one hyphen. I will spare your mailers and not list them!

B. Only the following thirteen characters contain non-medial hyphens:

property name: "name"; property value: "(?i)(.*\s)?-.*"
0F02..0F03    # So  [2]    U+0F02 TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA..U+0F03 TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA
0F13          # So  [1]    U+0F13 TIBETAN MARK CARET -DZUD RTAGS ME LONG CAN
0F17          # So  [1]    U+0F17 TIBETAN ASTROLOGICAL SIGN SGRA GCAN -CHAR RTAGS
0F18          # Mn  [1]    U+0F18 TIBETAN ASTROLOGICAL SIGN -KHYUD PA
0F36          # So  [1]    U+0F36 TIBETAN MARK CARET -DZUD RTAGS BZHI MIG CAN
0F39          # Mn  [1]    U+0F39 TIBETAN MARK TSA -PHRU
0F60          # Lo  [1]    U+0F60 TIBETAN LETTER -A
0FB0          # Mn  [1]    U+0FB0 TIBETAN SUBJOINED LETTER -A
0FC3          # So  [1]    U+0FC3 TIBETAN CANTILLATION SIGN SBUB -CHAL
0FCA..0FCC    # So  [3]    U+0FCA TIBETAN SYMBOL NOR BU NYIS -KHYIL..U+0FCC TIBETAN SYMBOL NOR BU BZHI -KHYIL
# Total: 13

property name: "name"; property value: "(?i).*-(\s.*)?"
# Total: 0

Only 2 collide if hyphens are ignored (currently, as discussed above).
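
For anyone who wants to repeat the test in B on their own name list, here is a
minimal sketch. It assumes the two queries are meant as full-string matches,
one for a hyphen at the start of the name or after whitespace and one for a
hyphen at the end of the name or before whitespace. The helper
has_non_medial_hyphen and the SAMPLE list are hypothetical; SAMPLE is a
hand-picked subset of names from this message, not the full Unicode 4.0 list.

```python
import re

# Hyphen at the start of the name or right after whitespace.
LEADING = re.compile(r"(?i)(.*\s)?-.*")
# Hyphen at the end of the name or right before whitespace.
TRAILING = re.compile(r"(?i).*-(\s.*)?")


def has_non_medial_hyphen(name):
    """True if the name contains at least one non-medial hyphen."""
    return bool(LEADING.fullmatch(name) or TRAILING.fullmatch(name))


# Hand-picked sample taken from names quoted in this message.
SAMPLE = [
    "TIBETAN LETTER A",
    "TIBETAN LETTER -A",
    "TIBETAN MARK TSA -PHRU",
    "HANGUL JUNGSEONG O-E",
    "HANGUL JUNGSEONG U-EO-EU",
]

non_medial = [n for n in SAMPLE if has_non_medial_hyphen(n)]
```

On this sample only the two Tibetan names with a leading hyphen are selected;
the Hangul names contain medial hyphens only.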

C. The following 2 characters contain a non-medial hyphen followed by a single
final character.

property name: "name"; property value: "(?i).*[^a-z0-9]-."
0F60          # Lo  [1]    U+0F60 TIBETAN LETTER -A
0FB0          # Mn  [1]    U+0FB0 TIBETAN SUBJOINED LETTER -A
# Total: 2

D. The following 4 characters contain "O-E" (only one of which collides if the
hyphen is ignored).

property name: "name"; property value: "(?i).*o-e.*"
117C          # Lo  [1]    U+117C HANGUL JUNGSEONG EO-EU
117F..1180    # Lo  [2]    U+117F HANGUL JUNGSEONG O-EO..U+1180 HANGUL JUNGSEONG O-E
118B          # Lo  [1]    U+118B HANGUL JUNGSEONG U-EO-EU
# Total: 4
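
The collision claim in D can be checked on a small sample with a sketch like
the following. It groups names by a key that ignores case, whitespace,
underbars, and every hyphen, then reports groups with more than one member.
The function names and the SAMPLE list are hypothetical, and SAMPLE holds only
the four names above plus U+116C HANGUL JUNGSEONG OE, so it says nothing about
collisions with names outside that set.

```python
from collections import defaultdict


def key_ignoring_all_hyphens(name):
    """Key that drops case, whitespace, underbars, and every hyphen."""
    return "".join(
        ch for ch in name.lower() if not (ch.isspace() or ch in "_-")
    )


def collisions(names):
    """Return groups of names that share the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[key_ignoring_all_hyphens(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


# The four "O-E" names from D, plus U+116C for comparison.
SAMPLE = [
    "HANGUL JUNGSEONG OE",
    "HANGUL JUNGSEONG EO-EU",
    "HANGUL JUNGSEONG O-EO",
    "HANGUL JUNGSEONG O-E",
    "HANGUL JUNGSEONG U-EO-EU",
]

colliding = collisions(SAMPLE)
```

On this sample the only group reported is the pair HANGUL JUNGSEONG OE and
HANGUL JUNGSEONG O-E.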

E. The following 2 characters have names ending with "-E":

property name: "name"; property value: "(?i).*-e"
1180          # Lo  [1]    U+1180 HANGUL JUNGSEONG O-E
1190          # Lo  [1]    U+1190 HANGUL JUNGSEONG YU-E
# Total: 2


Mark