re: Fixes to UAX#31, UTS#39
from: Mark Davis
Here are a few fixes that are needed, resulting from an investigation of a report of problems from Mozilla (Firefox).
1. We removed colon ( : ) from MidLetter in #29:
However, the reason that it is in the inclusion table in #31 is that it was in MidLetter. So we should remove it from that table in #31 as well.
That is, remove “003A (:) COLON” from:
2. Being in that table in #31 is the basis for the ‘inclusion’ value in #39:
However, the data in http://www.unicode.org/Public/security/revision-05/xidmodifications.txt is not aligned with the values in #31. In particular, KATAKANA MIDDLE DOT is missing, even though it is part of IDNA2008:
30FB; CONTEXTO # KATAKANA MIDDLE DOT
That is, add the following lines to xidmodifications.txt (and remove the corresponding entries from the other values):
0027 ; allowed ; inclusion # ( ' ) APOSTROPHE
058A ; allowed ; inclusion # ( ֊ ) ARMENIAN HYPHEN
2010 ; allowed ; inclusion # ( ‐ ) HYPHEN
2027 ; allowed ; inclusion # ( ‧ ) HYPHENATION POINT
30A0 ; allowed ; inclusion # ( ゠ ) KATAKANA-HIRAGANA...
30FB ; allowed ; inclusion # ( ・ ) KATAKANA MIDDLE DOT
However, we should also add text that makes it clear that:
Target applications may need to filter these characters. In particular, IDNs have specific requirements on characters that would exclude some of these; some other characters may be restricted on confusability grounds, notably hyphen.
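The filtering described above might look like this in a consuming application. This is only a minimal sketch, assuming the `code ; status ; reason # comment` layout of the lines proposed in item 2; `parse_line` and `APP_EXCLUDED` are hypothetical names, and the exclusion set is purely illustrative, not a recommendation.

```python
# Proposed data lines from item 2, copied as written (comments are ignored).
PROPOSED = """\
0027 ; allowed ; inclusion # ( ' ) APOSTROPHE
058A ; allowed ; inclusion # ( ֊ ) ARMENIAN HYPHEN
2010 ; allowed ; inclusion # ( ‐ ) HYPHEN
2027 ; allowed ; inclusion # ( ‧ ) HYPHENATION POINT
30A0 ; allowed ; inclusion # ( ゠ ) KATAKANA-HIRAGANA...
30FB ; allowed ; inclusion # ( ・ ) KATAKANA MIDDLE DOT
"""


def parse_line(line):
    """Parse one 'code ; status ; reason # comment' line.

    Returns (code_point, status, reason), or None for blank or malformed
    lines. Handles single code points only; ranges are out of scope here.
    """
    data, _, _comment = line.partition("#")
    fields = [field.strip() for field in data.split(";")]
    if len(fields) != 3 or not fields[0]:
        return None
    return int(fields[0], 16), fields[1], fields[2]


entries = [e for e in map(parse_line, PROPOSED.splitlines()) if e is not None]

# Characters carrying the 'inclusion' value.
inclusion = {cp for cp, status, reason in entries
             if status == "allowed" and reason == "inclusion"}

# Hypothetical application policy: drop HYPHEN on confusability grounds.
APP_EXCLUDED = {0x2010}
app_allowed = inclusion - APP_EXCLUDED

print(sorted(f"U+{cp:04X}" for cp in app_allowed))
```

An IDN-oriented application would presumably use a different exclusion set, derived from its own requirements rather than hard-coded as here.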
3. There are 4 other characters that are in IDNA2008, but not in the inclusion list.
3007; PVALID # IDEOGRAPHIC NUMBER ZERO
Of these, U+3007 is already in the recommended list (it is in XID_Continue). The three others are listed below.
06FD; PVALID # ARABIC SIGN SINDHI AMPERSAND
06FE; PVALID # ARABIC SIGN SINDHI POSTPOSITION MEN
0375; CONTEXTO # GREEK LOWER NUMERAL SIGN (KERAIA)
These are allowed in #46, but not in #39, because they are not XID_Continue (they are General_Category=Other_Symbol and General_Category=Modifier_Symbol). These are bizarre additions to IDNA2008, but for consistency I propose that we broaden the definition of the ‘inclusion’ value in #39 to add these three characters, and document the reason: compatibility with IDNA2008 and consistency with #46. That would mean adding the following to the data file:
06FD ; allowed ; inclusion # ARABIC SIGN SINDHI AMPERSAND
06FE ; allowed ; inclusion # ARABIC SIGN SINDHI POSTPOSI...
0375 ; allowed ; inclusion # GREEK LOWER NUMERAL SIGN...
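The property claims in item 3 can be spot-checked with Python's standard `unicodedata` module. This is a rough sketch, not part of the proposal: `is_xid_continue` is a hypothetical helper that relies on Python's identifier rules following XID_Start/XID_Continue, and the results reflect whatever Unicode version the interpreter ships with.

```python
import unicodedata

# The four characters discussed in item 3.
CANDIDATES = [0x3007, 0x06FD, 0x06FE, 0x0375]


def is_xid_continue(ch):
    # Assumption: "a" + ch is a valid Python identifier when ch is
    # XID_Continue in the interpreter's Unicode data.
    return ("a" + ch).isidentifier()


# Map each code point to (General_Category, XID_Continue?).
report = {
    cp: (unicodedata.category(chr(cp)), is_xid_continue(chr(cp)))
    for cp in CANDIDATES
}

for cp, (gc, xidc) in report.items():
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X} gc={gc} XID_Continue={xidc} {name}")
```

U+3007 should report as XID_Continue, while the other three should report categories So or Sk and fail the identifier test, which is why they need an explicit 'inclusion' entry.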
4. We should have a special review of ASCII non-alphanumerics for confusables. We have focused on alphanumerics, but these characters are often used as syntax characters, so the confusables are especially interesting. For example, possibilities to review for # and + are:
U+0023 ( # ) NUMBER SIGN
U+FE5F ( ﹟ ) SMALL NUMBER SIGN
U+FF03 ( ＃ ) FULLWIDTH NUMBER SIGN
U+266F ( ♯ ) MUSIC SHARP SIGN
U+002B ( + ) PLUS SIGN
U+1429 ( ᐩ ) CANADIAN SYLLABICS FINAL PLUS
U+207A ( ⁺ ) SUPERSCRIPT PLUS SIGN
U+208A ( ₊ ) SUBSCRIPT PLUS SIGN
U+FE62 ( ﹢ ) SMALL PLUS SIGN
U+FF0B ( ＋ ) FULLWIDTH PLUS SIGN
and a bit further afield:
U+2795 ( ➕ ) HEAVY PLUS SIGN
U+2629 ( ☩ ) CROSS OF JERUSALEM
U+16ED ( ᛭ ) RUNIC CROSS PUNCTUATION
U+2719 ( ✙ ) OUTLINED GREEK CROSS
U+271A ( ✚ ) HEAVY GREEK CROSS
U+271B ( ✛ ) OPEN CENTRE CROSS
U+1F542 ( 🕂 ) CROSS POMMEE
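As a rough first pass over the candidates above, one can check which of them fold to the ASCII character under NFKC. This is only a sketch: NFKC is not the confusable mapping of #39, and `nfkc_target` is a hypothetical helper. It illustrates that the compatibility variants are caught by normalization, while purely visual look-alikes are not and so need the review.

```python
import unicodedata

TARGETS = {"#", "+"}

# Candidates from the lists above (the ASCII originals are omitted).
CANDIDATES = [
    0xFE5F, 0xFF03, 0x266F,                   # number-sign look-alikes
    0x1429, 0x207A, 0x208A, 0xFE62, 0xFF0B,   # plus-sign look-alikes
    0x2795, 0x2629, 0x16ED, 0x2719, 0x271A, 0x271B, 0x1F542,  # further afield
]


def nfkc_target(cp):
    """Return '#' or '+' if the character folds to it under NFKC, else None."""
    folded = unicodedata.normalize("NFKC", chr(cp))
    return folded if folded in TARGETS else None


caught = {cp: nfkc_target(cp) for cp in CANDIDATES}
missed = [cp for cp, target in caught.items() if target is None]

for cp, target in caught.items():
    status = f"folds to {target!r}" if target else "not caught by NFKC"
    print(f"U+{cp:04X} {status}")
```

Characters in `missed`, such as U+266F and U+1429, are the ones that only a confusables review would flag.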