Re: Tweaks to UTS#39 data and text
From: Mark Davis
I was contacted by people from Mozilla, who had found some problems in the text and data. I propose the following:
1. We clarify the use of Simplified & Traditional tests.
I responded to a question about these as follows. We should add an explanation of this.
“A. The test can only be applied if the characters are meant to be Chinese. So "写真だけの結婚式" is Japanese, and shouldn't be tested.
B. The test for S vs T should ask not whether the character has a T or S variant, but whether the character is itself an S or T variant. In any event, we need to be much clearer in that section about exactly how to use Unihan.”
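One possible reading of point B, as a minimal sketch: a character counts as a simplified variant if some other character lists it in its kSimplifiedVariant field, and likewise for traditional. The two small tables below are a hand-copied illustrative subset, not the Unihan database, and the function name `classify_chinese` is hypothetical. Per point A, the caller is assumed to have already decided that the text is meant to be Chinese.

```python
# Hand-copied illustrative subset of two Unihan fields; a real check would
# load the full kSimplifiedVariant / kTraditionalVariant data from Unihan.
K_SIMPLIFIED_VARIANT = {"寫": {"写"}, "國": {"国"}}   # character -> its simplified form(s)
K_TRADITIONAL_VARIANT = {"写": {"寫"}, "国": {"國"}}  # character -> its traditional form(s)

# A character IS a simplified variant if some character lists it as a
# simplified form (and similarly for traditional). This differs from asking
# whether the character HAS a variant, which is the lookup by key.
SIMPLIFIED_FORMS = set().union(*K_SIMPLIFIED_VARIANT.values())
TRADITIONAL_FORMS = set().union(*K_TRADITIONAL_VARIANT.values())


def classify_chinese(text):
    """Classify text already known to be Chinese (hypothetical helper)."""
    has_s = any(ch in SIMPLIFIED_FORMS for ch in text)
    has_t = any(ch in TRADITIONAL_FORMS for ch in text)
    if has_s and has_t:
        return "mixed"
    if has_s:
        return "simplified"
    if has_t:
        return "traditional"
    return "neither"


print(classify_chinese("写国"))  # simplified
print(classify_chinese("写國"))  # mixed
```

With the full data some characters may fall into both sets, so a production test would need a rule for that case; this sketch does not attempt one.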
2. We should include some characters we are wrongly excluding.
I responded to a question on U+30FB KATAKANA MIDDLE DOT as follows:
“This appears to be a production problem. The list marked as:
xxxx ; allowed ; inclusion
should be those characters that are in http://www.unicode.org/reports/tr39/#Identifier_Modification_Key under 'inclusion'.
Those characters should match the characters in http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Inclusion_in_Identifiers. However, the data does not currently reflect them. The missing characters are:
U+0027 ( ' ) APOSTROPHE
U+003A ( : ) COLON
U+058A ( ֊ ) ARMENIAN HYPHEN
U+2010 ( ‐ ) HYPHEN
U+2027 ( ‧ ) HYPHENATION POINT
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FB ( ・ ) KATAKANA MIDDLE DOT
Of course, for the purpose of IDNA, the ASCII characters are determined by the base spec, but the others are allowed. (They are subject to confusability tests, also, but that's a different story.)”
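A check of this kind could be automated along the following lines. This is only a sketch: `SAMPLE_DATA` is a made-up stand-in for the data file (its three lines are hypothetical), the parser assumes the `xxxx ; allowed ; inclusion` layout quoted above with optional `..` ranges and `#` comments, and `EXPECTED` is just the list of seven characters named above rather than the full UAX #31 table.

```python
# Made-up sample in the "xxxx ; allowed ; inclusion" layout (hypothetical lines).
SAMPLE_DATA = """\
00B7 ; allowed ; inclusion # sample line
200C..200D ; allowed ; inclusion # sample range
0259 ; restricted ; limited-use # sample line that must be ignored
"""

# The seven characters reported above as missing from the inclusion list.
EXPECTED = {0x0027, 0x003A, 0x058A, 0x2010, 0x2027, 0x30A0, 0x30FB}


def parse_inclusion(data):
    """Return the set of code points marked 'allowed ; inclusion'."""
    found = set()
    for line in data.splitlines():
        body = line.split("#", 1)[0].strip()  # drop trailing comment
        if not body:
            continue
        fields = [f.strip() for f in body.split(";")]
        if len(fields) < 3 or fields[1:3] != ["allowed", "inclusion"]:
            continue
        first, _, last = fields[0].partition("..")  # single value or range
        start = int(first, 16)
        end = int(last, 16) if last else start
        found.update(range(start, end + 1))
    return found


missing = sorted(EXPECTED - parse_inclusion(SAMPLE_DATA))
for cp in missing:
    print(f"U+{cp:04X}")
```

Run against the real file and the full table, the same set difference would list every candidate character that production dropped.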
We should also add text that makes it clear that:
Target applications may need to filter these characters. In particular, IDNs have specific requirements on characters that would exclude some of these; some other characters may be restricted on confusability grounds, notably hyphen.
3. There are some other characters that should be added:
“It turns out that the following IDNs:
all fail our safety checks, despite being the normal spelling in the languages in question of their own names. The specific problem characters that were flagged are:
U+0259 LATIN SMALL LETTER SCHWA (limited-use)
U+2018 LEFT SINGLE QUOTATION MARK (non-xid)
U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE (technical)
U+05B4 HEBREW POINT HIRIQ (limited-use)”
4. Markus: “UTS #39 should have a description for how its data is generated.” I agree.
5. We have the request:
“Is there a way to be notified of updates to xidmodifications.txt, other
than writing a script to download it every day and check the date in the
I responded: “It gets released with each version of Unicode, which is about once a year. You should also look for announcements of the beta of a new Unicode version. unicode.org has mailing lists, a blog, and tweets...”
I’m not sure what else we can do about notification. Perhaps the committee has some ideas?
6. There are 4 other characters that are in IDNA2008, but not in the inclusion list.
3007; PVALID # IDEOGRAPHIC NUMBER ZERO
Of these, U+3007 is already in the recommended list (it is in XID_Continue). The three others are listed below.
06FD; PVALID # ARABIC SIGN SINDHI AMPERSAND
06FE; PVALID # ARABIC SIGN SINDHI POSTPOSITION MEN
0375; CONTEXTO # GREEK LOWER NUMERAL SIGN (KERAIA)
These are allowed in #46, but not in #39, because they are not XID_Continue (they are General_Category=Other_Symbol and General_Category=Modifier_Symbol). These are bizarre additions to IDNA2008, but for consistency I propose that we broaden the definition of the ‘inclusion’ value in #39 to add these three characters, and document the reason: that it is for compatibility with IDNA2008 and consistency with #46. That would mean adding to the data file as:
06FD ; allowed ; inclusion # ARABIC SIGN SINDHI AMPERSAND
06FE ; allowed ; inclusion # ARABIC SIGN SINDHI POSTPOSI...
0375 ; allowed ; inclusion # GREEK LOWER NUMERAL SIGN...
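For illustration, entries like these can be produced mechanically, and the stated reason for their exclusion can be spot-checked. The sketch below uses Python's `unicodedata` module; the helper name `inclusion_line` is hypothetical, the names come from whatever Unicode version the Python build ships, and `str.isidentifier` is used only as an approximation of XID_Continue.

```python
import unicodedata

# The three IDNA2008 characters proposed for the 'inclusion' value.
PROPOSED = [0x06FD, 0x06FE, 0x0375]


def inclusion_line(cp):
    """Format one data-file entry in the layout shown above (hypothetical helper)."""
    return f"{cp:04X} ; allowed ; inclusion # {unicodedata.name(chr(cp))}"


for cp in PROPOSED:
    ch = chr(cp)
    # Python identifiers follow XID_Start/XID_Continue, so "a" + ch is an
    # identifier only if ch is XID_Continue (an approximation, see above).
    xid_continue = ("a" + ch).isidentifier()
    print(inclusion_line(cp), "| gc =", unicodedata.category(ch),
          "| XID_Continue =", xid_continue)
```

The General_Category values printed are `So` for the two Arabic signs and `Sk` for the Greek sign, matching the Other_Symbol and Modifier_Symbol classification given above.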
7. We should have a special review of ASCII non-alphanumerics for confusables. We have focused on alphanumerics, but these characters are often used as syntax characters, so the confusables are especially interesting. For example, possibilities to review for # and + are:
U+0023 ( # ) NUMBER SIGN
U+FE5F ( ﹟ ) SMALL NUMBER SIGN
U+FF03 ( ＃ ) FULLWIDTH NUMBER SIGN
U+266F ( ♯ ) MUSIC SHARP SIGN
U+002B ( + ) PLUS SIGN
U+1429 ( ᐩ ) CANADIAN SYLLABICS FINAL PLUS
U+207A ( ⁺ ) SUPERSCRIPT PLUS SIGN
U+208A ( ₊ ) SUBSCRIPT PLUS SIGN
U+FE62 ( ﹢ ) SMALL PLUS SIGN
U+FF0B ( ＋ ) FULLWIDTH PLUS SIGN
and a bit further afield:
U+2795 ( ➕ ) HEAVY PLUS SIGN
U+2629 ( ☩ ) CROSS OF JERUSALEM
U+16ED ( ᛭ ) RUNIC CROSS PUNCTUATION
U+2719 ( ✙ ) OUTLINED GREEK CROSS
U+271A ( ✚ ) HEAVY GREEK CROSS
U+271B ( ✛ ) OPEN CENTRE CROSS
U+1F542 ( 🕂 ) CROSS POMMEE
I suggest that Michel and I get an action to review such characters and add others.
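As a possible first pass for such a review, the candidates can be split into those that NFKC normalization already folds to the ASCII character and those that are only visually similar, since the second group is what confusables data would have to cover by hand. This is a sketch, not the method used to build the confusables data; the function name `split_by_nfkc` is hypothetical and the candidate lists are the first two groups given above.

```python
import unicodedata

# Candidates from the lists above for '#' and '+'.
CANDIDATES = {
    "#": [0xFE5F, 0xFF03, 0x266F],
    "+": [0x1429, 0x207A, 0x208A, 0xFE62, 0xFF0B],
}


def split_by_nfkc(target, code_points):
    """Separate code points that NFKC maps to `target` from the rest."""
    folded, visual_only = [], []
    for cp in code_points:
        if unicodedata.normalize("NFKC", chr(cp)) == target:
            folded.append(cp)
        else:
            visual_only.append(cp)
    return folded, visual_only


for target, cps in CANDIDATES.items():
    folded, visual_only = split_by_nfkc(target, cps)
    print(target, "NFKC-folded:", [f"U+{cp:04X}" for cp in folded])
    print(target, "visual only:", [f"U+{cp:04X}" for cp in visual_only])
```

Here U+266F MUSIC SHARP SIGN and U+1429 CANADIAN SYLLABICS FINAL PLUS land in the visual-only group, so an application that normalizes with NFKC would still see them as distinct from # and +.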