Re: Tweaks to UTS#39 data and text
From: Mark Davis
I was contacted by people from Mozilla, who had found some problems in the text and data. I propose the following:
1. We should clarify the use of the Simplified & Traditional tests.
I responded to a question about these as follows. We should add an explanation of this.
“A. The test can only be applied if the characters are meant to be Chinese. So "写真だけの結婚式" is Japanese, and shouldn't be tested.
B. The test for S vs. T should check not whether the character has a T or S variant, but whether the character itself is an S or T variant. In any event, we need to be much clearer in that section about exactly how to use Unihan.”
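To make point B concrete, here is a minimal sketch of one possible reading. It assumes the tab-separated layout of Unihan_Variants.txt and its kSimplifiedVariant / kTraditionalVariant fields; the four sample lines, and the names parse_cp, load_variant_sets and classify, are made up for illustration and are not UTS #39 text. A character counts as "S" if it appears as the value of some kSimplifiedVariant entry, and as "T" if it appears as the value of some kTraditionalVariant entry.

```python
# Tiny illustrative sample in the tab-separated layout of Unihan_Variants.txt.
SAMPLE = "\n".join([
    "U+5199\tkTraditionalVariant\tU+5BEB",
    "U+5BEB\tkSimplifiedVariant\tU+5199",
    "U+56FD\tkTraditionalVariant\tU+570B",
    "U+570B\tkSimplifiedVariant\tU+56FD",
])

def parse_cp(token):
    """Turn 'U+5199' (optionally followed by '<source' tags) into a character."""
    return chr(int(token.split("<")[0][2:], 16))

def load_variant_sets(text):
    """Return (simplified, traditional): the characters that ARE such variants."""
    simplified, traditional = set(), set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        _source, field, values = line.split("\t")
        targets = {parse_cp(v) for v in values.split()}
        if field == "kSimplifiedVariant":
            simplified |= targets      # the values are simplified forms
        elif field == "kTraditionalVariant":
            traditional |= targets     # the values are traditional forms
    return simplified, traditional

SIMPLIFIED, TRADITIONAL = load_variant_sets(SAMPLE)

def classify(ch):
    """Return 'S', 'T', 'both', or None for a character assumed to be Chinese."""
    is_s, is_t = ch in SIMPLIFIED, ch in TRADITIONAL
    if is_s and is_t:
        return "both"
    return "S" if is_s else "T" if is_t else None

print([classify(c) for c in "写寫国國"])
```

Per point A, this classification only makes sense once the text is known to be Chinese: 写 is also the ordinary Japanese form, so the Japanese example above should never reach classify at all.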
2. We should include some characters we are wrongly excluding.
I responded to a question on U+30FB KATAKANA MIDDLE DOT as follows:
“This appears to be a production problem. The list marked as:
xxxx ; allowed ; inclusion
should be those characters that are in http://www.unicode.org/reports/tr39/#Identifier_Modification_Key under 'inclusion'.
Those characters should match the characters in http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Inclusion_in_Identifiers. However, the data does not currently reflect that table. The missing characters are:
U+0027 ( ' ) APOSTROPHE
U+003A ( : ) COLON
U+058A ( ֊ ) ARMENIAN HYPHEN
U+2010 ( ‐ ) HYPHEN
U+2027 ( ‧ ) HYPHENATION POINT
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FB ( ・ ) KATAKANA MIDDLE DOT
Of course, for the purpose of IDNA, the ASCII characters are determined by the base spec, but the others should be allowed. (They are also subject to confusability tests, but that's a different story.)”
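A consistency check of this kind could be automated. The following is a rough sketch only: it assumes data lines follow the "xxxx ; allowed ; inclusion" pattern quoted above (with optional "xxxx..yyyy" ranges and "#" comments, which is an assumption about the file layout), and both SAMPLE_DATA and EXPECTED are hypothetical stand-ins rather than the actual file or the full UAX #31 table. EXPECTED holds the seven characters reported missing plus two others that the sample data pretends are already present.

```python
import re

# Assumed layout: "xxxx ; status ; reason", with optional "xxxx..yyyy" ranges.
LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)\s*;\s*([\w-]+)"
)

def inclusion_set(text):
    """Collect the code points marked 'allowed ; inclusion' in the data text."""
    found = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        m = LINE.match(line)
        if not m:
            continue
        first, last, status, reason = m.groups()
        if status == "allowed" and reason == "inclusion":
            found.update(range(int(first, 16), int(last or first, 16) + 1))
    return found

def missing_from_data(expected, text):
    """Code points expected as inclusions but absent from the data, sorted."""
    return ["U+%04X" % cp for cp in sorted(expected - inclusion_set(text))]

# Hypothetical stand-in for the data file: only two inclusions are present.
SAMPLE_DATA = """\
00B7 ; allowed ; inclusion # MIDDLE DOT
2019 ; allowed ; inclusion # RIGHT SINGLE QUOTATION MARK
"""

# Hypothetical stand-in for the expected table (not the complete UAX #31 list).
EXPECTED = {0x0027, 0x003A, 0x00B7, 0x058A, 0x2010, 0x2019, 0x2027, 0x30A0, 0x30FB}

print(missing_from_data(EXPECTED, SAMPLE_DATA))
```

Running the same comparison as part of data generation would catch this kind of production problem before release.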
3. There are some other characters that should be added:
“It turns out that the following IDNs:
all fail our safety checks, despite being the normal spelling in the languages in question of their own names. The specific problem characters that were flagged are:
U+0259 LATIN SMALL LETTER SCHWA (limited-use)
U+2018 LEFT SINGLE QUOTATION MARK (non-xid)
U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE (technical)
U+05B4 HEBREW POINT HIRIQ (limited-use)”
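For reference, a report like the one above can be reproduced with a few lines of Python. This is only a sketch: the FLAGGED table contains just the four characters and reasons listed above, anything else is treated as unflagged, and the label in the usage line is a made-up example, not one of the IDNs in question. The function name flagged_characters is likewise hypothetical.

```python
import unicodedata

# Reasons copied from the four characters listed above; this is not the full
# data file, and characters outside this table are treated as unflagged.
FLAGGED = {
    "\u0259": "limited-use",
    "\u2018": "non-xid",
    "\u1EBF": "technical",
    "\u05B4": "limited-use",
}

def flagged_characters(label):
    """List (code point, name, reason) for each flagged character in a label."""
    # Normalize to NFC first so that decomposed input (e + circumflex + acute)
    # is matched against the precomposed U+1EBF entry.
    label = unicodedata.normalize("NFC", label)
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch), FLAGGED[ch])
        for ch in label
        if ch in FLAGGED
    ]

# Made-up label for illustration only.
print(flagged_characters("az\u0259rbaycan"))
```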
4. Markus: “UTS #39 should have a description for how its data is generated.” I agree.
5. We have the request:
“Is there a way to be notified of updates to xidmodifications.txt, other
than writing a script to download it every day and check the date in the
I responded: “It gets released with each version of Unicode, which is about once a year. You should also look for announcements of the beta of a new Unicode version. unicode.org has mailing lists, a blog, and tweets...”
I’m not sure what else we can do about notification. Perhaps the committee has some ideas?