Tweaks to UTS#39 data and text

L2/13-070R1

Re: Tweaks to UTS#39 data and text

From: Mark Davis

I was contacted by people from Mozilla, who had found some problems in the text and data. I propose the following:

1. We clarify the use of Simplified & Traditional tests.

I responded to a question about these as follows. We should add an explanation of this.

“A. The test can only be applied if the characters are meant to be Chinese. So "写真だけの結婚式" is Japanese, and shouldn't be tested.

B. The test for S vs T needs to be not whether the character has a T or S variant, but whether the character is an S or T variant. In any event, we need to be much clearer in that section exactly how to use Unihan.”
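Points A and B can be sketched as follows. This is a minimal illustration, not the UTS #39 algorithm: the two tiny character sets are hand-picked stand-ins for data that would be derived from Unihan, the function names are hypothetical, and treating the presence of kana as a sign that a string is not meant to be Chinese is only a heuristic assumed for this example.

```python
# Hypothetical, hand-picked samples; a real implementation would derive
# these sets from Unihan rather than hard-code them.
SIMPLIFIED_FORMS = {"写", "结"}
TRADITIONAL_FORMS = {"寫", "結"}


def is_kana(ch):
    """True for code points in the Hiragana and Katakana blocks."""
    return "\u3040" <= ch <= "\u30ff"


def simplified_traditional_check(text):
    """Return None if the test does not apply, else 'mixed' or 'consistent'."""
    # Point A: only test text that is meant to be Chinese. Kana is used
    # here as a rough signal that the text is Japanese (an assumption).
    if any(is_kana(ch) for ch in text):
        return None
    # Point B: ask whether each character IS a simplified or traditional
    # form, not whether it merely HAS a variant of the other kind.
    has_simplified = any(ch in SIMPLIFIED_FORMS for ch in text)
    has_traditional = any(ch in TRADITIONAL_FORMS for ch in text)
    return "mixed" if has_simplified and has_traditional else "consistent"


print(simplified_traditional_check("写真だけの結婚式"))
```

With these sample sets, dropping the kana guard would report the Japanese example as mixed, because 写 and 結 fall into different sets. That is the false positive point A is meant to avoid.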

2. We should include some characters we are wrongly excluding.

I responded to a question on U+30FB KATAKANA MIDDLE DOT as follows:

“This appears to be a production problem. The list marked as:

xxxx ; allowed ; inclusion

should be those characters that are in http://www.unicode.org/reports/tr39/#Identifier_Modification_Key under 'inclusion'.

Those characters should match the characters in http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Inclusion_in_Identifiers. However, they do not reflect them. The missing characters are:

U+0027 ( ' ) APOSTROPHE

U+003A ( : ) COLON

U+058A ( ֊ ) ARMENIAN HYPHEN

U+2010 ( ‐ ) HYPHEN

U+2027 ( ‧ ) HYPHENATION POINT

U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN

U+30FB ( ・ ) KATAKANA MIDDLE DOT

Of course, for the purpose of IDNA, the ASCII characters are determined by the base spec, but the others are allowed. (They are subject to confusability tests, also, but that's a different story.)”

We should also add text that makes it clear that:

Target applications may need to filter these characters. In particular, IDNs have specific requirements on characters that would exclude some of these; some other characters may be restricted on confusability grounds, notably hyphen.

3. There are some other characters that should be added:

“It turns out that the following IDNs:

http://azərbaycan.idntest (Azerbaijani)

http://o‘zbek.idntest (Uzbek)

http://tiếngviệt.idntest (Vietnamese)

http://ייִדיש.idntest (Yiddish)

all fail our safety checks, despite being the normal spelling in the languages in question of their own names. The specific problem characters that were flagged are:

U+0259 LATIN SMALL LETTER SCHWA (limited-use)

U+2018 LEFT SINGLE QUOTATION MARK (non-xid)

U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE (technical)

U+05B4 HEBREW POINT HIRIQ (limited-use)”
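As a rough cross-check of the "non-xid" annotation, the sketch below uses Python's `str.isidentifier` as a stand-in for an XID_Continue test. This is an assumption for illustration: Python's identifier rules are based on XID_Start and XID_Continue from the Unicode version bundled with the interpreter, not on the UTS #39 data file, and the function name is hypothetical.

```python
# The four labels from the IDNs above, written with escapes for the
# characters that were flagged.
LABELS = {
    "Azerbaijani": "az\u0259rbaycan",
    "Uzbek": "o\u2018zbek",
    "Vietnamese": "ti\u1ebfngvi\u1ec7t",
    "Yiddish": "\u05d9\u05d9\u05b4\u05d3\u05d9\u05e9",
}


def chars_outside_xid_continue(label):
    """List the characters that Python does not accept inside an identifier."""
    # Prefixing "a" tests each character in a non-initial position.
    return ["U+%04X" % ord(ch) for ch in label if not ("a" + ch).isidentifier()]


for language, label in LABELS.items():
    print(language, chars_outside_xid_continue(label))
```

Under this check only U+2018 is reported. The schwa, the Vietnamese letter, and the Hebrew point all pass, so their flags (limited-use, technical) come from other restrictions rather than from the XID properties.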

My recommendations:

U+2018 is a punctuation character, and not valid in IDNA2008. Those tend not to be recommended in xidmodification unless there is good reason. We should add a note about that in the text.
The right Unicode character for a letter character (as in Uzbek) is U+02BB. An ASCII apostrophe wouldn't be allowed in IDNs anyway. We should add a note about that in the text.
We probably should allow the schwa and Hebrew points (Simon says: “FWIW https://bug854041.bugzilla.mozilla.org/attachment.cgi?id=728502 includes several examples of registered domains with Hebrew names including points (e.g. http://נִקוּד.com). But anyway, the example here is Yiddish, and Yiddish orthography uses points much more than Hebrew does. See http://about.museum/idn/yiddish-language.pdf for a very detailed article about issues with IDNs in Yiddish and other languages written in Hebrew script.”)
We also should do a comparison of the cldr exemplar characters, and allow what they contain.
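The difference between the three apostrophe-like characters mentioned in these recommendations shows up in their General_Category values. A small sketch using Python's standard `unicodedata` module (the helper name `describe` is hypothetical, and the values reflect the Unicode version bundled with the interpreter):

```python
import unicodedata

# ASCII apostrophe, left single quotation mark, and the modifier letter
# recommended above for Uzbek.
APOSTROPHE_LIKE = ["\u0027", "\u2018", "\u02bb"]


def describe(ch):
    """Format a character as 'U+XXXX <General_Category> <name>'."""
    return "U+%04X %s %s" % (ord(ch), unicodedata.category(ch), unicodedata.name(ch))


for ch in APOSTROPHE_LIKE:
    print(describe(ch))
```

U+0027 and U+2018 are punctuation (Po and Pi), while U+02BB is a modifier letter (Lm), which is why it is the one that behaves like a letter in identifiers.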

4. Markus: “UTS #39 should have a description for how its data is generated.” I agree.

5. We have the request:

“Is there a way to be notified of updates to xidmodifications.txt, other than writing a script to download it every day and check the date in the comment?”

I responded: “It gets released with each version of Unicode, which is about once a year. You should also look for announcements of the beta of a new Unicode version. unicode.org has mailing lists, a blog, and tweets...”

I’m not sure what else we can do about notification. Perhaps the committee has some ideas?

6. There are 4 other characters that are in IDNA2008, but not in the inclusion list.

3007; PVALID # IDEOGRAPHIC NUMBER ZERO

Of these, U+3007 is already in the recommended list (it is in XID_Continue). The three others are listed below.

06FD; PVALID # ARABIC SIGN SINDHI AMPERSAND

06FE; PVALID # ARABIC SIGN SINDHI POSTPOSITION MEN

0375; CONTEXTO # GREEK LOWER NUMERAL SIGN (KERAIA)

These are allowed in #46, but not in #39, because they are not XID_Continue (they are General_Category=Other_Symbol and General_Category=Modifier_Symbol). These are bizarre additions to IDNA2008, but for consistency I propose that we broaden the definition of the ‘inclusion’ value in #39 to add these three characters, and document the reason: that it is for compatibility with IDNA2008 and consistency with #46. That would mean adding to the data file as:

06FD ; allowed ; inclusion # ARABIC SIGN SINDHI AMPERSAND

06FE ; allowed ; inclusion # ARABIC SIGN SINDHI POSTPOSI...

0375 ; allowed ; inclusion # GREEK LOWER NUMERAL SIGN...
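The proposed data-file lines and the General_Category values cited for these three characters can be checked mechanically. The sketch below assumes the field layout shown in the lines above (code point, status, reason, separated by semicolons, with a trailing # comment) and additionally assumes that ranges, if present, are written as XXXX..YYYY. The names `Entry` and `parse_line` are hypothetical, and categories come from Python's `unicodedata` module, so they reflect the Unicode version bundled with the interpreter.

```python
import unicodedata
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    first: int   # first code point of the range
    last: int    # last code point (same as first for a single code point)
    status: str  # e.g. "allowed"
    reason: str  # e.g. "inclusion"


def parse_line(line: str) -> Optional[Entry]:
    """Parse one data line; return None for blank or comment-only lines."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d: %r" % (len(fields), line))
    code_points, status, reason = fields
    first, _, last = code_points.partition("..")  # assumed range syntax
    low = int(first, 16)
    high = int(last, 16) if last else low
    return Entry(low, high, status, reason)


# The three proposed lines, copied as they appear above.
PROPOSED_LINES = [
    "06FD ; allowed ; inclusion # ARABIC SIGN SINDHI AMPERSAND",
    "06FE ; allowed ; inclusion # ARABIC SIGN SINDHI POSTPOSI...",
    "0375 ; allowed ; inclusion # GREEK LOWER NUMERAL SIGN...",
]

for line in PROPOSED_LINES:
    entry = parse_line(line)
    category = unicodedata.category(chr(entry.first))
    print("U+%04X %s %s gc=%s" % (entry.first, entry.status, entry.reason, category))
```

The two Arabic signs report `So` (Other_Symbol) and the Greek sign reports `Sk` (Modifier_Symbol), matching the categories cited above.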

7. We should have a special review of ASCII non-alphanumerics for confusables. We have focused on alphanumerics, but these characters are often used as syntax characters, so the confusables are especially interesting. For example, possibilities to review for # and + are:

U+0023 ( # ) NUMBER SIGN

U+FE5F ( ﹟ ) SMALL NUMBER SIGN

U+FF03 ( ＃ ) FULLWIDTH NUMBER SIGN

U+266F ( ♯ ) MUSIC SHARP SIGN

U+002B ( + ) PLUS SIGN

U+1429 ( ᐩ ) CANADIAN SYLLABICS FINAL PLUS

U+207A ( ⁺ ) SUPERSCRIPT PLUS SIGN

U+208A ( ₊ ) SUBSCRIPT PLUS SIGN

U+FE62 ( ﹢ ) SMALL PLUS SIGN

U+FF0B ( ＋ ) FULLWIDTH PLUS SIGN

and a bit further afield:

U+2795 ( ➕ ) HEAVY PLUS SIGN

U+2629 ( ☩ ) CROSS OF JERUSALEM

U+16ED ( ᛭ ) RUNIC CROSS PUNCTUATION

U+2719 ( ✙ ) OUTLINED GREEK CROSS

U+271A ( ✚ ) HEAVY GREEK CROSS

U+271B ( ✛ ) OPEN CENTRE CROSS

U+1F542 ( 🕂 ) CROSS POMMEE

I suggest that Michel and I get an action to review such characters and add others.
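As a starting point for such a review, it may help to separate the candidates that already fold to the ASCII character under NFKC from those that only confusables data can catch. A minimal sketch using Python's standard `unicodedata` module (the function name and the grouping are assumptions for illustration, and the candidate lists are limited to the first two groups above):

```python
import unicodedata

# Candidates listed above for NUMBER SIGN and PLUS SIGN.
CANDIDATES = {
    "#": ["\ufe5f", "\uff03", "\u266f"],
    "+": ["\u1429", "\u207a", "\u208a", "\ufe62", "\uff0b"],
}


def split_by_nfkc(target, candidates):
    """Split candidates into those NFKC maps to `target` and the rest."""
    folded, remaining = [], []
    for ch in candidates:
        bucket = folded if unicodedata.normalize("NFKC", ch) == target else remaining
        bucket.append("U+%04X" % ord(ch))
    return folded, remaining


for target, candidates in CANDIDATES.items():
    folded, remaining = split_by_nfkc(target, candidates)
    print(target, "folded by NFKC:", folded, "needs confusables data:", remaining)
```

The small, fullwidth, superscript, and subscript forms are compatibility variants and fold to the ASCII character. U+266F and U+1429 have no decomposition, so they can only be caught by confusables data.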