Tweaks to UTS#39 data and text

L2/13-070

Re: Tweaks to UTS#39 data and text

From: Mark Davis

I was contacted by people from Mozilla, who had found some problems in the text and data. I propose the following:

1. We clarify the use of Simplified & Traditional tests.

I responded to a question about these as follows. We should add an explanation of this.

“A. The test can only be applied if the characters are meant to be Chinese. So "写真だけの結婚式" is Japanese, and shouldn't be tested.

B. The test for S vs T needs to be not whether the character has a T or S variant, but whether the character is an S or T variant. In any event, we need to be much clearer in that section exactly how to use Unihan.”
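Points A and B can be sketched roughly as follows. This is only an illustration under stated assumptions, not the UTS #39 procedure: the two sets are tiny hand-picked samples standing in for data that would have to be derived from Unihan, the function name is invented for the example, and the caller is assumed to already know whether the text is meant to be Chinese.

```python
# Hypothetical sample data: characters that ARE simplified forms or ARE
# traditional forms in a Chinese context. A real implementation would derive
# these sets from Unihan; the single pair below is for illustration only.
SIMPLIFIED_FORMS = {"\u5199"}   # 写, simplified form of 寫
TRADITIONAL_FORMS = {"\u5BEB"}  # 寫

def mixes_simplified_and_traditional(text: str, is_chinese: bool) -> bool:
    """Return True if Chinese text contains both a simplified and a traditional form."""
    if not is_chinese:
        # Point A: the test does not apply to text that is not meant to be Chinese.
        return False
    # Point B: ask whether each character IS an S or T form,
    # not whether it merely HAS a variant of the other kind.
    has_simplified = any(ch in SIMPLIFIED_FORMS for ch in text)
    has_traditional = any(ch in TRADITIONAL_FORMS for ch in text)
    return has_simplified and has_traditional

if __name__ == "__main__":
    print(mixes_simplified_and_traditional("\u5199\u5BEB", is_chinese=True))
    print(mixes_simplified_and_traditional("写真だけの結婚式", is_chinese=False))
```

The Japanese example from point A contains 写, which is why the language decision has to come before the character test.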

2. We should include some characters we are wrongly excluding.

I responded to a question on U+30FB KATAKANA MIDDLE DOT as follows:

“This appears to be a production problem. The list marked as:

xxxx ; allowed ; inclusion

should be those characters that are in http://www.unicode.org/reports/tr39/#Identifier_Modification_Key under 'inclusion'.

Those characters should match the characters in http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Inclusion_in_Identifiers. However, they do not reflect them. The missing characters are:

U+0027 ( ' ) APOSTROPHE

U+003A ( : ) COLON

U+058A ( ֊ ) ARMENIAN HYPHEN

U+2010 ( ‐ ) HYPHEN

U+2027 ( ‧ ) HYPHENATION POINT

U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN

U+30FB ( ・ ) KATAKANA MIDDLE DOT

Of course, for the purpose of IDNA, the ASCII characters are determined by the base spec, but the others are allowed. (They are subject to confusability tests, also, but that's a different story.)”
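A check for this kind of production problem could look something like the sketch below. It assumes the data lines follow the "xxxx ; allowed ; inclusion" shape quoted above, that ranges are written as "XXXX..YYYY", and that "#" starts a comment; those details, the function names, and the sample data in the usage section are assumptions made for the example, not a description of the actual file.

```python
import unicodedata

# Code points the memo lists as missing from the 'inclusion' set.
EXPECTED_INCLUSION = [0x0027, 0x003A, 0x058A, 0x2010, 0x2027, 0x30A0, 0x30FB]

def parse_inclusion(data: str) -> set:
    """Collect code points from lines of the form 'xxxx ; allowed ; inclusion'."""
    found = set()
    for line in data.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments (assumed syntax)
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3 or fields[1] != "allowed" or fields[2] != "inclusion":
            continue
        start, _, end = fields[0].partition("..")  # single code point or assumed range
        found.update(range(int(start, 16), int(end or start, 16) + 1))
    return found

def missing_inclusions(data: str) -> list:
    """List the expected inclusion characters that the data does not mark as allowed."""
    present = parse_inclusion(data)
    return [
        f"U+{cp:04X} {unicodedata.name(chr(cp))}"
        for cp in EXPECTED_INCLUSION
        if cp not in present
    ]

if __name__ == "__main__":
    sample = "00B7 ; allowed ; inclusion # made-up sample line\n"
    for entry in missing_inclusions(sample):
        print(entry)
```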

3. There are some other characters that should be added:

“It turns out that the following IDNs:

http://azərbaycan.idntest (Azerbaijani)

http://o‘zbek.idntest (Uzbek)

http://tiếngviệt.idntest (Vietnamese)

http://ייִדיש.idntest (Yiddish)

all fail our safety checks, despite being the normal spelling in the languages in question of their own names. The specific problem characters that were flagged are:

U+0259 LATIN SMALL LETTER SCHWA (limited-use)

U+2018 LEFT SINGLE QUOTATION MARK (non-xid)

U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE (technical)

U+05B4 HEBREW POINT HIRIQ (limited-use)”
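To reproduce which characters in a label trip the checks, a small helper along these lines may be useful. It is only a sketch: the table holds just the four code points and reasons quoted above rather than the real data file, and the function name is invented for the example.

```python
import unicodedata

# The four flagged code points and the reasons quoted in the memo.
FLAGGED = {
    0x0259: "limited-use",
    0x2018: "non-xid",
    0x1EBF: "technical",
    0x05B4: "limited-use",
}

def flagged_characters(label: str) -> list:
    """Return (code point, name, reason) for each flagged character in an NFC-normalized label."""
    label = unicodedata.normalize("NFC", label)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), FLAGGED[ord(ch)])
        for ch in label
        if ord(ch) in FLAGGED
    ]

if __name__ == "__main__":
    labels = [
        "az\u0259rbaycan",
        "o\u2018zbek",
        "ti\u1EBFngvi\u1EC7t",
        "\u05D9\u05D9\u05B4\u05D3\u05D9\u05E9",
    ]
    for label in labels:
        print(label, flagged_characters(label))
```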

My comments:

U+2018 is a punctuation character, and not valid in IDNA2008. Such characters tend not to be recommended in xidmodifications.txt unless there is good reason. We should add a note about that in the text.
The right Unicode character for this letter (as in Uzbek) is U+02BB MODIFIER LETTER TURNED COMMA. An ASCII apostrophe wouldn't be allowed in IDNs anyway. We should add a note about that in the text.
We probably should allow the schwa and Hebrew points (Simon says: “FWIW https://bug854041.bugzilla.mozilla.org/attachment.cgi?id=728502 includes several examples of registered domains with Hebrew names including points (e.g. http://נִקוּד.com). But anyway, the example here is Yiddish, and Yiddish orthography uses points much more than Hebrew does. See http://about.museum/idn/yiddish-language.pdf for a very detailed article about issues with IDNs in Yiddish and other languages written in Hebrew script.”).
We also should do a comparison of the CLDR exemplar characters, and allow what they contain.
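The difference between U+2018 and U+02BB in the Uzbek example can be seen with a rough stand-in for the "non-xid" test. The sketch below uses Python's built-in str.isidentifier(), which is based on XID_Start and XID_Continue; it is not the UTS #39 data, the helper name is invented for the example, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def xid_like(label: str) -> bool:
    """Rough proxy: True if the label is built only from XID_Start/XID_Continue characters."""
    return label.isidentifier()

if __name__ == "__main__":
    # Candidate characters for the Uzbek letter: curly quote, modifier letter, ASCII apostrophe.
    for cp in (0x2018, 0x02BB, 0x0027):
        ch = chr(cp)
        label = "o" + ch + "zbek"
        print(
            f"U+{cp:04X} {unicodedata.name(ch)}: "
            f"category {unicodedata.category(ch)}, xid-like label: {xid_like(label)}"
        )
```

Only the spelling with U+02BB, a modifier letter, passes this proxy; the two punctuation characters do not.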

4. Markus: “UTS #39 should have a description for how its data is generated.” I agree.

5. We have the request:

“Is there a way to be notified of updates to xidmodifications.txt, other than writing a script to download it every day and check the date in the comment?”

I responded: “It gets released with each version of Unicode, which is about once a year. You should also look for announcements of the beta of a new Unicode version. unicode.org has mailing lists, a blog, and tweets...”

I’m not sure what else we can do about notification. Perhaps the committee has some ideas?
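For reference, the date check that the requester describes could be as small as the sketch below. It assumes the file header carries a comment line of the form "# Date: YYYY-MM-DD, ..."; that layout, the function names, and the sample header in the usage section are assumptions for illustration, and downloading the file is left out.

```python
import re
from typing import Optional

# Assumed header layout: a comment line such as "# Date: 2013-01-29, 23:53:31 GMT".
DATE_LINE = re.compile(r"^#\s*Date:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE)

def header_date(text: str) -> Optional[str]:
    """Return the ISO date from the header comment, or None if no such line exists."""
    match = DATE_LINE.search(text)
    return match.group(1) if match else None

def is_newer(text: str, last_seen: Optional[str]) -> bool:
    """True if the file's header date is later than the last date recorded by the caller."""
    current = header_date(text)
    if current is None:
        return False
    # ISO dates compare correctly as plain strings.
    return last_seen is None or current > last_seen

if __name__ == "__main__":
    sample = "# xidmodifications.txt\n# Date: 2013-01-29, 23:53:31 GMT\n"  # made-up header
    print(header_date(sample), is_newer(sample, "2012-09-01"))
```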