UAX 31 Changes

L2/09-109
From: Mark Davis
Date: 2009-3-28

I suggest the following changes in UAX 31.

1. Fix ambiguous variables

There are suggested rules for using ZWJ and ZWNJ in http://unicode.org/draft/reports/tr31/tr31.html#Layout_and_Format_Control_Characters

Those rules use the variable $L for two different entities: Left Joining, and Letter (for Indic). Although the two uses occur in separate contexts, it would be much clearer to avoid the overlap. There are a few possible alternatives; I suggest:
 • For the Joining specifications of ZWJ/ZWNJ, change $L, $R to $LJ, $RJ

2. Add Default Ignorable Code Points to Table 4 Candidate Characters for Exclusion from Identifiers

In http://unicode.org/draft/reports/tr31/tr31.html#Specific_Character_Adjustments,

add a row:

 [:Default_Ignorable_Code_Point=True:]  Default Ignorable Code Points (See Section 2.3)

[Rationale: we already say that DIs should be excluded, with certain exceptions in Section 2.3, which has a lot of detail on the topic. This just makes that relationship more visible.]
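
To illustrate why the extra row is useful, here is a minimal Python sketch that reports Default Ignorable Code Points found in an identifier. Python's standard library does not expose Default_Ignorable_Code_Point, so the set below is a small hand-picked subset chosen for illustration (an assumption, not the full property), and the function name is hypothetical. The members were picked because their general categories (Mn, Lo) should let them pass default identifier syntax even though they are default ignorable.

```python
# Hand-picked SUBSET of Default_Ignorable_Code_Point, for illustration only.
# A real implementation would load the full property from the Unicode
# Character Database (DerivedCoreProperties.txt).
SAMPLE_DEFAULT_IGNORABLE = frozenset({
    0x034F,                   # COMBINING GRAPHEME JOINER
    0x115F,                   # HANGUL CHOSEONG FILLER
    0x1160,                   # HANGUL JUNGSEONG FILLER
    *range(0xFE00, 0xFE10),   # VARIATION SELECTOR-1 .. VARIATION SELECTOR-16
})


def find_default_ignorables(identifier, ignorable=SAMPLE_DEFAULT_IGNORABLE):
    """Return (index, 'U+XXXX') pairs for each sampled default ignorable."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(identifier)
        if ord(ch) in ignorable
    ]


if __name__ == "__main__":
    # "ab" + VARIATION SELECTOR-16 + "c" looks like "abc" in most displays.
    print(find_default_ignorables("ab\ufe0fc"))
```

This sketch does not model the exceptions for ZWJ and ZWNJ described in Section 2.3; it only flags occurrences.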

3. Add Unicode 5.2 Characters to Table 3/4 (Candidates for Inclusion/Exclusion)

Add to Table 4 (Exclusion) the following scripts (this is a rough cut, so feedback is welcome):

Archaic / Historic
 • Old Turkic
 • Old South Arabian
 • Imperial Aramaic
 • Inscriptional Parthian
 • Inscriptional Pahlavi
 • Avestan
 • Egyptian Hieroglyphs
 • Javanese
Limited Use
 • Samaritan
 • Kaithi
 • Tai Viet
 • Bamum
 • Lisu

Add the following to Table 5 (Recommended Scripts):
 • Meetei Mayek
 • Tai Tham

4. Add U+0640 ( ‎ـ‎ ) ARABIC TATWEEL as a candidate character for exclusion.


We have the following tables in http://unicode.org/draft/reports/tr31/tr31.html#Specific_Character_Adjustments
 • Table 3. Candidate Characters for Inclusion in Identifiers
 • Table 4. Candidate Characters for Exclusion from Identifiers
A. I suggest adding the following row to Table 4:

[\u0640]   Arabic Tatweel

B. Alternatively, one could break Table 4 into two tables:

Table 4a. Candidate Characters Identified by Code Point for Exclusion from Identifiers
Containing only Tatweel

Table 4b. Candidate Characters Identified by Property for Exclusion from Identifiers
Containing the current Table 4 contents

(Ken favors a two-table solution; I think it is simpler with one.)
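
As a rough illustration of what the exclusion means for an implementation, the following Python sketch layers a code point exclusion set on top of default identifier syntax. It assumes that `str.isidentifier` is an acceptable stand-in for the XID_Start/XID_Continue definition (it reflects whatever Unicode version is bundled with the interpreter); `EXCLUDED_CODE_POINTS` and `is_allowed_identifier` are hypothetical names, not part of UAX 31.

```python
# Assumption: a one-entry exclusion set standing in for the proposed Table 4 row.
EXCLUDED_CODE_POINTS = frozenset({0x0640})  # ARABIC TATWEEL


def is_allowed_identifier(text, excluded=EXCLUDED_CODE_POINTS):
    """Accept default-syntax identifiers that contain no excluded code point."""
    if not text.isidentifier():
        return False
    return all(ord(ch) not in excluded for ch in text)


if __name__ == "__main__":
    with_tatweel = "\u0643\u0640\u0628"   # KAF, TATWEEL, BEH
    without_tatweel = "\u0643\u0628"      # KAF, BEH
    print(with_tatweel.isidentifier())             # default syntax alone
    print(is_allowed_identifier(with_tatweel))     # with the exclusion applied
    print(is_allowed_identifier(without_tatweel))
```

Tatweel is a modifier letter, so default identifier syntax accepts it; the exclusion has to be applied as a separate step, whichever table layout is chosen.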

5. Add Characters from IDNA Tables Document

The IDNA tables document (draft) contains certain exceptions that we should review, in http://tools.ietf.org/html/draft-ietf-idnabis-tables#section-2.6.

The following characters are not in the Unicode identifier definition XID_Continue (after subtracting characters that are affected by case folding and NFKC), nor are they in the Candidates for Inclusion.

Greek And Coptic - Numeral signs
U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN

Arabic - Signs for Sindhi
U+06FD ( ‎۽‎ ) ARABIC SIGN SINDHI AMPERSAND
U+06FE ( ‎۾‎ ) ARABIC SIGN SINDHI POSTPOSITION MEN

Tibetan - Marks and signs
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG

Katakana - Conjunction and length marks
U+30FB ( ・ ) KATAKANA MIDDLE DOT

Of these, I'd recommend that we add U+30FB ( ・ ) KATAKANA MIDDLE DOT to Table 3 (Candidate Characters for Inclusion in Identifiers), since it serves a function somewhat like an underbar. The others have gotten into the IDNA specification (draft), but there doesn't seem to be any compelling rationale for that. However, others may know more about them and may present good reasons for their inclusion in UAX #31.

Note that the following character is part of Pattern_Syntax, and thus not part of XID_Continue. Pattern_Syntax is immutable, and required to be disjoint from identifiers, and yet this character was added in that range, which was probably a mistake.

Supplemental Punctuation - Medievalist punctuation
U+2E2F ( ⸯ ) VERTICAL TILDE

Of the characters that Unicode has, and IDNA doesn't, I don't see any need to make any changes. Some of them are principled differences, like the omission of connector punctuation, and others are not, like the omission of Hangul Jamo.

5.1 Background

For completeness, the following lists the exceptions in the 05 version of that document, organized by type.

*PVALID: // would otherwise have been DISALLOWED
  00DF; PVALID   # LATIN SMALL LETTER SHARP S
  03C2; PVALID   # GREEK SMALL LETTER FINAL SIGMA
  06FD; PVALID   # ARABIC SIGN SINDHI AMPERSAND
  06FE; PVALID   # ARABIC SIGN SINDHI POSTPOSITION MEN
  0F0B; PVALID   # TIBETAN MARK INTERSYLLABIC TSHEG
  3007; PVALID   # IDEOGRAPHIC NUMBER ZERO

*CONTEXTO: // would otherwise have been DISALLOWED
  00B7; CONTEXTO  # MIDDLE DOT
  0375; CONTEXTO  # GREEK LOWER NUMERAL SIGN (KERAIA)
  05F3; CONTEXTO  # HEBREW PUNCTUATION GERESH
  05F4; CONTEXTO  # HEBREW PUNCTUATION GERSHAYIM
  30FB; CONTEXTO  # KATAKANA MIDDLE DOT

*CONTEXTO: // would otherwise have been PVALID
  002D; CONTEXTO  # HYPHEN-MINUS
  02B9; CONTEXTO  # MODIFIER LETTER PRIME
  0660; CONTEXTO  # ARABIC-INDIC DIGIT ZERO
  0661; CONTEXTO  # ARABIC-INDIC DIGIT ONE
  0662; CONTEXTO  # ARABIC-INDIC DIGIT TWO
  0663; CONTEXTO  # ARABIC-INDIC DIGIT THREE
  0664; CONTEXTO  # ARABIC-INDIC DIGIT FOUR
  0665; CONTEXTO  # ARABIC-INDIC DIGIT FIVE
  0666; CONTEXTO  # ARABIC-INDIC DIGIT SIX
  0667; CONTEXTO  # ARABIC-INDIC DIGIT SEVEN
  0668; CONTEXTO  # ARABIC-INDIC DIGIT EIGHT
  0669; CONTEXTO  # ARABIC-INDIC DIGIT NINE
  06F0; CONTEXTO  # EXTENDED ARABIC-INDIC DIGIT ZERO
  06F1; CONTEXTO  # EXTENDED ARABIC-INDIC DIGIT ONE
  06F2; CONTEXTO  # EXTENDED ARABIC-INDIC DIGIT TWO
  06F3; CONTEXTO  # EXTENDED ARABIC-INDIC DIGIT THREE
  06F4; CONTEXTO  # EXTENDED ARABIC-INDIC DIGIT FOUR
  06F5; CONTEXTO  # EXTENDED ARABIC-INDIC DIGIT FIVE
  06F6; CONTEXTO  # EXTENDED ARABIC-INDIC DIGIT SIX
  06F7; CONTEXTO  # EXTENDED ARABIC-INDIC DIGIT SEVEN
  06F8; CONTEXTO  # EXTENDED ARABIC-INDIC DIGIT EIGHT
  06F9; CONTEXTO  # EXTENDED ARABIC-INDIC DIGIT NINE
  0483; CONTEXTO  # COMBINING CYRILLIC TITLO
  3005; CONTEXTO  # IDEOGRAPHIC ITERATION MARK
  303B; CONTEXTO  # VERTICAL IDEOGRAPHIC ITERATION MARK

*DISALLOWED: // would otherwise have been PVALID
  302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
  302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
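
For anyone who wants to work with the listing above programmatically, here is a minimal Python sketch that parses lines of the form `code point; status # name` into a dictionary. The regular expression and the function name `parse_exceptions` are assumptions made for illustration; they follow the layout shown above and are not taken from the IDNA draft itself.

```python
import re

# One data line looks like: "  00DF; PVALID   # LATIN SMALL LETTER SHARP S"
LINE_RE = re.compile(r"^\s*([0-9A-Fa-f]{4,6})\s*;\s*(\w+)\s*#\s*(.+?)\s*$")


def parse_exceptions(text):
    """Map code point (int) -> (status, name); non-data lines are skipped."""
    table = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # group headers such as "*PVALID: // ..." and blank lines
        code_point, status, name = match.groups()
        table[int(code_point, 16)] = (status, name)
    return table


SAMPLE = """\
*PVALID: // would otherwise have been DISALLOWED
  00DF; PVALID   # LATIN SMALL LETTER SHARP S

*CONTEXTO: // would otherwise have been DISALLOWED
  30FB; CONTEXTO  # KATAKANA MIDDLE DOT

*DISALLOWED: // would otherwise have been PVALID
  302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
"""

if __name__ == "__main__":
    for code_point, (status, name) in sorted(parse_exceptions(SAMPLE).items()):
        print(f"U+{code_point:04X} {status:<10} {name}")
```

The sketch keeps only the status on each data line; the "would otherwise have been" information in the group headers is discarded.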

5.2 Characters in IDNA draft

Here is the current set, as of the current draft and Unicode 5.1. You can paste it into http://unicode.org/cldr/utility/list-unicodeset.jsp to explore it, or to compare it against XID_Continue.

[\-0-9a-z·ß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĵķĸĺļľłńņňŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżžƀƃƅƈƌƍƒƕƙ -ƛƞơƣƥƨƪƫƭưƴƶƹ-ƻƽ-ǃǎǐǒǔǖǘǚǜǝǟǡǣǥǧǩǫǭǯǰǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿɀɂɇɉɋɍɏ -ʯʹ-ˁˆ-ˑˬˮ̀-̿͂͆-͎͐-ͯͱͳ͵ͷͻ-ͽΐά-ώϗϙϛϝϟϡϣϥϧϩϫϭϯϳϸϻϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁ҃-҇ҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣՙա -ֆ֑-ׇֽֿׁׂׅׄא-תװ-״ؐ-ؚء-ٞ٠-٩ٮ-ٴٹ-ۓە-ۜ۟-۪ۨ-ۿܐ-݊ݍ-ޱ߀-ߵߺँ-ह़-्ॐ-॔ॠ-ॣ०-९ॱॲॻ-ॿঁ- ঃঅ-ঌএঐও-নপ-রলশ-হ়-ৄেৈো-ৎৗৠ-ৣ০-ৱਁ-ਃਅ-ਊਏਐਓ-ਨਪ-ਰਲਵਸਹ਼ਾ-ੂੇੈੋ-੍ੑੜ੦-ੵઁ-ઃઅ-ઍએ-ઑઓ -નપ-રલળવ-હ઼-ૅે-ૉો-્ૐૠ-ૣ૦-૯ଁ-ଃଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହ଼-ୄେୈୋ-୍ୖୗୟ-ୣ୦-୯ୱஂஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந -பம-ஹா-ூெ-ைொ-்ௐௗ௦-௯ఁ-ఃఅ-ఌఎ-ఐఒ-నప-ళవ-హఽ-ౄె-ైొ-్ౕౖౘౙౠ-ౣ౦-౯ಂಃಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹ಼ -ೄೆ-ೈೊ-್ೕೖೞೠ-ೣ೦-೯ംഃഅ-ഌഎ-ഐഒ-നപ-ഹഽ-ൄെ-ൈൊ-്ൗൠ-ൣ൦-൯ൺ-ൿංඃඅ-ඖක-නඳ-රලව-ෆ්ා-ුූෘ-ෟෲෳ ก-าิ-ฺเ-๎๐-๙ກຂຄງຈຊຍດ-ທນ-ຟມ-ຣລວສຫອ-າິ-ູົ-ຽເ-ໄໆ່-ໍ໐-໙ༀ་༘༙༠-༩༹༵༷༾-གང-ཇཉ-ཌཎ-དན -བམ-ཛཝ-ཨཪ-ཬཱིེུ-ྀྂ-྄྆-ྋྐ-ྒྔ-ྗྙ-ྜྞ-ྡྣ-ྦྨ-ྫྭ-ྸྺ-ྼ࿆က-၉ၐ-႙ა-ჺሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ -ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚ፟ᎀ-ᎏᎠ-Ᏼᐁ-ᙬᙯ-ᙶᚁ-ᚚᚠ-ᛪᜀ-ᜌᜎ-᜔ᜠ-᜴ᝀ-ᝓᝠ-ᝬᝮ-ᝰᝲᝳក-ឳា-៓ៗៜ៝០ -៩᠐-᠙ᠠ-ᡷᢀ-ᢪᤀ-ᤜᤠ-ᤫᤰ-᤻᥆-ᥭᥰ-ᥴᦀ-ᦩᦰ-ᧉ᧐-᧙ᨀ-ᨛᬀ-ᭋ᭐-᭙᭫-᭳ᮀ-᮪ᮮ-᮹ᰀ-᰷᱀-᱉ᱍ-ᱽᴀ-ᴫᴯᴻᵎᵫ-ᵷᵹ- ᶚ᷀-᷿ᷦ᷾ḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ -ẙẜẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰὲὴὶὸὺὼᾰᾱᾶῆῐ -ῒῖῗῠ-ῢῤ-ῧῶ‌‍ⅎↄⰰ-ⱞⱡⱥⱦⱨⱪⱬⱱⱳⱴⱶ-ⱻⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣⳤⴀ -ⴥⴰ-ⵥⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⷠ-ⷿⸯ々-〇〪-〭〱-〵〻〼ぁ-ゖ゙゚ゝゞァ-ヾㄅ-ㄭㆠ-ㆷㇰ-ㇿ㐀-䶵一-鿃ꀀ-ꒌꔀ-ꘌꘐ-ꘫꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙣꙥꙧꙩꙫꙭ-꙯꙼꙽ꙿꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꜗ-ꜟꜣꜥꜧꜩꜫꜭꜯ- ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯꝱ-ꝸꝺꝼꝿꞁꞃꞅꞇꞈꞌꟻ-ꠧꡀ-ꡳꢀ-꣄꣐-꣙꤀-꤭ꤰ-꥓ꨀ-ꨶꩀ-ꩍ꩐-꩙가 -힣﨎﨏﨑﨓﨔﨟﨡﨣﨤﨧-﨩ﬞ︠-︦ﹳ𐀁-𐀋𐀍-𐀦𐀨-𐀺𐀼𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐇽𐊀-𐊜𐊠-𐋐𐌀-𐌞𐌰 -𐍀𐍂-𐍉𐎀-𐎝𐎠-𐏃𐏈-𐏏𐐨-𐒝𐒠-𐒩𐠀-𐠅𐠈𐠊-𐠵𐠷𐠸𐠼𐠿𐤀-𐤕𐤠-𐤹𐨀-𐨃𐨅𐨆𐨌-𐨓𐨕-𐨗𐨙-𐨳𐨸-𐨿𐨺𒀀-𒍮𠀀-𪛖]
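
The comparison against XID_Continue can also be approximated offline. The Python sketch below assumes that `str.isidentifier` applied to `"a" + ch` is an acceptable proxy for XID_Continue membership; the result reflects the Unicode version bundled with the interpreter, not Unicode 5.1, so it may differ from the data discussed here. The helper names are hypothetical.

```python
def is_xid_continue(ch):
    """Approximate XID_Continue: can ch follow a valid identifier start?"""
    return ("a" + ch).isidentifier()


def not_in_xid_continue(chars):
    """List the characters of `chars` that fail the approximation, as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in chars if not is_xid_continue(ch)]


if __name__ == "__main__":
    # HYPHEN-MINUS, DIGIT NINE, SHARP S, GREEK LOWER NUMERAL SIGN
    sample = "-9\u00df\u0375"
    print(not_in_xid_continue(sample))
```

Running the same function over the whole set above, one character at a time, gives a quick list of candidates to review by hand.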