L2/06-346
Source: Mark Davis
Date: 2006-10-23
Subject: IDNA proposal
=======

The document draft-faltstrom-idnabis-tables-00.txt drastically reduces which characters can be used in identifiers in IDNA and probably other IETF protocols. Some of these restrictions we are on record as agreeing with, such as disallowing symbols and punctuation. Others I think are very problematic.

To allow for easier assessment of this document, I generated a comparison at:

http://www.macchiato.com/idna/idnaComparison.txt

This compares the proposal to the Unicode recommendations.

Caveats:

I found draft-faltstrom-idnabis-tables-00.txt very difficult to interpret. It is not at all easy to parse (it has line breaks in the middle of lines). However, Patrik kindly supplied me with an HTML version. Dumping that into plaintext allowed for parsing (with a few kludges to get around section headings, headers, etc.) of section 4, which has the lists of characters.

Even then, the document is quite difficult to interpret. The "Include?" field does not have clearly defined values. "Maybe" is used in one place for "codepoints that in general should be allowed to be used in IDN", and in another for "codepoints that in general should not be allowed to use in IDN, but some scripts do have codepoints in this class that should be carefully considered as exceptions". And the distinction between "Maybe", "Possibly", and "Possibly not" escapes me.

The lists in section 2.1 appear important, but are not easily parsable, and appear to be overridden by section 4 ("If, instead, we examine blocks of codepoints and the individual codepoints, retaining the decisions made based on the classes above, we get the following result."). So I'm just including the section 4 values, even though there are obvious omissions, like almost 85,000 characters!
The items of particular concern are the characters that this list disallows (or possibly disallows) but that are allowed in Unicode identifiers, including those that have no security restrictions in Unicode. I would urge people to review these characters to see which languages are being disallowed, and to discuss via email.

I also presume that the data file will undergo many revisions, since it is in extremely rough shape. Unfortunately, few reasons are given for excluding or including particular characters; given the impact on different language communities, one can only hope that reasons will be supplied in future versions.

Because I expect revisions, I'll try to keep the data more or less up to date, depending on just how hard it is to parse the file. Here is a summary (current data):

# Unicode Identifier Status
#   N   Disallowed - never allowed - not LMN: Letter, Mark, or Number
#   I   Allowed only on input (if then) - not case-folded, not NFKC
#   R   Restricted - only allow after careful consideration
#   U   Unrestricted - allowed always
#
# IDNA Proposal Identifier Status
#   As per file, plus:
#   Exclude*   Not mentioned in section 4.

# Summary
# Value: I & Exclude        Total: 23
# Value: I & Exclude*       Total: 3,690
# Value: I & Include        Total: 41
# Value: I & Input          Total: 418
# Value: I & Maybe          Total: 6
# Value: I & Possibly_not   Total: 45
# Value: N & Exclude        Total: 286
# Value: N & Exclude*       Total: 41
# Value: N & Maybe          Total: 3
# Value: N & Possibly_not   Total: 3
# Value: R & Exclude        Total: 27
# Value: R & Exclude*       Total: 2,757
# Value: R & Include        Total: 99
# Value: R & Maybe          Total: 9
# Value: R & Possibly_not   Total: 16
# Value: U & Exclude        Total: 134
# Value: U & Exclude*       Total: 84,872
# Value: U & Include        Total: 400
# Value: U & Maybe          Total: 1,125
# Value: U & Possibly_not   Total: 394
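
For anyone who wants to roll up the summary figures, here is a minimal Python sketch. It parses summary lines in the format shown above and totals, for each Unicode status, the characters that the proposal marks as Exclude or Exclude*. The names `parse_summary` and `excluded_by_unicode_status` and the regular expression are assumptions made for illustration only; this is not the tool that generated the comparison file.

```python
import re
from collections import defaultdict

# Summary lines copied from the table above (current data).
SUMMARY = """\
# Value: I & Exclude Total: 23
# Value: I & Exclude* Total: 3,690
# Value: I & Include Total: 41
# Value: I & Input Total: 418
# Value: I & Maybe Total: 6
# Value: I & Possibly_not Total: 45
# Value: N & Exclude Total: 286
# Value: N & Exclude* Total: 41
# Value: N & Maybe Total: 3
# Value: N & Possibly_not Total: 3
# Value: R & Exclude Total: 27
# Value: R & Exclude* Total: 2,757
# Value: R & Include Total: 99
# Value: R & Maybe Total: 9
# Value: R & Possibly_not Total: 16
# Value: U & Exclude Total: 134
# Value: U & Exclude* Total: 84,872
# Value: U & Include Total: 400
# Value: U & Maybe Total: 1,125
# Value: U & Possibly_not Total: 394
"""

# Assumed line shape: "# Value: <Unicode status> & <IDNA status> Total: <count>"
LINE = re.compile(r"#\s*Value:\s*(\w)\s*&\s*(\S+)\s+Total:\s*([\d,]+)")


def parse_summary(text):
    """Map (unicode_status, idna_status) to its character count."""
    counts = {}
    for match in LINE.finditer(text):
        unicode_status, idna_status, total = match.groups()
        # Counts use a comma as the thousands separator.
        counts[(unicode_status, idna_status)] = int(total.replace(",", ""))
    return counts


def excluded_by_unicode_status(counts):
    """Sum Exclude and Exclude* counts for each Unicode status."""
    excluded = defaultdict(int)
    for (unicode_status, idna_status), total in counts.items():
        if idna_status.startswith("Exclude"):
            excluded[unicode_status] += total
    return dict(excluded)


counts = parse_summary(SUMMARY)
excluded = excluded_by_unicode_status(counts)
for status in sorted(excluded):
    print(status, excluded[status])
```

The U row of the output is the one that matters most for this discussion: it counts the characters that are unrestricted in Unicode identifiers yet excluded, explicitly or by omission, in the proposal.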