L2/06-346

Source: Mark Davis
Date: 2006-10-23
Subject: IDNA proposal

=======

The document draft-faltstrom-idnabis-tables-00.txt drastically reduces which
characters can be used in identifiers in IDNA and probably other IETF protocols.
Some of these restrictions we are on record as agreeing with -- such as
disallowing symbols and punctuation. Others I think are very problematic.

To allow for easier assessment of this document, I generated a comparison at:

  http://www.macchiato.com/idna/idnaComparison.txt

This compares the proposal to the Unicode recommendations.

Caveats: I found draft-faltstrom-idnabis-tables-00.txt very difficult to
interpret. The text version cannot be parsed easily at all (there are linebreaks
in the middle of lines). However, Patrik kindly supplied me with an HTML version.
Dumping that into plaintext allowed for parsing (with a few kluges to get around
section headings, headers, etc.) of section 4, which has the lists of characters.
Even then, the document is quite difficult to interpret. The "Include?" field
does not have clearly defined values. "Maybe" is used in one place for "codepoints
that in general should be allowed to be used in IDN", and in another as
"codepoints that in general should not be allowed to use in IDN, but some scripts
do have codepoints in this class that should be carefully considered as
exceptions". And the distinction between "Maybe", "Possibly", and "Possibly not"
escapes me.

The lists in section 2.1 appear important, but are not easily parsable, and appear
to be overridden by section 4 ("If, instead, we examine blocks of codepoints
and the individual codepoints, retaining the decisions made based on the classes
above, we get the following result."). So I'm just including the section 4 values,
even though there are obvious omissions, like almost 85,000 characters!

The items of particular concern are the characters that are disallowed by this list
(or possibly disallowed) that are allowed in Unicode identifiers -- even those that
don't have any security restrictions in Unicode.
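
As a rough illustration of that filter (a sketch only, not the script used to
generate the comparison), the check could look like the following. The function
and argument names are hypothetical, the status codes are the ones from the
summary in this note, and treating "Possibly_not" as "possibly disallowed" is
my assumption. The sample data is placeholder values, not taken from either
data file.

```python
# Status codes follow the summary legend in this note; the function and
# argument names are hypothetical.
DISALLOWED_BY_PROPOSAL = {"Exclude", "Exclude*", "Possibly_not"}


def characters_of_concern(unicode_status, idna_status, unicode_ok=("U", "R")):
    """List code points allowed in Unicode identifiers but not by the proposal.

    unicode_status maps code point -> Unicode identifier status (N, I, R, U).
    idna_status maps code point -> the proposal's "Include?" value.
    Code points missing from idna_status are treated as "Exclude*".
    """
    flagged = []
    for cp, status in sorted(unicode_status.items()):
        proposal = idna_status.get(cp, "Exclude*")
        if status in unicode_ok and proposal in DISALLOWED_BY_PROPOSAL:
            flagged.append(cp)
    return flagged


# Placeholder data: small integers stand in for real code points.
unicode_status = {1: "U", 2: "U", 3: "R", 4: "N", 5: "U"}
idna_status = {1: "Include", 2: "Possibly_not", 3: "Exclude", 4: "Exclude"}

# Restricted and unrestricted characters together.
print(characters_of_concern(unicode_status, idna_status))
# Only those with no security restrictions in Unicode.
print(characters_of_concern(unicode_status, idna_status, unicode_ok=("U",)))
```

Narrowing unicode_ok to just "U" isolates the characters that have no security
restrictions in Unicode, which are the ones of greatest concern.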

I would urge people to review these characters to see which languages are being
disallowed, and discuss via email. I also presume that the data file will undergo
many revisions, since it is in extremely rough shape. Unfortunately, there are
also few reasons given for excluding or including particular characters; given
the impact on different language communities, one can only hope that reasons
will be supplied in future versions.

Because I expect revisions, I'll try to keep the data more or less up to date,
depending on just how hard it is to parse the file.

Here is a summary (current data):

# Unicode Identifier Status
#	N	Disallowed - never allowed - not LMN: Letter, Mark, or Number

#	I	Allowed only on input (if then) - not case-folded, not NFKC
#	R	Restricted - only allow after careful consideration
#	U	Unrestricted - allowed always
#
# IDNA Proposal Identifier Status
#	As per file, plus:

#	Exclude*	Not mentioned in section 4.

# Summary

# Value: I & Exclude          Total:       23
# Value: I & Exclude*         Total:    3,690
# Value: I & Include          Total:       41

# Value: I & Input            Total:      418
# Value: I & Maybe            Total:        6
# Value: I & Possibly_not     Total:       45
# Value: N & Exclude          Total:      286
# Value: N & Exclude*         Total:       41

# Value: N & Maybe            Total:        3
# Value: N & Possibly_not     Total:        3
# Value: R & Exclude          Total:       27
# Value: R & Exclude*         Total:    2,757
# Value: R & Include          Total:       99

# Value: R & Maybe            Total:        9
# Value: R & Possibly_not     Total:       16
# Value: U & Exclude          Total:      134
# Value: U & Exclude*         Total:   84,872
# Value: U & Include          Total:      400

# Value: U & Maybe            Total:    1,125
# Value: U & Possibly_not     Total:      394
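
For anyone wanting to reproduce counts of this kind, here is a minimal sketch
of the tally, assuming the two classifications have already been parsed into
per-code-point mappings. The function name, the mapping names, and the sample
data are hypothetical; this is not the code that produced the numbers above.

```python
from collections import Counter


def summarize(unicode_status, idna_status):
    """Count code points per (Unicode status, proposal status) pair.

    Code points absent from idna_status are reported as "Exclude*",
    i.e. not mentioned in section 4.
    """
    counts = Counter(
        (status, idna_status.get(cp, "Exclude*"))
        for cp, status in unicode_status.items()
    )
    # Same layout as the summary lines: padded status, right-aligned total.
    return [
        f"# Value: {u} & {i:<17}Total:{n:>9,}"
        for (u, i), n in sorted(counts.items())
    ]


# Placeholder data: small integers stand in for real code points.
sample_unicode = {1: "U", 2: "U", 3: "I"}
sample_idna = {1: "Include", 3: "Exclude"}

for line in summarize(sample_unicode, sample_idna):
    print(line)
```

Because anything missing from the proposal's lists falls into "Exclude*", the
size of that bucket depends directly on how completely section 4 is parsed.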