L2/08-099

Notes on IDN Meeting

M. Davis

Here are my notes on the recent informal IDN meeting. I also list some open questions.

Notes on Consensus

1. Protocol.

In the protocol, there will be precisely three categories: Allowed, Never, and Unassigned.

These categories must be checked by both the Registry and the Resolver.

We will also distinguish a set of contextual constraints, as per the following table. These will be described in detail in either the Protocol document or the Bidi document. Here are the constraints, and where they must be checked.

Constraint                             Registry   Resolver
BIDI restrictions (see idna-bidi)      MUST       SHOULD
isNFC                                  MUST       MUST*
Forbid initial Combining Mark          MUST       SHOULD*
Join Controls in limited contexts*     MUST       MUST

Notes:

For each successive version of Unicode, code points will move from Unassigned into either Allowed or Never. The choice between the latter two is made on the basis of the Tables rules.

Characters may move between Allowed and Never or have additional contextual requirements added, but only in the case of disasters.

The rules for Allowed use Unicode properties plus a small list of exceptions. The rules are based on those in the Tables document 04 for {ALWAYS, MAYBE YES, MAYBE NO, CONTEXTUAL} (see http://tools.ietf.org/html/draft-faltstrom-idnabis-tables, 04).
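Two of the contextual constraints in the table above (isNFC and the ban on an initial combining mark) are simple enough to sketch in code. The following is only an illustration under stated assumptions: the function name `registry_violations` is hypothetical, "Combining Mark" is taken to mean General_Category M, and the BIDI and Join Control checks are deliberately left out because their details are deferred to the Protocol and Bidi documents.

```python
import unicodedata
from typing import List


def registry_violations(label: str) -> List[str]:
    """Return the constraints from the table that this label violates.

    Hypothetical helper: covers only isNFC and the initial-combining-mark
    rule; BIDI and Join Control checks are not modelled here.
    """
    problems = []

    # isNFC: the label must already be in Normalization Form C.
    if unicodedata.normalize("NFC", label) != label:
        problems.append("not NFC")

    # Forbid initial Combining Mark (assumed here to mean gc=Mn, Mc, or Me).
    if label and unicodedata.category(label[0]).startswith("M"):
        problems.append("initial combining mark")

    return problems
```

A Registry would reject any label for which the list is non-empty; a Resolver would apply the same checks at the MUST or SHOULD strength given in the table.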

2. Registry Advice.

Registries need a source of information about which characters are appropriate for which languages and environments. This is decoupled from the Protocol and is not a gating item for the Protocol's release, but it should be worked on actively in parallel. It might not be an RFC; it could instead be hosted by an organization such as UNESCO (or IANA, ICANN, IETF?), probably with active participation by other organizations that can supply information: UNGEGN, Unicode, and so on.

3. Preprocessing RFC

We did not take up the question of a Preprocessing RFC, except to decide that it is also not a gating item for the Protocol's release.

4. Working Group

We agreed to recommend the formation of a working group, with Vint as the chair. He can call on others to share the load wherever necessary. This WG is expected to be set up quickly, to be of short duration, and probably not to hold face-to-face meetings.

Actions:

Misc Notes and Questions

Documents

(not yet reflecting the above consensus at the time of this writing)

http://tools.ietf.org/html/draft-klensin-idnabis-issues
http://tools.ietf.org/html/draft-klensin-idnabis-protocol
http://tools.ietf.org/html/draft-faltstrom-idnabis-tables
http://tools.ietf.org/html/draft-alvestrand-idna-bidi

Table Rules (draft restructuring)

Here are draft table rules after a possible simplification of http://tools.ietf.org/html/draft-faltstrom-idnabis-tables, based on the above consensus. They do not represent a consensus on the tables -- they are just my interpretation of how the consensus could be reflected in the tables.


The rules use Unicode Regex notation, where [:property=value:] is the set of characters having the specified value for the specified property. However, that notation is not necessary for any final document -- it is only used here for simplicity in relating to Unicode properties. (Actually, Unicode regex also allows Perl syntax, such as \p{Cn}, if preferred.) Note: the order of boolean set operations is important; they are applied from top to bottom.


The Categories follow the draft Tables 04 document.



IDN=Allowed is defined as

  Unicode Regex                    Description                                                   Tables 04
  [[:L:][:Mn:][:Mc:][:Nd:]]        // restrict to only letters, marks, numbers                   Category A
- [:NFKC_QC=N:]                    // minus characters unstable under NFKC                       Category B
- [:^isCaseFolded:]                // minus characters unstable under case folding               Category C
- [:di:]                           // minus default-ignorables                                   Category D
- [:IDN_Exceptions=Disallowed:]    // minus exceptional exclusions (currently empty)             New (empty)
+ [:IDN_Exceptions=Allowed:]       // plus exceptional inclusions (see below)                    Category H*
+ [:Join_Control:]                 // plus join controls (with contextual constraints)           Category J*
+ [a-z0-9\-]                       // ASCII LDH (only the '-' is actually significant)           Category G

Category J is changed from {Cf} to just Join_Controls.


U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER


Category H is still under debate. Tables 04 has the following contents.


U+00B7 ( · ) MIDDLE DOT
U+05F3 ( ‎׳‎ ) HEBREW PUNCTUATION GERESH
U+05F4 ( ‎״‎ ) HEBREW PUNCTUATION GERSHAYIM
U+3005 ( 々 ) IDEOGRAPHIC ITERATION MARK
U+3007 ( 〇 ) IDEOGRAPHIC NUMBER ZERO
U+303B ( 〻 ) VERTICAL IDEOGRAPHIC ITERATION MARK
U+30FB ( ・ ) KATAKANA MIDDLE DOT
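To make the order of the set operations in the IDN=Allowed definition concrete, here is a rough per-character sketch using only the Python standard library. It is an approximation under several assumptions, not a reference implementation: `is_allowed` and the constant names are hypothetical; NFKC_QC=N is approximated by "changes under NFKC"; isCaseFolded is approximated by `str.casefold`; the default-ignorable test uses a partial, hand-copied subset of Default_Ignorable_Code_Point; and the exceptional inclusions are simply the seven characters listed above, which are still under debate. Results also depend on the Unicode version bundled with the Python runtime.

```python
import unicodedata

# Hypothetical stand-ins for the IDN_Exceptions property values.
EXCEPTIONS_DISALLOWED = frozenset()  # currently empty
EXCEPTIONS_ALLOWED = frozenset("\u00B7\u05F3\u05F4\u3005\u3007\u303B\u30FB")
JOIN_CONTROLS = frozenset("\u200C\u200D")
ASCII_LDH = frozenset("abcdefghijklmnopqrstuvwxyz0123456789-")


def is_default_ignorable(ch: str) -> bool:
    """Partial subset of Default_Ignorable_Code_Point (illustration only)."""
    cp = ord(ch)
    return (
        cp in (0x034F, 0x115F, 0x1160, 0x17B4, 0x17B5, 0x3164, 0xFFA0)
        or 0x180B <= cp <= 0x180D
        or 0xFE00 <= cp <= 0xFE0F
        or 0xE0100 <= cp <= 0xE01EF
    )


def is_allowed(ch: str) -> bool:
    """Apply the table's set operations to one character, top to bottom."""
    cat = unicodedata.category(ch)
    # Start: letters, nonspacing/spacing marks, decimal digits.
    allowed = cat.startswith("L") or cat in ("Mn", "Mc", "Nd")
    # Minus characters unstable under NFKC (approximates NFKC_QC=N).
    allowed = allowed and unicodedata.normalize("NFKC", ch) == ch
    # Minus characters unstable under case folding (approximation).
    allowed = allowed and ch.casefold() == ch
    # Minus default-ignorables (partial list, see above).
    allowed = allowed and not is_default_ignorable(ch)
    # Minus exceptional exclusions, then plus the three additions.
    allowed = allowed and ch not in EXCEPTIONS_DISALLOWED
    allowed = allowed or ch in EXCEPTIONS_ALLOWED
    allowed = allowed or ch in JOIN_CONTROLS
    allowed = allowed or ch in ASCII_LDH
    return allowed
```

Because the additions come last, a character such as U+200D is admitted even though it fails the first filter, while "A" is rejected by the case-folding step and never re-added. Reordering the steps would change the result, which is why the order of operations matters.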


My opinion:

The Unicode utilities can be used to view the above, for example:

Unassigned is defined as

Unicode Regex    Description                   Tables 04
[:Cn:]           // unassigned code points     Category K


Disallowed is defined as

  Unicode Regex            Description
  [\u0000-\U0010FFFF]      // All Unicode code points
- Unassigned               // minus Unassigned
- Allowed                  // minus Allowed
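The three definitions partition the code space, which can be sketched as a single classification function. This is only an illustration: `classify` is a hypothetical name, the Allowed test is passed in as a caller-supplied predicate rather than implemented here, and what counts as gc=Cn depends on the Unicode version bundled with the Python runtime.

```python
import unicodedata
from typing import Callable


def classify(cp: int, is_allowed: Callable[[str], bool]) -> str:
    """Place one code point into Unassigned, Allowed, or Disallowed.

    `is_allowed` is a hypothetical predicate standing in for the
    IDN=Allowed definition; any implementation of it can be plugged in.
    """
    ch = chr(cp)
    # Unassigned: [:Cn:]
    if unicodedata.category(ch) == "Cn":
        return "Unassigned"
    # Allowed: whatever the supplied predicate accepts.
    if is_allowed(ch):
        return "Allowed"
    # Disallowed: all code points minus Unassigned minus Allowed.
    return "Disallowed"
```

For example, with a toy predicate that accepts only lowercase ASCII letters, `classify(ord("a"), str.islower)` yields "Allowed" and `classify(ord(" "), str.islower)` yields "Disallowed".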