L2/06-353

From: Mark Davis
Date: 2006-10-24
Subject: ZWJ/NJ in identifiers

Eric and I had an action regarding ZWJ/NJ. Here is a strawman document for the meeting.

Normally, format characters are excluded from identifiers because they allow two apparently identical strings to represent different underlying strings. However, for historical reasons, certain format characters are used in particular cases to mark visible distinctions that carry important semantic differences in certain languages. Identifier systems that attempt to provide more natural representations of terms, such as geographic names and company names, should consider allowing these characters, but only in the following contexts.

The match to the regular expressions below must also consist only of characters from a single script (after ignoring Common and Inherited script characters).

ZWNJ in the following contexts:

  1. At a position in a string where it breaks a cursive connection between adjacent characters. That is, in the context based on the Arabic Shaping using the following regular expression:
  2. In a conjunct context, that is, a sequence of the form 

ZWJ in the following contexts:

  1. In a conjunct context, that is, a sequence of the form 

Because these characters are rare, checking for them has no appreciable performance implications. Note that while it would be possible to make the contexts listed above somewhat narrower, in practice there is no advantage to that, and the contexts above are computationally simpler.
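
To give a feel for how cheap such a context check can be, here is a minimal Python sketch of a conjunct-context test. The exact sequence is not spelled out above, so the pattern used here is an assumption made only for illustration: a letter, then any number of nonspacing marks, then a virama (canonical combining class 9), immediately followed by the ZWJ or ZWNJ. The function names are hypothetical. The single-script condition and the Arabic Shaping context are not covered, since the Python standard library does not expose the Script or Joining Type properties.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # canonical combining class shared by virama characters


def joiner_in_assumed_conjunct_context(text, i):
    """Return True if text[i] is ZWJ/ZWNJ preceded by Letter Mark* Virama.

    The Letter Mark* Virama pattern is an assumption for this sketch,
    not a pattern taken from the document.
    """
    if text[i] not in (ZWNJ, ZWJ):
        return False
    j = i - 1
    # The character immediately before the joiner must be a virama.
    if j < 0 or unicodedata.combining(text[j]) != VIRAMA_CCC:
        return False
    j -= 1
    # Skip any nonspacing marks between the base letter and the virama.
    while j >= 0 and unicodedata.category(text[j]) == "Mn":
        j -= 1
    # The sequence must start with a letter (general category L*).
    return j >= 0 and unicodedata.category(text[j]).startswith("L")


def all_joiners_ok(text):
    """Return True if every ZWJ/ZWNJ in text sits in the assumed context."""
    return all(
        joiner_in_assumed_conjunct_context(text, i)
        for i, ch in enumerate(text)
        if ch in (ZWNJ, ZWJ)
    )
```

A string without any ZWJ or ZWNJ passes trivially, so the extra work is done only at the rare positions where one of these characters occurs.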