L2/06-353
From: Mark Davis
Date: 2006-10-24
Subject: ZWJ/NJ in identifiers

Eric and I had an action regarding ZWJ/NJ. Here is a strawman document for the 
meeting.

Normally, format characters are excluded from identifiers, because their use 
allows two apparently identical strings to represent different underlying 
strings. However, for historical reasons, certain format characters are used in 
particular cases to mark visible distinctions that carry important semantic 
differences in certain languages. Identifier systems that attempt to provide 
more natural representations of terms, such as geographic names, company names, 
and so on, should consider allowing these characters, but only in the following 
contexts. 
In addition, the text matched by each of the regular expressions below must 
consist of characters from a single script (after ignoring Common and Inherited 
script characters).
ZWNJ in the following contexts:
	- At a position in a string where it breaks a cursive connection between 
	adjacent characters. That is, in a context based on Arabic shaping (the 
	Joining_Type property) that matches the following regular expression: 
	
		- /$L $T? ZWNJ $T? $R/
		where:
 
			- $T = [:Joining_Type=Transparent:]
 
			- $R = [[:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:]] 
			
 
			- $L = [[:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:]]
  
		
		 
		- Example: Farsi <Noon, Alef, 
		Meem, Heh, Alef, Farsi Yeh>. Without a ZWNJ, it translates to "names"; 
		with a ZWNJ between Heh and Alef, it means "a letter".
  
	
	 
	- In a conjunct context, that is, a sequence of the form  
	
		- /$L $M* $V ZWNJ $M* $L/
		where:
 
			- $L = [:General_Category=Letter:]
 
			- $M = [:General_Category=Mark:]
 
			- $V = [:Canonical_Combining_Class=Virama:]
  
		
		 
		- Example: in Malayalam, we recommend the use of ZWJ and ZWNJ to make 
		distinctions involving cillu forms (see p. 337 of TUS 5.0). This status 
		changes once the cillu forms are separately encoded in Unicode 5.1.
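
As a rough illustration of the cursive-connection context for ZWNJ, the check can be sketched in Python as follows. The Joining_Type table here is a small hand-picked subset of ArabicShaping.txt covering only the characters of the Farsi example plus one transparent mark, and the function name is hypothetical. The sketch assumes that the character before ZWNJ must be able to join to the following character (Dual_Joining or Left_Joining) and that the character after ZWNJ must be able to join to the preceding one (Dual_Joining or Right_Joining), which is the reading under which the Farsi example matches (Heh is Dual_Joining, Alef is Right_Joining). It does not check the single-script condition.

```python
ZWNJ = "\u200C"

# Illustrative subset of Joining_Type values (assumption: a real
# implementation would load the full data from ArabicShaping.txt).
JOINING_TYPE = {
    "\u0627": "R",  # ARABIC LETTER ALEF       (Right_Joining)
    "\u0645": "D",  # ARABIC LETTER MEEM       (Dual_Joining)
    "\u0646": "D",  # ARABIC LETTER NOON       (Dual_Joining)
    "\u0647": "D",  # ARABIC LETTER HEH        (Dual_Joining)
    "\u06CC": "D",  # ARABIC LETTER FARSI YEH  (Dual_Joining)
    "\u064E": "T",  # ARABIC FATHA             (Transparent)
}


def joining_type(ch):
    """Return the Joining_Type code for ch, or 'U' (non-joining) if unknown."""
    return JOINING_TYPE.get(ch, "U")


def zwnj_breaks_cursive_connection(text, i):
    """Return True if text[i] is a ZWNJ sitting between two characters
    that would otherwise connect, allowing at most one Transparent
    character on each side (the "$T?" in the pattern)."""
    if text[i] != ZWNJ:
        return False
    before = i - 1
    if before >= 0 and joining_type(text[before]) == "T":
        before -= 1  # skip one optional transparent character
    after = i + 1
    if after < len(text) and joining_type(text[after]) == "T":
        after += 1  # skip one optional transparent character
    if before < 0 or after >= len(text):
        return False
    return (joining_type(text[before]) in ("D", "L")
            and joining_type(text[after]) in ("D", "R"))


# <Noon, Alef, Meem, Heh, ZWNJ, Alef, Farsi Yeh>
letter = "\u0646\u0627\u0645\u0647\u200C\u0627\u06CC"
print(zwnj_breaks_cursive_connection(letter, 4))
```

A ZWNJ placed after Alef would be rejected by this sketch, because Alef does not connect to the following character and the ZWNJ would therefore make no visible difference.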
 
	
	 
ZWJ in the following contexts:
	- In a conjunct context, that is, a sequence of the form 
		- /$L $M* $V ZWJ $M* $L/
		where:
 
			- $L = [:General_Category=Letter:]
 
			- $M = [:General_Category=Mark:]
 
			- $V = [:Canonical_Combining_Class=Virama:]
  
		
		 
		- Example: Devanagari RA + VIRAMA + ZWJ + KA 
 
		- Example: Sinhala 'ශ්‍රී ලංකා' (the country 'Sri Lanka'), which uses both 
		a space character and a ZWJ. Removing the space gives 'ශ්‍රීලංකා', which 
		is still readable, but removing the ZWJ completely changes the 
		appearance of the 'Sri' cluster and gives the following text: 
		'ශ්රී ලංකා'. 
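
The conjunct context is the same for ZWJ and ZWNJ apart from the joiner itself, so one check can serve both. The following Python sketch uses the standard unicodedata module; the function names are hypothetical, and the single-script condition is not checked. It relies on Canonical_Combining_Class=Virama having the numeric value 9.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # numeric value of Canonical_Combining_Class=Virama


def is_letter(ch):
    """General_Category=Letter (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def is_mark(ch):
    """General_Category=Mark (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def is_virama(ch):
    return unicodedata.combining(ch) == VIRAMA_CCC


def joiner_in_conjunct_context(text, i):
    """Return True if text[i] is a ZWJ or ZWNJ in a sequence of the form
    letter, marks*, virama, joiner, marks*, letter."""
    if text[i] not in (ZWJ, ZWNJ):
        return False
    # The joiner must immediately follow a virama.
    j = i - 1
    if j < 0 or not is_virama(text[j]):
        return False
    # Skip any marks before the virama, then require a letter.
    j -= 1
    while j >= 0 and is_mark(text[j]):
        j -= 1
    if j < 0 or not is_letter(text[j]):
        return False
    # Skip any marks after the joiner, then require a letter.
    k = i + 1
    while k < len(text) and is_mark(text[k]):
        k += 1
    return k < len(text) and is_letter(text[k])


# Devanagari RA + VIRAMA + ZWJ + KA
print(joiner_in_conjunct_context("\u0930\u094D\u200D\u0915", 2))
```

An identifier parser could call such a check for each joiner it encounters and reject the identifier when the check fails.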
 
	
	 
Because these characters are rare, checking these contexts does not have any 
appreciable performance implications. Note that while it would be possible to 
make the contexts listed above somewhat narrower, in practice there is no 
advantage to doing so, and the contexts above are computationally simpler.