L2/06-353

From: Mark Davis
Date: 2006-10-24
Subject: ZWJ/NJ in identifiers

Eric and I had an action regarding ZWJ/NJ. Here is a strawman document for the meeting.

Normally format characters are excluded from identifiers, because they allow two apparently identical strings to represent different underlying strings. However, for historical reasons, certain format characters are used in particular cases to mark visible distinctions that carry important semantic differences in certain languages. Identifier systems that attempt to provide more natural representations of terms, such as geographic names, company names, and so on, should consider allowing these characters, but only in the following contexts.

Any match to the regular expressions below must also consist only of characters from a single script (after ignoring Common and Inherited Script characters).

ZWNJ in the following contexts:

At a position in a string where it breaks a cursive connection between adjacent characters. That is, in a context based on the Joining_Type values from ArabicShaping.txt, matching the following regular expression:
- /$L $T? ZWNJ $T? $R/
  where:
  - $T = [:Joining_Type=Transparent:]
  - $R = [[:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:]]
  - $L = [[:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:]]
- Example: Farsi <Noon, Alef, Meem, Heh, Alef, Farsi Yeh>. Without a ZWNJ, it translates to "names"; with a ZWNJ between Heh and Alef, it means "a letter".

In a conjunct context, that is, a sequence of the form
- /$L $M* $V ZWNJ $M* $L/
  where:
  - $L = [:General_Category=Letter:]
  - $M = [:General_Category=Mark:]
  - $V = [:Canonical_Combining_Class=Virama:]
- Example: in Malayalam, we recommend the use of ZWJ and ZWNJ to make distinctions involving cillu forms. (See p. 337 of TUS 5.0.) The status changes once the cillu forms are separately encoded in 5.1.

ZWJ in the following contexts:

In a conjunct context, that is, a sequence of the form
- /$L $M* $V ZWJ $M* $L/
  where:
  - $L = [:General_Category=Letter:]
  - $M = [:General_Category=Mark:]
  - $V = [:Canonical_Combining_Class=Virama:]
- Example: Devanagari RA + VIRAMA + ZWJ + KA
- Example: Sinhala 'ශ්රී ලංකා' (the country 'Sri Lanka'), which uses both a space character and a ZWJ. Removing the space gives 'ශ්රීලංකා' which is still readable, but removing the ZWJ completely modifies the appearance of the 'Sri' cluster and gives the following text: 'ශ්රී ලංකා'.
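
The conjunct context shared by the ZWNJ and ZWJ rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. The function names are hypothetical, the property lookups rely on whatever Unicode data ships with Python's `unicodedata` module (where Canonical_Combining_Class=Virama is the numeric value 9), and the sketch checks neither the cursive-connection context nor the single-script restriction.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # numeric value of Canonical_Combining_Class=Virama


def is_letter(ch):
    # $L = [:General_Category=Letter:]
    return unicodedata.category(ch).startswith("L")


def is_mark(ch):
    # $M = [:General_Category=Mark:]
    return unicodedata.category(ch).startswith("M")


def is_virama(ch):
    # $V = [:Canonical_Combining_Class=Virama:]
    return unicodedata.combining(ch) == VIRAMA_CCC


def joiner_in_conjunct_context(text, i):
    """True if text[i] is ZWJ or ZWNJ and sits in /$L $M* $V <joiner> $M* $L/."""
    if text[i] not in (ZWJ, ZWNJ):
        return False
    # The character immediately before the joiner must be a virama ($V).
    j = i - 1
    if j < 0 or not is_virama(text[j]):
        return False
    # Walk back over any marks ($M*) and require a letter ($L) before them.
    j -= 1
    while j >= 0 and is_mark(text[j]):
        j -= 1
    if j < 0 or not is_letter(text[j]):
        return False
    # Walk forward over any marks ($M*) and require a letter ($L) after them.
    k = i + 1
    while k < len(text) and is_mark(text[k]):
        k += 1
    return k < len(text) and is_letter(text[k])


def joiners_allowed(text):
    """True if every ZWJ/ZWNJ in text occurs in the conjunct context."""
    return all(
        joiner_in_conjunct_context(text, i)
        for i, ch in enumerate(text)
        if ch in (ZWJ, ZWNJ)
    )
```

For the Devanagari example, `joiner_in_conjunct_context("\u0930\u094D\u200D\u0915", 2)` accepts the ZWJ, since RA and KA are Letters and VIRAMA has combining class 9. A joiner placed directly between two letters, as in `"\u0915\u200C\u0915"`, is rejected.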

Because these characters are rare, checking for these contexts does not have any appreciable performance implications. Note that while it would be possible to make the contexts listed above somewhat narrower, in practice there is no advantage in doing so, and the contexts as stated are computationally simpler.