L2/14-187
Title: Cherokee casing decision may break identifier syntax
Date: July 31, 2014
Author: Ken Whistler
Action: For consideration by the UTC

Back in late 2012, there was a discussion of how casing and case folding interact with the identifier syntax of a number of declarative programming languages, including Prolog, Erlang, and others. These languages are formally defined as making a syntax distinction based on the casing of the *initial* character of an identifier, to separate variables from constants, and they use other special conventions for identifiers that start with gc=Lo letters or syllables. Such languages are currently defined using general syntax rules of the following type, which refer specifically to Unicode General_Category values that make case distinctions:

    var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_Id_Start)) ∪ Pc
    atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº") | "." (Ll ∪ Lo)

In the context of examining my action item 139-A073, to review the General_Category assignments for Cherokee for Unicode 8.0, I went back and examined that thread, and I think we have a problem here.

One of the conclusions from that discussion was that the existing identifier syntaxes for such languages would be o.k. with a transition whereby an existing gc=Lo character became gc=Ll. That would be the expected outcome of making a unicameral script bicameral by encoding a new set of *uppercase* letters for it, and then adding case pairs by making the existing set of letters the lowercase.

However, the syntax cannot handle a transition whereby an existing gc=Lo character becomes gc=Lu. In other words, it does not handle the situation in which an existing unicameral script becomes bicameral by introducing a new set of *lowercase* letters for it, and then adding case pairs by making the existing set of letters the uppercase. For a variety of reasons, though, that is exactly what we have ended up doing for Cherokee.
The existing characters become the uppercase, and the new characters become the lowercase.

We have proposed to preserve our case folding stability rule by making an exception and forcing Cherokee to case fold to uppercase, rather than to lowercase. However, I don’t think that solves the problem of formal language syntaxes for declarative languages which have baked in BNF definitions using Lu and Ll directly.

I don’t think this is quite as simple as saying, oh, well, for those languages Cherokee identifier usage will be backwards, so that for Cherokee (and Cherokee only) in Prolog variables start with an uppercase letter and constants start with a lowercase letter, instead of vice versa. The problem is that for any *existing* program text, a change of Lo --> Lu will create an actual syntax error in the text, and the text will fail to compile.

Maybe that outcome is o.k., and everybody will just accept that Cherokee text is broken in all these formal language syntaxes, but I think at the very least we will need to call attention to that fact explicitly.

In order to fix the language syntaxes, they will at some point presumably need to be modified with exception sets that subtract out the Cherokee ranges from the valid sets of characters for both constants and variables in those languages. In fact, the original discussion thread raised the issue that it might make sense to limit those character sets more sharply, to avoid baking in syntax based on casing for relatively recently encoded and limited-use scripts which might not actually be “ready for prime time” for that kind of syntax.

In any case, I think this oddity and exception that we are going to put into place for Cherokee should be very explicitly documented in UAX #31 for Unicode 8.0, as it directly impacts the use of Cherokee characters in identifiers for a significant class of formal language syntaxes.
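
To make the concern concrete, here is a minimal Python sketch of how a Lo --> Lu change moves a Cherokee letter from the atom (constant) class into the variable class under rules of the type quoted above. It is a simplification and not any language's actual lexer: the Other_Id_Start, "ªº" and "." alternatives are left out, General_Category values come from small hand-written tables instead of a real Unicode database, and the names `start_class`, `gc_unicameral`, `gc_bicameral` and `CHEROKEE_BLOCK` are hypothetical. The `excluded` parameter illustrates the kind of exception set mentioned above, here assumed to cover the whole Cherokee block.

```python
# Simplified model of the var_start / atom_start rules quoted in the memo.
# Assumptions: the Other_Id_Start, "ªº" and "." alternatives are left out, and
# General_Category values come from a caller-supplied table rather than a real
# Unicode database, so the before/after comparison is explicit.

VARIABLE_GC = {"Lu", "Lt", "Pc"}        # categories that start a variable
ATOM_GC = {"Ll", "Lm", "Lo", "Nl"}      # rest of XID_Start: start an atom

# Hypothetical exception set: the Cherokee block, U+13A0..U+13FF.
CHEROKEE_BLOCK = range(0x13A0, 0x1400)


def start_class(ch, gc_table, excluded=()):
    """Classify ch as an identifier start: 'variable', 'atom', or 'invalid'."""
    if ord(ch) in excluded:
        return "invalid"
    gc = gc_table.get(ch)
    if gc in VARIABLE_GC:
        return "variable"
    if gc in ATOM_GC:
        return "atom"
    return "invalid"


CHEROKEE_A = "\u13A0"  # CHEROKEE LETTER A

# Hand-written category tables: before and after the Lo --> Lu change.
gc_unicameral = {CHEROKEE_A: "Lo", "a": "Ll", "A": "Lu"}
gc_bicameral = {CHEROKEE_A: "Lu", "a": "Ll", "A": "Lu"}

for ch in (CHEROKEE_A, "a", "A"):
    before = start_class(ch, gc_unicameral)
    after = start_class(ch, gc_bicameral)
    print(f"U+{ord(ch):04X}: {before} -> {after}")

# With the hypothetical exclusion set, Cherokee is rejected under either table.
print(start_class(CHEROKEE_A, gc_bicameral, excluded=CHEROKEE_BLOCK))
```

In this model only the Cherokee letter changes class between the two tables; the Latin letters are unaffected. Existing text that used a Cherokee-initial identifier as a constant would therefore be read differently after the change, which is the breakage described above.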