L2/14-187
 
Title:  Cherokee casing decision may break identifier syntax
Date:   July 31, 2014
Author: Ken Whistler
Action: For consideration by the UTC
 
Back in late 2012, there was a discussion of how casing
and case folding interacted with the identifier syntax
of a number of declarative programming languages,
including Prolog, Erlang, and others, which are
formally defined as making a syntax distinction based on
the casing of the *initial* character of identifiers for
variables as opposed to constants, and which
use other special conventions for identifiers that
start with gc=Lo letters or syllables.
 
Such languages are currently defined using general
syntax rules of the following type, which make reference
specifically to Unicode General_Category values that
make case distinctions:
 
    var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_Id_Start)) ∪ Pc
    atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº") |  "." (Ll ∪ Lo)
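
To make the rules above concrete, here is a minimal Python sketch of the two start-character tests. It is an illustration under stated assumptions, not any language's actual lexer: the function names are hypothetical, XID_Start is approximated with str.isidentifier(), the Other_ID_Start set is hard-coded (as of Unicode 7.0, to my knowledge) because the unicodedata module does not expose it, and the second atom_start alternative ("." followed by Ll or Lo) is left out.

```python
import unicodedata

# Assumption: Other_ID_Start as of Unicode 7.0, hard-coded because the
# unicodedata module does not expose this property.
OTHER_ID_START = {"\u2118", "\u212E", "\u309B", "\u309C"}


def is_xid_start(ch):
    # For a single character, str.isidentifier() accepts XID_Start or "_",
    # so "_" is excluded to approximate XID_Start alone.
    return ch != "_" and ch.isidentifier()


def is_var_start(ch):
    # var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_Id_Start)) ∪ Pc
    gc = unicodedata.category(ch)
    cased = gc in ("Lu", "Lt") or ch in OTHER_ID_START
    return (is_xid_start(ch) and cased) or gc == "Pc"


def is_atom_start(ch):
    # First alternative only: XID_Start \ (Lu ∪ Lt ∪ "ªº")
    gc = unicodedata.category(ch)
    return is_xid_start(ch) and gc not in ("Lu", "Lt") and ch not in "ªº"
```

The answers for any given character depend on the version of the Unicode Character Database that the Python runtime was built with, which is exactly the sensitivity at issue here.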
 
In the context of examining my action item 139-A073,
to review the General_Category assignments for Cherokee
for Unicode 8.0, I went back and examined that thread,
and I think we have a problem here.
 
One of the conclusions from that discussion was that
the existing identifier syntaxes for such languages
would be o.k. with a transition whereby an existing
gc=Lo character became gc=Ll, which would be the
expected outcome of making a unicameral script
bicameral by encoding a new set of *uppercase* letters
for it, and then adding case pairs by making the
existing set of letters be the lowercase for it.
 
However, the syntax cannot handle a transition
whereby an existing gc=Lo character becomes gc=Lu.
In other words, it does not handle the situation
in which an existing unicameral script becomes
bicameral by introducing a new set of *lowercase*
letters for it, and then adding case pairs by making
the existing set of letters be the uppercase for it.
 
For a variety of reasons, though, that is exactly
what we have ended up doing for Cherokee. The
existing characters become the uppercase, and
the new characters become the lowercase.
 
We have proposed to preserve our case folding stability
rule by making an exception and forcing Cherokee to
case fold to uppercase, rather than to lowercase.
However, I don’t think that solves the problem of
formal language syntaxes for declarative languages
which have baked in BNF definitions using Lu and Ll
directly.
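
As a rough illustration of why case folding and General_Category are separate questions, the following Python sketch reports both for one Cherokee case pair. It assumes a Python runtime built with Unicode 8.0 or later data, and it assumes the code points U+13A0 (CHEROKEE LETTER A) and U+AB70 (CHEROKEE SMALL LETTER A) as published in Unicode 8.0; neither code point is stated in this document, and the function name is hypothetical.

```python
import unicodedata

# Assumption: code points as published in Unicode 8.0.
UPPER_A = "\u13A0"  # CHEROKEE LETTER A (the pre-existing letter)
LOWER_A = "\uAB70"  # CHEROKEE SMALL LETTER A (the newly encoded letter)


def cherokee_fold_report():
    """Report folding and category data from the runtime's Unicode database."""
    return {
        "unidata_version": unicodedata.unidata_version,
        "upper_gc": unicodedata.category(UPPER_A),
        "lower_gc": unicodedata.category(LOWER_A),
        # The folding exception: the small letter folds to the existing letter.
        "lower_folds_to_upper": LOWER_A.casefold() == UPPER_A,
        # The existing letter is stable under case folding.
        "upper_is_fold_stable": UPPER_A.casefold() == UPPER_A,
    }


if __name__ == "__main__":
    for key, value in cherokee_fold_report().items():
        print(key, value)
```

Folding stability is preserved for the existing letters, but a grammar that tests Lu and Ll directly never consults case folding, so it sees only the changed General_Category.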
 
I don’t think this is quite as simple as saying, oh, well,
for those languages Cherokee identifier usage will
be backwards, so that for Cherokee (and Cherokee only)
in Prolog variables start with an uppercase letter
and constants start with a lowercase letter, instead
of vice versa. The problem is that for any *existing*
program text, a change of Lo --> Lu will create an
actual syntax error in the text, which will then fail
to compile.
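
The difference between the two kinds of transition can be sketched in Python by classifying an initial character from its General_Category value alone. This is a simplification with hypothetical function names: it assumes the character is otherwise a valid identifier start and ignores Other_ID_Start and the "ªº" exclusion.

```python
def start_class(gc):
    """Classify an identifier-initial character by General_Category alone."""
    if gc in ("Lu", "Lt", "Pc"):
        return "variable"
    if gc in ("Ll", "Lm", "Lo", "Nl"):
        return "atom"
    return "not an identifier start"


def transition_effect(old_gc, new_gc):
    """Describe what a General_Category change does to existing identifiers."""
    old, new = start_class(old_gc), start_class(new_gc)
    return "unchanged" if old == new else f"{old} -> {new}"


print(transition_effect("Lo", "Ll"))  # unchanged
print(transition_effect("Lo", "Lu"))  # atom -> variable
```

An identifier that used to lex as an atom now lexes as a variable, so existing text that uses it in a position where only an atom is allowed no longer parses.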
 
Maybe that outcome is o.k., and everybody will just
accept that Cherokee text is broken in all these
formal language syntaxes, but I think at the very
least we will need to call attention to that fact
explicitly.
 
In order to fix the language syntaxes, they will at
some point presumably need to be modified with
exception sets that subtract out the Cherokee ranges
from the valid sets of characters for both constants
and variables in those languages. In fact, the original
discussion thread raised the issue that it might
make sense to limit them more sharply to avoid baking
in syntax based on casing for relatively recently
encoded and limited-use scripts which might not
actually be “ready for prime time” for that kind of
syntax.
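
One possible shape for such an exception set, sketched in Python. The ranges are an assumption on my part rather than something specified here: U+13A0..U+13FF for the Cherokee block and U+AB70..U+ABBF for the Cherokee Supplement block as published in Unicode 8.0. The function names are hypothetical, and str.isidentifier() stands in for the language's own start-character test.

```python
# Assumption: block ranges as published in Unicode 8.0.
CHEROKEE_RANGES = (
    (0x13A0, 0x13FF),  # Cherokee
    (0xAB70, 0xABBF),  # Cherokee Supplement
)


def is_cherokee(ch):
    """True if ch falls inside one of the assumed Cherokee block ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CHEROKEE_RANGES)


def allowed_start(ch):
    """Identifier-start test with the Cherokee ranges subtracted out."""
    # str.isidentifier() on a single character accepts XID_Start or "_".
    return ch.isidentifier() and not is_cherokee(ch)
```

Subtracting the ranges from both var_start and atom_start means Cherokee-initial identifiers are rejected outright instead of silently changing class.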
 
In any case, I think this oddity and exception that
we are going to put into place for Cherokee should
be very explicitly documented in UAX #31 for Unicode 8.0,
as it directly impacts the use of Cherokee characters
in identifiers for a significant class of formal
language syntaxes.