Re: Rationale wanted for Unicode identifier rules

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Mar 01 2000 - 14:47:22 EST


John Cowan asked:

> (Still waiting for my bookstore to get 3.0 book.)
>
> Section 5.14 of 2.0 says:
>
> # The formal syntax provided here is intended to capture the general
> # intent that an identifier consists of a string of characters that starts
> # with a letter or an ideograph, and then follows with any number of letters,
> # ideographs, digits, or underscores.
>
> Can anyone give me a rationale for rejecting the following argument:
>
> > There are some [syntax] characters we know we need to prohibit [in
> > identifiers, such as +, -, etc.], as well as a couple of ranges of
> > control characters, but other than that I'm not sure why it's worth
> > bothering.

First of all, identifier syntax has traditionally been implicitly based
on character properties, although seldom implemented that way. For
ASCII-based languages, that was relatively simple. You had a-zA-Z in
one class, 0..9 in another class, and then everything else was "punctuation",
most of which was drafted into the formal syntax for representation of
operators and delimiters.

When identifier syntaxes were extended for Asian character sets -- notably
Japanese -- kana and ideographs were treated as "letters" -- essentially an
extension of the letter property. Some formal language standards, such
as the SQL standard, specifically made this a part of their normative identifier
syntax: identifiers were to consist of sequences of letters, *syllables*,
or *ideographs* (and non-initial digits).

We now have fairly stable lists of character properties for all of
Unicode, and the question of identifier extension for the universal
character set essentially boils down to a couple of options:

  A. Pursue the direction implied by the Asian extensions; i.e. allow
     identifiers to be "wordlike" entities, but disallow miscellaneous
     symbols in them.

  B. Prohibit just a small number of "syntax" characters from occurring
     in identifiers, but allow anything else.

Direction A is formally endorsed by ISO TR 10176, Guidelines for the preparation
of programming language standards. It is also the direction taken by Java.

Why A, rather than B?

First of all, the resultant set of possible identifiers is more wordlike.
And since a *reasonable* use of identifiers is to make formal language
program text legible and maintainable, this is a *good* thing. There is
no advantage to a formal programming language standard allowing the
Unicode equivalent of "©§¢¥¡¶¶»»»»" (and insert random dingbats and
box-drawing characters at your pleasure) to be a valid identifier.

You could argue that the compiler doesn't give a hoot, and that in some
instances program text is generated and never seen by human eyes -- so
why make this constraint? But for generated identifiers, "a457637271"
is just as functional as "©§¢¥¡¶¶»»»»" would be -- and is likely to be
easier to debug in any case. Effectively this is no real constraint on
the formal syntax. Nobody is ever going to run out of generable identifiers,
even without the use of dingbats and box-drawing characters.

Secondly, it is actually *easier* to implement A than to implement B.
The Unicode 3.0 statement of identifier syntax is even simpler than
that proposed in Unicode 2.0 -- it depends merely on the general category
character property, easily implementable from the Unicode Character Database
in a very efficient way. For particular languages, you then make exceptions
for the usual suspects ("_", "&", "@", etc.). The net result is easy to
specify and very fast and simple to implement in a lexer.
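
To make that concrete, here is a minimal sketch in Java (whose
java.lang.Character class exposes the general category directly). The
category sets below are one reading of the general-category-based rule,
and the extra allowance for "_" stands in for the per-language
exceptions just mentioned:

  // A sketch of approach A: classify identifier code points purely by
  // Unicode general category, plus a per-language exception for "_".
  public final class IdentifierChars {

      // Lu, Ll, Lt, Lm, Lo, and Nl may begin an identifier.
      public static boolean isIdentifierStart(int cp) {
          if (cp == '_') {                        // language-specific exception
              return true;
          }
          switch (Character.getType(cp)) {
              case Character.UPPERCASE_LETTER:    // Lu
              case Character.LOWERCASE_LETTER:    // Ll
              case Character.TITLECASE_LETTER:    // Lt
              case Character.MODIFIER_LETTER:     // Lm
              case Character.OTHER_LETTER:        // Lo
              case Character.LETTER_NUMBER:       // Nl
                  return true;
              default:
                  return false;
          }
      }

      // Non-initial positions additionally admit combining marks, digits,
      // and connector punctuation: Mn, Mc, Nd, Pc.
      public static boolean isIdentifierPart(int cp) {
          switch (Character.getType(cp)) {
              case Character.NON_SPACING_MARK:        // Mn
              case Character.COMBINING_SPACING_MARK:  // Mc
              case Character.DECIMAL_DIGIT_NUMBER:    // Nd
              case Character.CONNECTOR_PUNCTUATION:   // Pc
                  return true;
              default:
                  return isIdentifierStart(cp);
          }
      }
  }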

The alternative along the lines of B could also be implemented with a fast,
efficient table, of course, but the problem is determining exactly *what*
the contents of that table should be, and ensuring interoperability with
other people's interpretation of how to do B. At first it might seem
obvious. Just exempt the ASCII "operators" and "delimiters" -- but then
the trouble begins. Should CJK brackets be treated as brackets or be
allowed in identifiers simply because they are "other junk" outside the
range of ASCII formal syntax definitions? How about CJK quote marks -- should
they be treated as quote marks or just be treated as "more junk" includable
in identifiers? How about U+2212 MINUS SIGN -- surely an operator, and
not part of an identifier?
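
Under the A approach those questions answer themselves: the general
category already files all of these characters under punctuation or
symbols, and none of those categories is identifier material. A quick
check in Java (again leaning on java.lang.Character for the Unicode
properties) illustrates the point:

  // Under approach A, the general category alone settles these questions:
  // CJK brackets and quotes are Ps/Pe, MINUS SIGN is Sm, and none of those
  // categories is admitted into identifiers -- no hand-built table needed.
  public class CategoryDemo {
      public static void main(String[] args) {
          int[] cps = { 0x300C,    // LEFT CORNER BRACKET (CJK quote), category Ps
                        0xFF08,    // FULLWIDTH LEFT PARENTHESIS, category Ps
                        0x2212 };  // MINUS SIGN, category Sm
          for (int cp : cps) {
              System.out.printf("U+%04X identifier part? %b%n",
                                cp, Character.isUnicodeIdentifierPart(cp));
              // prints "false" for all three code points
          }
      }
  }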

The MINUS SIGN example illustrates another problem with the B approach. Once the
Universal Character Set is well-established, there is every reason to
suppose that formal language designers may wish to take advantage of the
rich operator character encoding in the math section (plus what is still
coming down the road) to create formal syntaxes that look and feel more
like real mathematics for their formal expression syntax. Why not have a
NOT EQUAL operator simply be represented in the language with U+2260
instead of "!=" or "<>" or some other arbitrary digraph coinage that we have
simply gotten used to because of the paucity of ASCII as a representational
medium for expression syntax? There is every expectation that *somebody*
will do this in formal language syntax eventually -- and it would be
more prudent now to constrain identifier syntax conventions so that they
do not impinge on the areas of the character encoding that are likely
to be mined in the future for operators and delimiters.

> >
> > [...] I don't see the need for prohibiting every possible
> > punctuation character or characters such as a smiley or a snow man,
> > even though I would probably not use them in an [identifier] myself. As
> > long as they don't conflict with the [rest of the] syntax, it makes no
> > difference [to the] parser.

Addressed above. Following the Unicode 3.0 guidelines, it is *easier* to
do the right thing for identifiers than it is to try to figure out how
to make the extensions the wrong way. Keep in mind that in the general
case this is not just a-zA-Z0-9 versus everything else. Most parsers for
an international market are going to have to allow Japanese in -- and
once you do that, the lexer already has to deal with a table of
properties, rather than simple range checks. The Unicode approach
allows the identifier lexing to be done effectively with a couple of
property checks (one for the initial character and one for each
succedent character), and can be even more straightforward to implement
than range checks.
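
As an illustrative sketch of a lexer routine of that shape, in Java
once more -- isUnicodeIdentifierStart and isUnicodeIdentifierPart are
that library's off-the-shelf renderings of the two property checks, and
the extra test for "_" again stands in for a language-specific
exception:

  // A sketch of identifier lexing with two property checks: one for the
  // initial code point, one for each succedent code point.
  public final class IdentifierLexer {

      // If an identifier begins at index 'start', returns the index just
      // past its end; otherwise returns 'start' unchanged.
      public static int scanIdentifier(String text, int start) {
          if (start >= text.length()) {
              return start;
          }
          int cp = text.codePointAt(start);
          if (cp != '_' && !Character.isUnicodeIdentifierStart(cp)) {
              return start;                      // no identifier starts here
          }
          int i = start + Character.charCount(cp);
          while (i < text.length()) {
              cp = text.codePointAt(i);
              if (!Character.isUnicodeIdentifierPart(cp)) {
                  break;
              }
              i += Character.charCount(cp);
          }
          return i;                              // end of the identifier
      }

      public static void main(String[] args) {
          String src = "価格2 = 価格1 + 10";     // Japanese identifiers
          int end = scanIdentifier(src, 0);
          System.out.println(src.substring(0, end));   // prints 価格2
      }
  }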

>
> In other words, programming languages have historically tended to allow
> anything in an identifier that wasn't used for some syntactic purpose;
> leading digits were forbidden to make lexers simpler. What specific
> reason is there not to treat all hitherto-unknown Unicode characters
> as legitimate in identifiers, in the manner of the Plan9 C compiler
> (which extends C to treat everything from U+00A0 on up as valid)?

It will make the world a better place. ;-)

--Ken

>
> I need this to help me write a draft standard, so I'm not asking out
> of randomness.
>


