Identifiers and tokens

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Mar 02 2000 - 15:55:39 EST


Dan Oscarsson noted:

>
> One important thing to remember is that there are several types
> of identifiers. For example, the ones used on variables and those used
> on operators.
> For variables it might be a good idea to restrict the characters allowed
> to "word like", while an operator could use nearly any type of character.
> When I define a new comparing operator, I do not want to call it
> "equals", I want to call it "=" (or "==" if you are a C programmer).
>
> When you leave the ASCII range there are many more good non-letters that
> can be used. (for example, I use the not sign "¬" instead of ! in some
> interfaces). So you have to allow many of the non-letters in some types
> of identifiers.
>

I just want to point out that Dan's use of the term "identifier" here
is a little different from the Unicode Standard's use of the term.
What Dan calls an "identifier" is what I would prefer to call a "token".

With the caveat that I am not an expert in compiler design or formal
language structures, my understanding of the basics here is as
follows:

A parser of program text following a particular formal language syntax
will first parse out the *tokens* of the text. To do so, it needs
a formal definition of whitespace, which marks the edges of tokens,
delimited tokens, or isolated delimiters. It also needs a formal
definition of delimiters (e.g., "(", ")", "{", "}", ";", etc. in C),
which are often baked into the formal syntax of complex constructs
in the language.
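
A minimal sketch of that first step, in Python, assuming a small
hypothetical delimiter set modeled on the C examples above (a real
language would define its own whitespace and delimiter sets, and would
also handle strings and comments, which are ignored here):

```python
# Hypothetical single-character delimiters, loosely following the C examples.
DELIMITERS = set("(){};,")

def split_tokens(text):
    """Split text into tokens at whitespace and at single-character delimiters."""
    tokens = []
    current = []
    for ch in text:
        if ch.isspace() or ch in DELIMITERS:
            # Whitespace or a delimiter marks the edge of the pending token.
            if current:
                tokens.append("".join(current))
                current = []
            # A delimiter is itself emitted as an isolated token.
            if ch in DELIMITERS:
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

print(split_tokens("if (x1 >= 10) { y = x1; }"))
```

Note that nothing at this stage decides whether "x1" or ">=" is an
identifier or an operator; that is left to the next step.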

Once a token is identified, it is then run through a lexer, which will
typically classify it as one of:

   a. an identifier
   b. a numeric constant
   c. an operator

The operators typically consist of a string of one or more characters
that do not fit the formal syntax for identifiers or numeric constants,
and comprise some fixed list for the formal syntax. Thus, for example,
C defines a list of assignment-operators as: =, *=, /=, %=, +=, -=,
<<=, >>=, &=, ^=, and |=.
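
The three-way sort can be sketched roughly as follows. The identifier
and numeric patterns are simplified ASCII-only assumptions for
illustration, not the grammar of any particular language; the operator
set is C's list of assignment operators.

```python
import re

# Simplified, ASCII-only patterns (assumptions for illustration only).
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?")

# The fixed list of C assignment operators.
ASSIGNMENT_OPERATORS = {
    "=", "*=", "/=", "%=", "+=", "-=", "<<=", ">>=", "&=", "^=", "|=",
}

def classify(token):
    """Sort a token into identifier, numeric constant, operator, or unknown."""
    if IDENTIFIER.fullmatch(token):
        return "identifier"
    if NUMBER.fullmatch(token):
        return "numeric constant"
    if token in ASSIGNMENT_OPERATORS:
        return "operator"
    return "unknown"

for tok in ["count", "42", "<<=", "%-"]:
    print(tok, "->", classify(tok))
```

Because the operators form a fixed list, anything that is neither an
identifier, a number, nor on that list can simply be rejected.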

Once an identifier has been identified, it has to be checked against
the *keywords* of the formal syntax. The keywords are constructs that
formally follow the identifier syntax, but which are reserved for
special purposes in the language. Some of these may function to
create complex expressions ("if", "else", "for", "while", etc., in C).
Some may themselves function as more delimiters ("BEGIN", "END" in
Pascal). Some may function as more operators ("AND", "OR" in Pascal).
And some may represent predefined numeric constants ("MAXINT", "PI", etc.).
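
As a rough sketch, the keyword check is a table lookup applied after a
token has already been recognized as an identifier. The table below is
a hypothetical, Pascal-flavored one that just follows the groupings in
the paragraph above; the role names and the case-insensitive lookup are
assumptions of this example.

```python
# Hypothetical keyword table: reserved word -> the role it plays.
KEYWORD_ROLES = {
    "IF": "control", "ELSE": "control", "FOR": "control", "WHILE": "control",
    "BEGIN": "delimiter", "END": "delimiter",
    "AND": "operator", "OR": "operator",
    "MAXINT": "constant",
}

def keyword_role(identifier):
    """Return the role of a reserved word, or None for a free identifier."""
    # Case-insensitive lookup, as in Pascal (an assumption of this sketch).
    return KEYWORD_ROLES.get(identifier.upper())

for name in ["begin", "and", "maxint", "total"]:
    print(name, "->", keyword_role(name) or "free identifier")
```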

What are left after you subtract away the keyword identifier tokens
are the free identifiers -- those defined by the end user of the
formal language. There may, in turn, be subclassification of the
free identifiers, depending on the nature of the formal language.
For example, in SQL, free identifiers starting with the character "@"
represent local variables. Free identifiers can have multiple
functions: sometimes they are variables, sometimes constants,
sometimes function names, and so on and so on. Again, depending on
the formal syntax, free identifiers might themselves be used to define
new operators -- a fairly common occurrence in object-oriented languages,
for example.

What the Unicode Standard's recommendation for "identifier" is
focussed on is "identifier" as described in this brief summary --
not "token". The limitations it suggests on which characters should
be included in identifiers are there in part to allow appropriate
room for expansion in formal language syntaxes for what can function
as a *delimiter* or as an easily lexable *operator*. This does not
constrain formal language designers from making keyword use of identifiers
to create more delimiters and operators. On the other hand, it does
make it possible to set aside more delimiting punctuation and more
functional operators as special characters to create more fully
expressive, complete, and elegant formal language syntaxes in the future.
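
One way to see the distinction in practice is Python's built-in
str.isidentifier(), which applies Python's own identifier rules. Those
rules are derived from the Unicode identifier properties (XID_Start and
XID_Continue), a later formulation than the one under discussion here,
so treat this as an illustration rather than as the Standard's exact
recommendation. The helper name partition_candidates is hypothetical.

```python
def partition_candidates(candidates):
    """Split strings into those that fit identifier syntax and those that do not."""
    identifiers = [s for s in candidates if s.isidentifier()]
    others = [s for s in candidates if not s.isidentifier()]
    return identifiers, others

# Word-like strings, including non-ASCII letters, versus operator-like strings.
identifiers, others = partition_candidates(
    ["equals", "größe", "변수", "=", "==", "¬", "x¬y"]
)
print("identifiers:", identifiers)
print("left for operators/delimiters:", others)
```

The word-like strings pass regardless of script, while "=", "==" and
"¬" fall outside identifier syntax and so remain available to a
language designer as operators or delimiters.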

--Ken


