Re: Rationale wanted for Unicode identifier rules

Date: Wed Mar 01 2000 - 15:05:29 EST

I got my Unicode 3.0 book this morning (thank you Amazon)... but it's at home so I can't refer to it.

The only thing I can think of off the top of my head is that we might want to prevent certain "equivalent" characters or compatibility characters from being used in identifiers. In other words, if you pass the source text through a normalization form, the identifiers should all still be legal.
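A minimal sketch of that check, assuming Python's standard `unicodedata` module; the helper name `is_stable` is hypothetical and not from any standard. It tests whether an identifier survives a given normalization form unchanged, using the compatibility ligature U+FB01 as the example.

```python
import unicodedata


def is_stable(identifier: str, form: str = "NFKC") -> bool:
    """Return True if normalizing the identifier leaves it unchanged.

    Hypothetical helper: a lexer could reject, or silently fold, any
    identifier for which this returns False.
    """
    return unicodedata.normalize(form, identifier) == identifier


# U+FB01 LATIN SMALL LIGATURE FI followed by "le"
LIGATURE_NAME = "\ufb01le"

print(is_stable("file"))                 # plain ASCII is unaffected
print(is_stable(LIGATURE_NAME, "NFC"))   # canonical forms keep the ligature
print(is_stable(LIGATURE_NAME, "NFKC"))  # compatibility forms fold it to "fi"
```

Whether such an identifier should be rejected or folded is a design choice this sketch leaves open.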

The point here is that the use of combining characters versus precomposed characters should not result in *separate* identifiers: if it looks the same on the screen, it should be the same to the compiler. This implies normalizing the text as a precondition to lexing. Depending on which normalization form you choose, punctuation and other characters could be normalized into illegal sequences... so not everything above U+00A0 is legal.
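Both halves of this argument can be illustrated with a short sketch, again assuming Python's `unicodedata` module. The function `identifier_key` is a hypothetical name for whatever a compiler would use as its symbol-table key. The first part shows a precomposed and a combining spelling collapsing to one key under NFC; the second shows a character above U+00A0 (U+00A8 DIAERESIS) turning into a space plus a combining mark under NFKC.

```python
import unicodedata


def identifier_key(name: str) -> str:
    """Hypothetical symbol-table key: the NFC form of the identifier."""
    return unicodedata.normalize("NFC", name)


precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = "cafe\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == combining)                                   # False
print(identifier_key(precomposed) == identifier_key(combining))   # True

# U+00A8 DIAERESIS has a compatibility decomposition, so NFKC rewrites it.
normalized = unicodedata.normalize("NFKC", "\u00a8")
print([f"U+{ord(ch):04X}" for ch in normalized])   # ['U+0020', 'U+0308']
```

The last line is the kind of "illegal sequence" meant above: an identifier containing U+00A8 would acquire an embedded space once the text is normalized.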



Addison P. Phillips
Senior Globalization Consultant
Global Sight Corporation
101 Metro Drive, Suite 750
San Jose, California 95110 USA
(+1) 408.350.3649 - Phone


Sent by: John Cowan <>
03/01/2000 10:46 AM

To: "Unicode List" <>
Subject: Rationale wanted for Unicode identifier rules

(Still waiting for my bookstore to get 3.0 book.)

Section 5.14 of 2.0 says:

# The formal syntax provided here is intended to capture the general
# intent that an identifier consists of a string of characters that starts
# with a letter or an ideograph, and then follows with any number of letters,
# ideographs, digits, or underscores.
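The general intent quoted above can be sketched as follows. This is only an approximation under stated assumptions: it treats any character whose Unicode general category starts with "L" as a "letter or ideograph" and category "Nd" as a "digit", and the function names are hypothetical. The formal syntax in the standard may admit further character classes that this sketch ignores.

```python
import unicodedata


def is_identifier_start(ch: str) -> bool:
    """Assumption: 'letter or ideograph' means general category L*."""
    return unicodedata.category(ch).startswith("L")


def is_identifier_part(ch: str) -> bool:
    """Letters, ideographs, decimal digits (Nd), or underscore."""
    return (
        ch == "_"
        or unicodedata.category(ch) == "Nd"
        or is_identifier_start(ch)
    )


def is_identifier(text: str) -> bool:
    """Check a whole string against the quoted general intent."""
    if not text:
        return False
    return is_identifier_start(text[0]) and all(
        is_identifier_part(ch) for ch in text[1:]
    )


for candidate in ["x1", "1x", "\u53d8\u91cf_2", "a+b", "\u2603"]:
    print(repr(candidate), is_identifier(candidate))
```

Under these assumptions the snowman (U+2603, category So) is rejected, which is exactly the restriction the question below asks about.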

Can anyone give me a rationale for rejecting the following argument:

> There are some [syntax] characters we know we need to prohibit [in
> identifiers, such as +, -, etc.], as well as a couple of ranges of
> control characters, but other than that I'm not sure why it's worth
> bothering.
> [...] I don't see the need for prohibiting every possible
> punctuation character or characters such as a smiley or a snow man,
> even though I would probably not use them in an [identifier] myself. As
> long as they don't conflict with the [rest of the] syntax, it makes no
> difference [to the] parser.

In other words, programming languages have historically tended to allow
anything in an identifier that wasn't used for some syntactic purpose;
leading digits were forbidden to make lexers simpler.  What specific
reason is there not to treat all hitherto-unknown Unicode characters
as legitimate in identifiers, in the manner of the Plan 9 C compiler
(which extends C to treat everything from U+00A0 on up as valid)?
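For comparison, here is a minimal sketch of the permissive rule as this message describes it: ASCII letters, digits, and underscore as in C, plus every code point from U+00A0 upward. The function name and constants are hypothetical, and the sketch follows the description in this message rather than the actual Plan 9 source, which has not been checked here.

```python
import string

ASCII_START = set(string.ascii_letters + "_")
ASCII_PART = ASCII_START | set(string.digits)
FIRST_EXTENDED = 0x00A0  # threshold taken from the description above


def is_permissive_identifier(text: str) -> bool:
    """Accept C-style ASCII identifiers plus anything at or above U+00A0."""
    if not text:
        return False

    def allowed(ch: str, ascii_set: set) -> bool:
        return ch in ascii_set or ord(ch) >= FIRST_EXTENDED

    return allowed(text[0], ASCII_START) and all(
        allowed(ch, ASCII_PART) for ch in text[1:]
    )


for candidate in ["snow\u2603man", "a+b", "1abc", "a\u00a0b"]:
    print(repr(candidate), is_permissive_identifier(candidate))
```

Note that under this rule "a", U+00A0 NO-BREAK SPACE, "b" counts as a single identifier, even though it looks like two words on screen.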

I need this to help me write a draft standard, so I'm not asking out
of randomness.


Schlingt dreifach einen Kreis vom dies! || John Cowan <>
Schliesst euer Aug vor heiliger Schau,  ||
Denn er genoss vom Honig-Tau,           ||
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)
