RE: Unicode in source code. WHY?

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 21 1999 - 13:04:49 EDT


Jony asked:

>
> We also know that those environments that do allow the use of Unicode are
> not all compatible. I see two main problems:
>
> 1. Should the full Unicode repertoire be allowed, or just a subset?

As John Cowan pointed out, Section 5.14 of the Unicode Standard, Version 2.0,
already provided some suggestions on this topic. That section has been
tightened up and simplified in the corresponding section of the Unicode
Standard, Version 3.0, forthcoming. In particular, the syntactic rule for
an identifier is simply:

<identifier> ::= <identifier_start> (<identifier_start>|<identifier_extend>)*

where the syntactic class of <identifier_start> can be deduced from the
Unicode Character Database category set: {Lu, Ll, Lt, Lm, Lo, Nl}; and
the syntactic class of <identifier_extend> can be deduced from the
Unicode Character Database category set: {Mn, Mc, Nd, Pc, Cf}.

This gives a very broad but reasonable definition of identifier in
the Unicode implementation context. It would allow most strings that
correspond to words in natural languages to be valid identifiers, with
the exception of punctuation such as '-', which tends to be used as an
operator or delimiter in formal language syntaxes.
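To make the rule concrete, here is a minimal sketch in Python (my own
illustration, not part of the standard; it uses the unicodedata module,
so the categories reflect whatever UCD version that Python build ships
with):

    import unicodedata

    # Category sets from the identifier guidance above.
    ID_START    = {'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl'}
    ID_CONTINUE = ID_START | {'Mn', 'Mc', 'Nd', 'Pc', 'Cf'}

    def is_identifier(s):
        # <identifier> ::= <identifier_start> (<identifier_start>|<identifier_extend>)*
        if not s:
            return False
        return (unicodedata.category(s[0]) in ID_START and
                all(unicodedata.category(c) in ID_CONTINUE for c in s[1:]))

With that definition a word such as 'données' is a valid identifier,
while 'foo-bar' is not, since '-' falls in category Pd.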

Individual formal language definitions can (and do) choose to constrain
their identifier syntax further, and in some instances allow additional
symbols in identifiers (e.g. '@'). That is up to those language definitions
and standards to define. The Unicode guidance is simply to provide the
language standardizers with a coherent starting point when they are faced
with what for them is the vast and confusing new world of Unicode character
semantics. Clearly, the hacks to date for dealing with identifier extensions
for DBCS are insufficient. Typically, they simply stated that identifiers
consist of letters, digits, syllables, and/or ideographs.
That was a comprehensible definition for Japanese encoded with JIS, perhaps,
but very hard to extend to all of Unicode.

Note, there is also an ISO TR on this subject: ISO/IEC TR 10176, Information
technology -- Guidelines for the preparation of programming language standards.
Annex A of that document is "Recommended extended repertoire for user-defined
identifier". It contains a specific subset of 10646-1 (i.e. Unicode) characters
recommended for use in identifiers. The Unicode Consortium has worked with
SC22/WG20 to get that annex amended so that it is a strict subset of the
set of characters recommended by the UTC for use in identifiers (i.e. that
shown above). That way there is a better chance for interoperability between,
e.g. Java applications (the JDK follows the UTC recommendations) and such
languages as C/C++ and Cobol, which in some cases already have normative
references to Annex A of TR 10176.

>
> 2. When are two identifiers to be considered equivalent?

My suggestion on this is that two identifiers should be considered
equivalent when their (canonically) normalized forms are binary equivalent.

For this purpose, the obvious normalized forms to use are either Form D
(decomposed) or Form C (composed), as specified in UTR #15. Since both
forms are now well-defined, there can be a definite, unambiguous answer
for the question of identifier equivalence.
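As an illustrative sketch only (Python's unicodedata.normalize exposes
the UTR #15 forms; the function name is mine):

    import unicodedata

    def identifiers_equivalent(a, b):
        # Equivalent iff the canonically normalized forms are binary equal.
        # Form C is used here; Form D would serve equally well.
        return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

    # e.g. precomposed U+00C5 vs. 'A' + U+030A (combining ring above):
    identifiers_equivalent('\u00C5', 'A\u030A')    # True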

A corollary of this, for formal language implementations that might not
wish to carry the burden of on-the-fly normalization, is that it may
be best to enforce normalized forms for all identifiers in program
text and at points where identifiers are exposed (e.g. for program
linking), so that simple binary comparisons provide the correct answers
for extended identifiers using Unicode. But this is an issue for
formal language standardizers to take up. It is really beyond the purview
of the Unicode Standard, per se.
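If a language chose to go that route, the enforcement step could be as
small as the following sketch (again only an illustration under that
assumption; the function name and error text are mine):

    import unicodedata

    def check_identifier_form(s):
        # Require identifiers to already be in Form C, so that later
        # comparisons (and linking) can rely on plain binary equality.
        if unicodedata.normalize('NFC', s) != s:
            raise ValueError('identifier is not in normalization Form C: %r' % s)
        return s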

>
> I suggest the Unicode consortium should address these issues and any others
> in a TR.

As I have noted, these issues are already addressed, at least in part,
by the standard itself and UTR #15. The first fundamental issue to have
decided was Normalization itself. Perhaps another UTR may be developed
to elaborate on the implications for formal language syntax -- bearing in
mind the already existing recommendations of TR 10176.

--Ken Whistler

>
> Jony
>


