Re: Unicode in source code. WHY?

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 21 1999 - 17:06:58 EDT


Adam responded:

>
> On Wed, Jul 21, 1999 at 10:08:14AM -0700, Addison Phillips wrote:
> > Clearly text editors (which means programming environments) should support
> > all of Unicode.
> >
> > Identifiers are a thornier issue. Combining marks versus precomposed clearly
> > presents a problem in this area.
>
> Why is it a problem? As long as each identifier is represented by the same
> sequence of bytes every time it is used, why should a compiler care whether
> combining marks or precomposed characters were used? For all the compiler
> needs to know, it is just a unique sequence of bytes.
>

The problem comes from canonical equivalences. If you do not take them
into account, you could have two pieces of program text that, from
a user's point of view, ought both to be valid, yet one would match
an identifier and compile correctly while the other would fail to
match and cause compilation errors. This sort of problem should not
be shoved off onto users with the suggestion that they inspect their
program text in a hex editor to find the differences between
canonically equivalent sequences that otherwise appear the same.
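
To make the mismatch concrete, here is a minimal sketch in Python. The
identifier "café" and the helper names same_bytes and same_identifier
are illustrative assumptions only; the sketch uses the standard
unicodedata module and is not a description of how any particular
compiler behaves.

```python
import unicodedata

# Two spellings of the "same" identifier, indistinguishable on screen.
composed = "caf\u00e9"      # precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

def same_bytes(a, b):
    """Compare identifiers as raw UTF-8 byte sequences."""
    return a.encode("utf-8") == b.encode("utf-8")

def same_identifier(a, b, form="NFC"):
    """Compare identifiers after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

raw_match = same_bytes(composed, decomposed)
normalized_match = same_identifier(composed, decomposed)
```

A byte-wise comparison treats the two spellings as different names,
while a comparison after normalization treats them as one.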

The problem is as bad or worse for linking. An identifier stored in
decomposed form won't link with a module expecting a composed form
of the "same" character, unless the linker is cognizant of canonical
equivalences (and has been instructed to treat them as identities).
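
The linking case can be sketched the same way. The SymbolTable class
below is a hypothetical toy, not the interface of any real linker, and
the symbol name and address are made up; it only shows that a lookup
keyed on raw strings misses a canonically equivalent name, whereas a
lookup keyed on a normalized form (NFC is assumed here) finds it.

```python
import unicodedata

class SymbolTable:
    """Toy symbol table; optionally folds canonical equivalents together."""

    def __init__(self, normalize=True):
        self._normalize = normalize
        self._symbols = {}

    def _key(self, name):
        # With normalization on, every name is stored and looked up in NFC.
        return unicodedata.normalize("NFC", name) if self._normalize else name

    def define(self, name, address):
        self._symbols[self._key(name)] = address

    def resolve(self, name):
        # Returns None when the symbol is unresolved.
        return self._symbols.get(self._key(name))

exported = "r\u00e9sultat"    # module A exports the composed form
imported = "re\u0301sultat"   # module B refers to the decomposed form

naive = SymbolTable(normalize=False)
naive.define(exported, 0x1000)
naive_result = naive.resolve(imported)

aware = SymbolTable(normalize=True)
aware.define(exported, 0x1000)
aware_result = aware.resolve(imported)
```

In the naive table the reference from module B stays unresolved; in
the normalizing table it resolves to the address module A defined.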

Canonical equivalence and normalization must be taken into account.
Otherwise, you are just asking for massive confusion in tools, or the
formal language standardization groups will take the issue into their
own hands and simply mandate that no combining marks can be used (or
variations on that theme). This is, in fact, the direction the
traditional languages (C/C++) community is headed -- which, for better
or worse, will lead to some fairly arbitrary limitations on which
characters can be used in identifiers.
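
As a rough illustration of what a "no combining marks" rule might look
like, and why it limits some scripts more than others, consider this
sketch. The function names are hypothetical, and the rule shown
(reject any character whose general category starts with "M") is an
assumption for illustration, not the text of any language standard.

```python
import unicodedata

def has_combining_marks(identifier):
    """True if any character has general category Mn, Mc, or Me."""
    return any(unicodedata.category(ch).startswith("M") for ch in identifier)

def accept_identifier(identifier):
    """Hypothetical rule: accept only identifiers without combining marks."""
    return not has_combining_marks(identifier)

precomposed_ok = accept_identifier("caf\u00e9")    # single precomposed code point
decomposed_ok = accept_identifier("cafe\u0301")    # contains U+0301 (Mn)
devanagari_ok = accept_identifier("\u0915\u093f")  # KA + VOWEL SIGN I (Mc)
```

Under such a rule the precomposed Latin spelling is accepted and the
decomposed one is rejected, but so is an ordinary Devanagari syllable,
because its vowel sign is a mark.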

--Ken


