Re: Unicode block for programming related symbols and codepoints?

From: Ken Whistler
Date: Mon, 09 Feb 2015 10:17:09 -0800

I think this discussion is confusing the need for separate syntactic functions
in formal language definitions with the need for *encoding* of characters.

The distinction between assignment and test for equality has been around for
decades in formal languages, and of course it is almost always carefully
distinguished in the formal syntax:

C, C++ and kindred

Use "=" for assignment.
Use "==" for equivalence operator.

Pascal and kindred

Use ":=" for assignment.
Use "=" for equivalence operator.


Lisp and kindred

Assignment: let (a 6)
Equivalence evaluation: (= a 6)

And so on. The fact that these formal languages do not use a *single*
character for each of these syntactic functions is not a formal defect --
there are many, many concepts in formal languages which are defined using
sequences of characters, rather than a single character. As has already
been alluded to in this thread, trying to stack all functionality into
character definitions heads back in the direction of relatively illegible
APL program text. It might have its place, but isn't much of a choice for
widely used general programming languages.

There are two basic issues with using sequences of (typically ASCII)
characters for fundamental operators:

1. It marginally complicates parsing.
2. If chosen badly, they can confuse programmers using the syntax.

#1 is basically trivial, as long as the formal syntax passes the bar of not
introducing syntactic ambiguity.
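
To illustrate how little a two-character operator costs a parser, here is a
minimal sketch of a longest-match tokenizer for a toy C-like syntax. The token
names and the `tokenize` helper are assumptions made up for this example, not
part of any real compiler; the only trick is trying "==" before "=".

```python
import re

# Hypothetical token names for a toy C-like syntax. Python tries regex
# alternatives in order, so "==" is listed before "=" to get longest match.
TOKEN_SPEC = [
    ("EQ", r"=="),
    ("ASSIGN", r"="),
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return (kind, lexeme) pairs; raise if no rule matches a character."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace is matched but not emitted
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("a = 6"))
print(tokenize("a == 6"))
```

The two inputs differ by one character and yield different operator tokens
(ASSIGN versus EQ), with no ambiguity for the parser to resolve.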

#2 is the *real* problem, imo. The use in C of "=" and "==" was badly chosen
from the start, and is the source of bezillions of inadvertent programming
errors in practice.

But if a left arrow, for example, might be a better choice for an assignment
operator in a programming language, and a two-character ASCII operator
like ":=" or "<-" doesn't seem appropriate or causes other confusion, there
still isn't a character *encoding* issue here. Just use "←", which already
exists (U+2190), and is a fine left arrow!
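
As a rough sketch of what "just use it" looks like in practice, the following
looks up U+2190 with Python's standard `unicodedata` module and then treats it
as the assignment operator of a hypothetical toy language. The
`parse_assignment` helper and its syntax are invented for illustration only.

```python
import unicodedata

ARROW = "\u2190"  # the left arrow already encoded in the standard


def describe(ch):
    """Format a character's code point, name and general category."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} (category {unicodedata.category(ch)})"


def parse_assignment(line):
    """Split 'target <arrow> value' for a made-up language that assigns with U+2190."""
    target, sep, value = line.partition(ARROW)
    if not sep:
        raise SyntaxError("expected an assignment using U+2190")
    return target.strip(), value.strip()


print(describe(ARROW))            # U+2190 LEFTWARDS ARROW (category Sm)
print(parse_assignment("a ← 6"))  # ('a', '6')
```

Defining what the arrow *means* is entirely the language designer's job here;
nothing about the character itself has to change.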

What is *not* appropriate for Unicode consideration here is trying to
encode programming *functions* per se. That turns the problem on
its head really. There are lots and lots of symbols already defined in
the standard: it is the job of formal language designers to simply pick
from them and *define* their formal functions in their language design.

Just because the UTC occasionally invents new control functions and
encodes them in characters -- as for the bidirectional algorithm -- does
not mean that every new function conceived for a programming language
is automatically a character encoding problem. Coming to the UTC
looking to encode a "new functional character" on spec should be
a matter of *last* resort -- not a first resort. It requires a carefully
built case demonstrating a real use and showing that alternative
approaches using existing characters do not (and cannot) work.


P.S. Arrow symbols like U+2190 have been in the Unicode Standard
since Unicode 1.0 in 1991. They are far, far more widely supported nowadays
than any new, language-specific functional symbol addition would be.
Even if the UTC agreed to such character additions at the next meeting
in May, its earliest opportunity for publication would be Unicode 10
in June 2017. That amounts to a 26-year impedance mismatch for
implementations. Why would a designer of a new formal language
syntax want to buy into that kind of grief for character availability,
when there are hundreds of symbols in the standard to choose from
that have been encoded for decades now?

On 2/9/2015 8:41 AM, Andre Schappo wrote:
> Let me take as an example the use of = in programming. The = is used
> for test of equality and assignment in various programming languages.
> The equality and assignment operations should have different
> characters. e.g.
> Initially the glyphs used for these characters could be = but then
> this mechanism can be used to transition to a new and less ambiguous
> visual representation. The new visual representation could be
> something like
> Such a visual and character distinction between the 2 functions must
> surely make it easier for those learning to program and for
> interpreter and compiler writers. I think it would also make for
> easier to read/understand program code.
> André

Unicode mailing list
Received on Mon Feb 09 2015 - 12:18:17 CST
