Re: Unicode in source code. WHY?

From: John Cowan (cowan@locke.ccil.org)
Date: Wed Jul 21 1999 - 17:03:01 EDT


G. Adam Stanislav wrote:

> Why is it a problem? As long as each identifier is represented by the same
> sequence of bytes every time it is used,

But that's the tricky part. If we have as an identifier the word
"hétérogénéité", there are no fewer than 32 ways to express it
in Unicode: each of the five é's can be written either as the
precomposed character U+00E9 or as "e" followed by the combining
acute accent U+0301, giving 2^5 = 32 distinct code-point sequences.
A naive compiler, comparing code points one for one, will reject
31 of them; the only valid one is, of course, the one used at the
point of declaration.
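To make the combinatorics concrete, here is a small sketch in Python
(not the language under discussion; the helper spell() is purely
illustrative) that enumerates all 32 spellings and shows that they
collapse to a single form under NFC normalization:

    import unicodedata
    from itertools import product

    PRECOMPOSED = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    DECOMPOSED  = "e\u0301"   # "e" + U+0301 COMBINING ACUTE ACCENT

    # The fixed pieces of "hétérogénéité" between its five é's.
    PIECES = ["h", "t", "rog", "n", "it", ""]

    def spell(choices):
        # Assemble the word, picking a spelling for each of the five é's.
        word = PIECES[0]
        for e, piece in zip(choices, PIECES[1:]):
            word += e + piece
        return word

    variants = {spell(c) for c in product([PRECOMPOSED, DECOMPOSED], repeat=5)}
    print(len(variants))    # 32 distinct code-point sequences

    # All 32 are canonically equivalent: they normalize to one NFC form.
    print(len({unicodedata.normalize("NFC", v) for v in variants}))    # 1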

The better approach is either to allow only prenormalized forms,
or to have the compiler do its own normalization. Both require a
table of modest size; the former method means the normalization
work has to be done only once, outside the compiler, with the
compiler's sole responsibility that of rejecting unnormalized forms.
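A minimal sketch of the two strategies, again in Python, assuming
NFC as the normalization target (the function names are illustrative,
and unicodedata.is_normalized needs Python 3.8 or later):

    import unicodedata

    def intern_normalizing(ident):
        # Second approach: the compiler normalizes every identifier
        # itself, so all canonically equivalent spellings collapse to
        # the same symbol-table key.
        return unicodedata.normalize("NFC", ident)

    def intern_strict(ident):
        # First approach: accept only prenormalized (NFC) identifiers,
        # pushing the work onto the author's editor; the compiler's
        # only job is to reject anything unnormalized.
        if not unicodedata.is_normalized("NFC", ident):
            raise SyntaxError("identifier %r is not in NFC" % ident)
        return ident

Either way the lookup table is the same; the difference is only in
who pays for the normalization pass, the tool chain or the compiler.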

> I happen to use my console in ISO-8859-2 mode.

This is not a parallel case, for 8859-2 has no combining marks
and thus has only one way of representing any given identifier.

-- 
	John Cowan	http://www.ccil.org/~cowan	cowan@ccil.org
Weave a circle round him thrice, / And close your eyes with holy dread,
For he on honey-dew hath fed / And drunk the milk of Paradise.
			-- Coleridge / Politzer


