Re: Unicode in source code. WHY?

From: G. Adam Stanislav (adam@whizkidtech.net)
Date: Wed Jul 21 1999 - 22:34:30 EDT


On Wed, Jul 21, 1999 at 02:26:35PM -0700, Kenneth Whistler wrote:
> Treating all Unicode characters above U+007F as just unanalyzed byte values
> acceptable for being part of identifiers, regardless of their properties
> otherwise, is an easy kludge for extending an 8-bit parser to handle
> UTF-8 in a definable way, but it ends up with all kinds of anomalies,
> where quotes aren't quotes, brackets aren't brackets, spaces aren't
> spaces, and equivalences aren't equivalent.

Yes, it is a kludge. I did not mean it as the final solution, but as a step
compiler writers could take now so they can support Unicode to at least
some extent until they develop better ways (or someone develops a
general Unicode preprocessor they all can use).
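
To make the anomaly concrete, here is a minimal Python sketch of such a
scanner. It is an illustration only, not any real compiler's code; the names
is_ident_byte and scan_identifier are hypothetical. Every byte of 0x80 or
above is accepted as part of an identifier without being examined, so a
NO-BREAK SPACE (U+00A0, bytes C2 A0 in UTF-8) is swallowed into the name
instead of acting as a space.

```python
def is_ident_byte(b: int, first: bool) -> bool:
    """Decide whether byte b may appear in an identifier."""
    if b >= 0x80:
        # The kludge: any byte of a multi-byte UTF-8 sequence is accepted
        # unexamined, whatever character it belongs to.
        return True
    c = chr(b)
    if c == "_" or c.isalpha():
        return True
    # ASCII digits are allowed anywhere except the first position.
    return c.isdigit() and not first


def scan_identifier(src: bytes, pos: int = 0) -> bytes:
    """Return the longest identifier starting at pos (possibly empty)."""
    end = pos
    while end < len(src) and is_ident_byte(src[end], first=(end == pos)):
        end += 1
    return src[pos:end]


# A letter with a diacritic works as hoped ...
ok = scan_identifier("na\u00efve = 1".encode("utf-8"))
# ... but a no-break space between "a" and "b" is absorbed as well.
odd = scan_identifier("a\u00a0b = 1".encode("utf-8"))
```

Here ok holds the UTF-8 bytes of the five-letter name, while odd holds the
bytes of "a", the no-break space, and "b" as a single identifier.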

> > > 2. When are two identifiers to be considered equivalent?
> >
> > When they consist of the same sequence of bytes. That would certainly work
> > for case-sensitive languages. Case-insensitive languages would need to
> > rely on the operating system to make the comparison.
>
> You will, I presume, get to my response next on this thread. Comparing
> just the same sequence of bytes is not acceptable for Unicode. You
> need to specify comparison on a normalized form for this to make
> sense.

Yes, I received your response (and those of others), and now I understand
your concern.
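
The concern can be shown in a few lines. The sketch below uses Python's
standard unicodedata module purely as a convenient stand-in for the Unicode
data tables; the variable names are illustrative. The same visible
identifier is spelled once with a precomposed letter and once with a base
letter plus a combining accent.

```python
import unicodedata

composed = "caf\u00e9"      # e with acute as one code point (U+00E9)
decomposed = "cafe\u0301"   # plain e followed by COMBINING ACUTE ACCENT

# Raw UTF-8 byte comparison treats them as two different identifiers.
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# Comparison after normalizing both to the decomposed form (NFD) does not.
same_normalized = (
    unicodedata.normalize("NFD", composed)
    == unicodedata.normalize("NFD", decomposed)
)
```

same_bytes is False and same_normalized is True, which is why a plain byte
comparison is not enough.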

> And language tools that need to do case-folding should *not* be relying
> on the operating system to make the comparison. If they do that, they
> are depending on varying levels of implementation of case-mappings,
> which may or may not be locale-specific or limited to particular
> character encoding repertoires.

I am certainly hoping that OS's will eventually support true Unicode
case-insensitive comparisons. I am very frustrated by the current
C standard or lack thereof.

> Any tool worth its salt that claims
> to be supporting Unicode will bind in a full set of case-mapping tables
> based on the data tables provided by the Unicode Consortium -- with or
> without extensions for special language behavior. That is the way the
> tool can guarantee predictable cross-platform behavior for case-folding.

I think this should still be done on the system level. Why have the same
table in every program, when one copy can do? Besides, as the tables evolve,
it is much easier to change one system call than every single program.

I have given some more thought to how this could be accomplished. It seems to
me the simplest way would be to have a library of routines that decompose
any Unicode string (rather than combine any decomposed characters into
a single one), and optionally convert it to lower (or upper) case. While
this would make the strings longer, it could be done fast.

Compilers could call this library routine to assure any identifiers meant
to be the same are the same.
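
A rough sketch of what such a routine might look like, written in Python
for brevity rather than in C. The function name canonical_identifier and
its fold_case parameter are hypothetical, and the standard unicodedata
module stands in for the tables the library would carry. Simple lowercasing
is used here as an assumption; a real tool would have to decide which case
mapping it wants.

```python
import unicodedata


def canonical_identifier(name: str, fold_case: bool = False) -> str:
    """Decompose an identifier and optionally convert it to lower case."""
    # Canonical decomposition: precomposed letters become base + combining marks.
    result = unicodedata.normalize("NFD", name)
    if fold_case:
        # Case conversion can yield precomposed characters again for some
        # inputs, so the result is decomposed once more to be safe.
        result = unicodedata.normalize("NFD", result.lower())
    return result


# Two spellings of the same name give the same canonical string.
a = canonical_identifier("\u00c9t\u00e9", fold_case=True)     # precomposed
b = canonical_identifier("E\u0301te\u0301", fold_case=True)   # decomposed
```

Both a and b come out as the five code points e, U+0301, t, e, U+0301. The
precomposed input had only three code points, which is the growth in length
mentioned above.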

Linkers should not need to. They should just be able to accept any UTF-8
string as an identifier name. As long as compilers (and assemblers) can
produce the same byte sequence for the same identifiers, linkers should
not need to worry about it. Of course, compilers for case-insensitive
languages should convert the identifiers to either all upper case or all
lower case, to make it easier on linkers.
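
The division of labor might be sketched like this. Everything here is
hypothetical: symbol_bytes plays the compiler's part and the Linker class is
a toy symbol table, not a model of any real linker. The point is only that
the linker never looks inside the bytes.

```python
import unicodedata


def symbol_bytes(name: str, case_insensitive: bool = False) -> bytes:
    """Compiler side: turn an identifier into its canonical UTF-8 bytes."""
    text = unicodedata.normalize("NFD", name)
    if case_insensitive:
        # A case-insensitive language converts everything to upper case here.
        text = unicodedata.normalize("NFD", text.upper())
    return text.encode("utf-8")


class Linker:
    """Linker side: symbols are opaque byte strings compared byte for byte."""

    def __init__(self):
        self.symbols = {}

    def define(self, symbol: bytes, address: int) -> None:
        if symbol in self.symbols:
            raise ValueError("duplicate symbol")
        self.symbols[symbol] = address

    def resolve(self, symbol: bytes) -> int:
        return self.symbols[symbol]


linker = Linker()
# One object file defines the name with a precomposed letter ...
linker.define(symbol_bytes("\u00e9tat"), 0x1000)
# ... another refers to it with a combining accent; the bytes still match.
address = linker.resolve(symbol_bytes("e\u0301tat"))
```

Because both compilers produced the same byte sequence, the lookup succeeds
and address is 0x1000, with no Unicode knowledge in the linker at all.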

I think this is quite doable. If no such library exists, I am willing to
work on it, that is once this heat wave stops. Right now I can't even
think! Anyway, such a library would complement my UTF-8 library well.
The UTF-8 library converts between Unicode (16- or 32-bit) and UTF-8 in
either direction. The routine I'd need to write could then work on
individual 32-bit values, just "unpacking" them and optionally converting
them to upper or lower case.
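
As an illustration of working on individual 32-bit values, here is a
Python sketch of the two steps. It is not the UTF-8 library mentioned
above; the names unpack and encode_utf8 are hypothetical, the standard
unicodedata module supplies the decomposition data, and the encoder assumes
the present-day code point range of U+0000 to U+10FFFF.

```python
import unicodedata


def unpack(code_point: int, lower: bool = False) -> list:
    """Decompose one code point, optionally lowercasing the result."""
    text = unicodedata.normalize("NFD", chr(code_point))
    if lower:
        text = unicodedata.normalize("NFD", text.lower())
    return [ord(ch) for ch in text]


def encode_utf8(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# U+00C9 (capital E with acute), unpacked and lowercased, then re-encoded.
parts = unpack(0x00C9, lower=True)
encoded = b"".join(encode_utf8(cp) for cp in parts)
```

parts is the pair 0x65, 0x301, and encoded is the three-byte sequence
65 CC 81, one byte longer than the two bytes of the precomposed letter.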

I can do it as a "general" library in C usable by Unix or anything else,
and also write a Windows DLL in assembly language doing the same. That
should cover just about any OS in existence, giving compiler writers
the same API regardless of the OS they use. And I'm willing to do it for
free, with full source code included.

Adam


