RE: Unicode in source code. WHY?

From: Addison Phillips (AddisonP@simultrans.com)
Date: Wed Jul 21 1999 - 17:10:06 EDT


Hi Adam,

Thanks for your comments.

I think the problem arises because some text processors will generate the
combined (precomposed) form and some will generate the decomposed form,
and the two may alternate within the same piece of code (say, for
example, between the library you're referencing and the code you write
yourself). The parser would need to recognize that these are, in fact,
the SAME identifier, even though they are not the same sequence of bytes.
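
To make that concrete, here is a minimal C sketch (the identifier and the
byte values are my own illustration, assuming UTF-8 encoding) showing
that the two spellings of one name are different byte sequences:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* A hypothetical identifier "café" spelled two canonically
           equivalent ways, as raw UTF-8 bytes: */
        const char *precomposed = "caf\xC3\xA9";   /* "caf" + U+00E9 (é) */
        const char *decomposed  = "cafe\xCC\x81";  /* "cafe" + U+0301
                                                      (combining acute) */

        /* A tool comparing raw bytes sees two different names. */
        printf("byte-equal: %s\n",
               strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
        /* Prints "no" -- the parser would need canonical normalization
           (e.g. Unicode NFC) before comparing in order to see these as
           the SAME identifier. */
        return 0;
    }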

I understand the code page trick that you're using. I agree that we
can work around the problem, but....

What I'm arguing for is this: tool developers need to recognize the need
to input, process, and display Unicode characters properly (*NOT* by
converting to a code page to do it, either, because that is a
script-dependent solution that is bound to cause problems. I program in
Japanese all the time... but I also want to work with Chinese and Korean
and German and... and I don't want to keep switching my code page/locale
or resorting to tomfoolery to do it). I want the tool to use the UCS
properly so I can concentrate on important things--like why this fool
programmer is strtoking for "\"...

Double-byte languages are a real pain right now. An editor that
understands Unicode (or DBCS character sets) doesn't screw up my code
and lets me edit in place. That's what I need/want/desire (not an editor
that works by accident most of the time, but not always).

thanks,

Addison

-----Original Message-----
From: G. Adam Stanislav [mailto:adam@whizkidtech.net]
Sent: Wednesday, July 21, 1999 13:34
To: Addison Phillips
Cc: Unicode List; mohrin@sharmahd.com
Subject: Re: Unicode in source code. WHY?

On Wed, Jul 21, 1999 at 10:08:14AM -0700, Addison Phillips wrote:
> Clearly text editors (which means programming environments) should support
> all of Unicode.
>
> Identifiers are a thornier issue. Combining marks versus precomposed
> clearly presents a problem in this area.

Why is it a problem? As long as each identifier is represented by the same
sequence of bytes every time it is used, why should a compiler care whether
combining marks or precomposed characters were used? For all the compiler
needs to know, it is just a unique sequence of bytes.
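
For instance, here is a rough sketch of the kind of lookup I mean (the
struct and function names are invented for illustration); the compiler
just matches bytes and never interprets them:

    #include <stddef.h>
    #include <string.h>

    /* One symbol-table entry; the name is an opaque byte string. */
    struct symbol {
        const char    *name;
        size_t         len;
        struct symbol *next;
    };

    /* Exact byte-for-byte lookup: it cannot tell combining marks
       from precomposed forms, and it does not need to -- provided
       every use of a name is the identical byte sequence. */
    struct symbol *lookup(struct symbol *table,
                          const char *name, size_t len)
    {
        struct symbol *s;
        for (s = table; s != NULL; s = s->next)
            if (s->len == len && memcmp(s->name, name, len) == 0)
                return s;
        return NULL;
    }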

It is also not necessary for all text editors to support all of Unicode.
I program with editors that run under a FreeBSD console. By the design of
the underlying hardware (the VGA), they are restricted to a maximum of
256 characters.

I happen to use my console in ISO-8859-2 mode. The editor does not know
that. When I type a Central European character on my keyboard, the VGA
displays it in the editor properly, even though the editor has no idea
what charset I am using. I can easily convert the file into Unicode,
or UTF-8, and back. It would be *nice* if the editor could support all
of Unicode, but the editor is fully useful for my programming needs
as is. The editor is doing the best it can given the limitations of the
environment it is running under.
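
For concreteness, here is a minimal sketch of one direction of that
round trip, assuming the POSIX iconv(3) interface is available (the
byte values assume ISO-8859-2; converting back just swaps the codeset
names):

    #include <iconv.h>
    #include <stdio.h>

    int main(void)
    {
        /* The letter c-caron is the single byte 0xE8 in ISO-8859-2. */
        char    in[]   = "\xE8";
        char    out[8];
        char   *inp    = in,  *outp    = out;
        size_t  inleft = 1,    outleft = sizeof out;

        iconv_t cd = iconv_open("UTF-8", "ISO-8859-2");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        iconv_close(cd);

        /* out now holds 0xC4 0x8D, the UTF-8 form of U+010D. */
        printf("%d UTF-8 bytes\n", (int)(outp - out));
        return 0;
    }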

Adam
