Re: Unicode in source

From: G. Adam Stanislav (adam@whizkidtech.net)
Date: Thu Jul 22 1999 - 16:32:50 EDT


On Thu, Jul 22, 1999 at 09:29:14AM -0700, Addison Phillips wrote:
> Actually, I am aware of the advantages for using UTF-8 in Internet
> transmissions: it's inherently (*explicitly*) non-endian and well-suited to
> byte stream applications (the encoding allows recovery with only one lost
> character in the event of transmission error).
>
> But for usage internal to programs and for text files stored on disk (e.g.
> the default encoding for most applications, although not, admittedly, for
> Web presentation), UTF-16 *is* cleaner. To wit:

Internally a program will presumably decode UTF-8 into whatever format it
uses. As for being stored on disk, what if the disk is on a LAN consisting
of PCs and Macs? Should the text be stored in little-endian or big-endian order?

What if two (or more) programmers work on the same code, each using a different
system?

What if the same programmer takes some code home and uses a different
system there than at work?

No problem with UTF-8, lots of problems otherwise.
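
To make the byte-order point concrete, here is a small illustration of my
own (not part of anyone's compiler): the same two characters, U+0041 and
U+00E9, can legitimately appear in either of two byte orders in a UTF-16
file, but have exactly one UTF-8 form no matter which machine wrote them.

    #include <stdio.h>

    int main(void)
    {
        /* U+0041 (A) followed by U+00E9 (e with acute), as raw file bytes */
        unsigned char le[] = { 0x41, 0x00, 0xE9, 0x00 }; /* UTF-16, little-endian  */
        unsigned char be[] = { 0x00, 0x41, 0x00, 0xE9 }; /* UTF-16, big-endian     */
        unsigned char u8[] = { 0x41, 0xC3, 0xA9 };       /* UTF-8, same everywhere */

        printf("UTF-16LE: %02X %02X %02X %02X\n", le[0], le[1], le[2], le[3]);
        printf("UTF-16BE: %02X %02X %02X %02X\n", be[0], be[1], be[2], be[3]);
        printf("UTF-8:    %02X %02X %02X\n", u8[0], u8[1], u8[2]);
        return 0;
    }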

Besides, plain 16-bit characters (UCS-2) can only represent the first plane;
UTF-16 has to resort to surrogate pairs for anything beyond it. Even though,
strictly speaking, Unicode is 16-bit, the ISO standard, ISO/IEC 10646,
defines a 31-bit code space (UCS-4, stored as 32-bit values).

Most importantly (to me, anyway), there are many millions of lines of code
already written in 7-bit ASCII. A UTF-8 compiler can still compile them as
they are. A 16-bit (or a 32-bit) compiler cannot, unless it has two input
processors.
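
The reason is simply that every 7-bit ASCII byte is, by itself, a complete
and valid UTF-8 sequence. A trivial check (my sketch, not anyone's actual
tool) makes the point:

    #include <stdio.h>

    /* Returns 1 if every byte of the stream is 7-bit ASCII; such a file is
       byte-for-byte valid UTF-8 and needs no conversion at all. */
    static int is_plain_ascii(FILE *f)
    {
        int c;
        while ((c = fgetc(f)) != EOF)
            if (c > 0x7F)
                return 0;
        return 1;
    }

    int main(void)
    {
        puts(is_plain_ascii(stdin)
             ? "7-bit ASCII: already valid UTF-8"
             : "contains bytes above 0x7F");
        return 0;
    }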

> o The characters are all 16-bits in the BMP,

At this time they are. In the future they may not be. What if you want to
write a program that uses Ugaritic text for its output? AFAIK, that is not
in the BMP.

> o There is less text expansion for non-Latin languages.

Yes, but with a well-written expansion library (such as the one I have been
proposing) the expansion happens fast and is completely transparent to the
compiler writer.

> o There are programmatic ways of handling Unicode text via TCHAR that
> reduces the impact on code. If you don't unthread UTF-8 to UTF-16, text
> processing becomes somewhat uglier.

Again, that can be made completely transparent. More importantly, the wide
character type is of different sizes in different OSes. For example, under
Windows 95+/NT, TCHAR is 16 bits wide. Under FreeBSD (and probably other
Unices) the corresponding type, wchar_t, is 32 bits wide.
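
A tiny test program makes the difference visible on whichever system
compiles it; note that TCHAR itself is a Windows-only typedef, so the
portable name used here is wchar_t:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Typically 2 bytes (16 bits) under Windows and 4 bytes (32 bits)
           under FreeBSD and most other Unices. */
        printf("sizeof(wchar_t) = %u bytes\n", (unsigned)sizeof(wchar_t));
        return 0;
    }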

Take for example the computer I am writing this message on: an old
Pentium 100 with two OSes installed. Sometimes I boot into Windows 95,
sometimes into FreeBSD. I store most of my source code on the Windows
partition of the hard disk, because FreeBSD can read and write files
on the Windows file system but Windows cannot do the same with the
BSD file system.

I edit the exact same files under both systems. The only incompatibility
is that Windows inserts carriage returns before line feeds and Unix
does not. But editors on both systems can handle this minor quirk.
So can compilers.

Suppose I wanted to use unencoded Unicode. I would constantly have
to convert each file between the 16-bit Windows format and the 32-bit
Unix format. That would be a major annoyance. I would also most likely
end up with a corrupt file the moment I accidentally opened it in one
format under the other OS and wrote to it. I have no doubt that would
happen sooner or later, no matter how careful I was.

> o For languages other than Western European, the text expansion is much
> smaller than for UTF-8, so storage is conserved.

Yes, it is. I consider engineering to be the art of compromises. It
does not have to be 100% perfect for me to accept it. As long as the
advantages outweigh the disadvantages, I can live with it.

> If you are going to write a text processing application, why would you make
> UTF-8 the default internally, when UTF-16 is so much easier to code?

I would convert the UTF-8 file to 32-bit values internally. I would make
UTF-8 the default for the file only. Files are external to the application
(in this case a compiler), and have nothing to do with the internals of the
application. Files need to be optimized for portability among various
systems, not for the internal workings of the compiler or assembler.
If I wanted to optimize files to make it easy on the compiler, I would
have to write the code directly in machine language.

> Programmer's editors are, of course, text processing applications, and while
> they need to handle UTF-8 (read and write), internally UTF-16 is going to be
> much cleaner. Are you arguing to use UTF-8 because it makes the lexical
> analyser you've already written able to sorta-kinda process Unicode?

It has nothing to do with my lexical analyzer, nor is it sorta-kinda for
my lexical analyzer. (Given the nature of Graphic Counter Language, the
solution I am using - accepting any 8-bit byte with the high bit set - is
perfect: the language does not contain any user-defined identifiers. It
allows a kind of natural-language processing by simply ignoring any token
that is not a keyword, an integer, or an operator, and by having
predefined macros for most operators. So all I have to do for Unicode
support is to *ignore* any token containing bytes with the high bit set. E.g.:

        Computer, kindly make the count equal to 1000. Thank you.

is identical to:

        count = 1000

So my simple solution works very well for this particular language.)
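
For what it is worth, a rough sketch of that filtering rule follows. The
token structure and the three predicates are placeholders of my own, not
the actual GCL sources:

    #include <stddef.h>

    /* Hypothetical token record, for illustration only. */
    struct token {
        const char *text;
        size_t      len;
    };

    /* Placeholders for the real GCL checks. */
    extern int is_keyword(const struct token *t);
    extern int is_integer(const struct token *t);
    extern int is_operator(const struct token *t);

    /* Does the token contain a byte with the high bit set, i.e. part of a
       UTF-8 sequence? */
    int has_high_bit(const struct token *t)
    {
        size_t i;
        for (i = 0; i < t->len; i++)
            if ((unsigned char)t->text[i] & 0x80)
                return 1;
        return 0;
    }

    /* A token is acted on only if it is pure ASCII *and* is a keyword, an
       integer, or an operator; everything else - including any token with
       UTF-8 bytes in it - is ignored, just like the English filler words
       in the example above. */
    int is_significant(const struct token *t)
    {
        if (has_high_bit(t))
            return 0;
        return is_keyword(t) || is_integer(t) || is_operator(t);
    }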

I am talking about programming languages in general. New and old. Using
UTF-8 encoding for source files has more advantages than disadvantages.

> I think
> the other messages on this thread clearly show why this is a potential
> problem if we allow Unicode into our identifiers. (It's not a problem if you
> confine Unicode to string literals). Write a lexer that can handle
> UTF-16/UCS-2. It's a lot easier to preprocess all of your text to that
> encoding before lexing it than it is to have multi-octet data.

Naturally. The lexer does not even need to know what format the source
file is in. The file input routines can read the file in UTF-8, decode it,
and pass 32-bit (or 16-bit if you insist) values to the lexer. All you need
to change in the lexer is the name of getchar or equivalent. To the lexer
it is completely irrelevant what format the source code is in.
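
As a minimal sketch of what such an input routine might look like (my own
illustration, assuming the lexer wants 32-bit values; error recovery,
overlong forms and the like are deliberately left out):

    #include <stdio.h>

    /* A drop-in replacement for getchar(): read one UTF-8 sequence from fp
       and return the decoded code point as a long, or EOF at end of file.
       Malformed input simply yields U+FFFD, the replacement character. */
    long get_codepoint(FILE *fp)
    {
        int c = fgetc(fp);
        long cp;
        int extra, i;

        if (c == EOF)
            return EOF;

        if (c < 0x80)                      /* one byte: plain ASCII   */
            return c;
        else if ((c & 0xE0) == 0xC0) {     /* two-byte sequence       */
            cp = c & 0x1F;  extra = 1;
        } else if ((c & 0xF0) == 0xE0) {   /* three-byte sequence     */
            cp = c & 0x0F;  extra = 2;
        } else if ((c & 0xF8) == 0xF0) {   /* four-byte sequence      */
            cp = c & 0x07;  extra = 3;
        } else {
            return 0xFFFD;                 /* stray continuation byte */
        }

        for (i = 0; i < extra; i++) {
            c = fgetc(fp);
            if (c == EOF || (c & 0xC0) != 0x80)
                return 0xFFFD;             /* truncated or malformed  */
            cp = (cp << 6) | (c & 0x3F);
        }
        return cp;
    }

The lexer then calls get_codepoint() wherever it used to call getchar()
and never sees UTF-8 at all.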

> I'm not saying that UTF-8 is bad. I *like* UTF-8 and cherish a warm place in
> my heart for it as an encoding. It is, in fact, a beautiful design.

Hehe. That is a complete turn-around from calling it a kludge. :-)

> So I'm not against people storing UTF-8 on disk if that's really their
> hearts desire, but I think it makes sense to use UTF-16 wherever possible in
> implementation. IMHO.

So far this discussion has been about Unicode in the source code, not about
what happens inside the lexer and the rest of the stages of compilation. I
never suggested the compiler should keep the code in UTF-8 format all the
way through. Nor did anyone else.

As I say above, it should be in 32 bits by the time it gets into the lexer.
Let me add that it should be the lexer that further decides whether a token
needs to be decomposed and perhaps converted to all upper or all lower case
(since it is the lexer that knows whether a character belongs to a string
literal or to an identifier).
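
In code, that division of labour might look roughly like this;
normalize_identifier() stands in for whatever decomposition and case
conversion the language rules call for, it is not a real library routine:

    /* Only the lexer knows the context, so only the lexer can decide what
       to do with a decoded code point. */
    enum lex_context { IN_CODE, IN_STRING_LITERAL };

    extern long normalize_identifier(long cp);   /* placeholder */

    long lex_filter(long cp, enum lex_context ctx)
    {
        if (ctx == IN_STRING_LITERAL)
            return cp;                       /* keep literals exactly as written */
        return normalize_identifier(cp);     /* decompose and/or fold case       */
    }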

But I am strongly suggesting the source should be stored in UTF-8 on disk.
Not because of what the heart desires but because it solves many problems
that arise from storing the source unencoded, namely the problems of
portability of the source code: 32 bits vs. 16 bits, little-endian vs.
big-endian, legacy code, etc.

Perhaps we are in complete agreement after all. :-)

Adam


