Re: Unicode in source

From: schererm@us.ibm.com
Date: Thu Jul 22 1999 - 18:14:10 EDT


there are some technical inaccuracies in stanislav's email; see below at the lines marked "***".
i am trying not to get into "what is better".
markus

"G. Adam Stanislav" <adam@whizkidtech.net> on 99-07-22 13:29:31
Subject: Re: Unicode in source

On Thu, Jul 22, 1999 at 09:29:14AM -0700, Addison Phillips wrote:
> Actually, I am aware of the advantages for using UTF-8 in Internet
> ...

Internally a program will presumably decode UTF-8 into whatever format it
uses. As for being stored on disk, what if the disk is on a LAN consisting
of PCs and Macs? Should it be stored in little-endian or big-endian order?

What if two (or more) programmers work on the same code, each using a different
system?

What if the same programmer takes some code home and uses a different
system than at work?

No problem with UTF-8, lots of problems otherwise.

*** if you worry about this, then you can use a BOM (signature);
*** for example, windows nt notepad and win98 wordpad
*** write it and recognize it.
*** i have got some java code that auto-detects it and
*** instantiates a matching Reader object.
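*** sketched in c (my actual code is java and returns a Reader; this
*** is just the idea, not the real thing), signature detection looks
*** at the first few bytes:

    /* detected file encoding, from the first bytes (the signature) */
    typedef enum { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16BE, ENC_UTF16LE } Encoding;

    /* EF BB BF = utf-8 signature, FE FF = utf-16 big-endian BOM,
       FF FE = utf-16 little-endian BOM */
    Encoding detectSignature(const unsigned char *b, unsigned int n) {
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return ENC_UTF8;
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return ENC_UTF16BE;
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return ENC_UTF16LE;
        return ENC_UNKNOWN; /* no signature; caller picks a default */
    }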

Besides, UTF-16 can only contain the first plane. Even though, strictly
speaking, Unicode is 16-bit, the ISO standard (is it 10646?) is 32-bit.

*** no. it is UCS-2 that can only contain the first plane.
*** UTF-16 encodes ("contains") the first 17 planes.

Most importantly (to me, anyway), there are millions++ lines of code
already written in 7-bit ASCII. A UTF-8 compiler can still compile them.
A 16-bit (or a 32-bit) compiler cannot unless it has two input processors.

*** as with the mentioned piece of java code, you could do that
*** with a flexible input function/object. most compilers have a command
*** line flag for the encoding, and they could do signature detection, too.
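*** in c, such an input layer might combine the flag and the signature
*** like this (a hypothetical sketch, reusing the detectSignature()
*** function from above):

    #include <stdio.h>

    /* pick the source file encoding: an explicit command line flag
       wins, then a detected signature, then a utf-8 default
       (7-bit ascii source is valid utf-8 byte for byte). */
    Encoding chooseEncoding(FILE *f, Encoding flagEncoding) {
        unsigned char buf[3];
        unsigned int n = (unsigned int)fread(buf, 1, sizeof buf, f);
        Encoding detected = detectSignature(buf, n);
        rewind(f); /* let the real decoder start at the first byte */
        if (flagEncoding != ENC_UNKNOWN) return flagEncoding;
        if (detected != ENC_UNKNOWN) return detected;
        return ENC_UTF8;
    }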

> o The characters are all 16-bits in the BMP,

At this time they are. In the future they may not be. What if you want to
write a program that uses Ugaritic text for its output? AFAIK, that is not
in the BMP.

*** no. the BMP is the Basic Multilingual Plane, i.e., the first plane.
*** its codes always fit into 1x16b.
*** if Ugaritic gets onto plane 1, then it is not going to be
*** in the BMP by definition and will need 2x16b in UTF-16, 21b in UTF-32.
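*** as a worked example in c: if ugaritic were assigned at, say,
*** U+10380 (a made-up position, just for illustration), utf-16
*** would store it as the surrogate pair D800 DF80:

    /* encode a code point c in 0x10000..0x10FFFF as a utf-16
       surrogate pair. for c = 0x10380: c - 0x10000 = 0x00380,
       high = 0xD800 + (0x00380 >> 10)   = 0xD800,
       low  = 0xDC00 + (0x00380 & 0x3FF) = 0xDF80. */
    void toSurrogates(unsigned long c,
                      unsigned short *high, unsigned short *low) {
        c -= 0x10000;                                   /* 20 bits left */
        *high = (unsigned short)(0xD800 + (c >> 10));   /* top 10 bits */
        *low  = (unsigned short)(0xDC00 + (c & 0x3FF)); /* bottom 10 bits */
    }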

> o There is less text expansion for non-Latin languages.

Yes, but with a well-written expansion library (that I have been proposing)
it happens fast and is completely transparent to the compiler writer.

> o There are programmatic ways of handling Unicode text via TCHAR that
> reduces the impact on code. If you don't unthread UTF-8 to UTF-16, text
> processing becomes somewhat uglier.

Again, that can be completely transparent. More importantly, TCHAR is of
different sizes in different OSes. For example, under Windows 95+/NT, TCHAR
is 16 bits wide. Under FreeBSD (and probably other Unices) it is 32 bits
wide.

*** TCHAR on win95/98 is always 8b=char and assumes SBCS or MBCS.
*** TCHAR on win nt/2000 is either the same or is 16b Unicode,
*** depending on #define UNICODE and #define _UNICODE
*** i have dealt with this myself for 2 years now.
*** nt does have both flavors of api entry points,
*** with different function names.
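*** in c it looks roughly like this (condensed; not the verbatim
*** declarations from the sdk headers):

    #ifdef UNICODE
        typedef unsigned short TCHAR;   /* 16b unicode code unit */
        #define MessageBox MessageBoxW  /* "wide" entry point */
    #else
        typedef char TCHAR;             /* 8b, sbcs or mbcs codepage */
        #define MessageBox MessageBoxA  /* "ansi" entry point */
    #endif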

Take for example the computer I am writing this message on: An old
Pentium 100 with two OSes installed. Sometimes I boot to Windows 95,
sometimes to FreeBSD. I store most of my source code on the Windows
partition of the hard disk because FreeBSD can read and write files
from and to the Windows file system but Windows cannot do the same
with the BSD file system.

I edit the exact same files under both systems. The only incompatibility
is that Windows inserts carriage returns before line feeds and Unix
does not. But editors on both systems can handle this minor quirk.
So can compilers.
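A compiler's input routine needs only a few lines to treat both
conventions alike. A minimal sketch in C (illustrative only):

    #include <stdio.h>

    /* read the next character, folding "\r\n" (Windows) and bare
       "\n" (Unix) into a single '\n' */
    int readNormalized(FILE *f) {
        int c = fgetc(f);
        if (c == '\r') {
            int next = fgetc(f);
            if (next != '\n' && next != EOF)
                ungetc(next, f); /* lone CR: push the byte back */
            return '\n';
        }
        return c;
    }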

Suppose I wanted to use unencoded Unicode. I would constantly have
to convert each file between Windows 16-bit TCHAR and Unix 32-bit
TCHAR. That would be a major annoyance. I would also most likely
end up with a corrupt file if I accidentally opened the
file in one format under a different OS and wrote to it. I have
no doubt this would happen sooner or later no matter how careful
I was.

*** once you are in a file system, there is no such thing
*** as "unencoded Unicode". it always follows some encoding scheme.
*** this may be one of the UTFs, or SCSU, or something else.
*** this would also include the implicit or explicit (BOM)
*** declaration of the encoding, including its byte order.
*** explicit specifications could also be external to the file itself,
*** as a file attribute or similar.
*** it should then be possible to specify other encodings, too.
*** aside from some form of specifying the file encoding, your editor
*** on either system could be set to save in any desirable encoding.

*** as you state below, you expect to always convert between your favorite
*** file encoding (UTF-8) and your favorite internal form (UTF-32/UCS-4)
*** anyway.
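*** that conversion is cheap. a minimal c sketch of the utf-8 to
*** utf-32 direction (it assumes well-formed input; a real decoder
*** must also reject ill-formed sequences):

    /* decode one utf-8 sequence starting at s into a utf-32 code
       point *c; returns the number of bytes consumed (1..4) */
    int utf8ToUtf32(const unsigned char *s, unsigned long *c) {
        if (s[0] < 0x80) {                              /* 7-bit ascii */
            *c = s[0];
            return 1;
        }
        if (s[0] < 0xE0) {                              /* 2-byte form */
            *c = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        }
        if (s[0] < 0xF0) {                              /* 3-byte form */
            *c = ((unsigned long)(s[0] & 0x0F) << 12)
                 | ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
            return 3;
        }
        *c = ((unsigned long)(s[0] & 0x07) << 18)       /* 4-byte form */
             | ((unsigned long)(s[1] & 0x3F) << 12)
             | ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }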

...

But I am strongly suggesting that the source should be stored in UTF-8 on
disk. Not because of what the heart desires, but because it solves many
problems that arise from storing the source unencoded, namely the
portability problems of the source code: 32 bits vs. 16 bits, little-endian
vs. big-endian, legacy code, etc.

*** often, right now, that decision is made by what your compilers take
*** and by what your source code control system can accept.
*** for windows, the resource compilers rc and mc can work with any
*** windows codepage as well as ucs-2, and so can sourcesafe.
*** only utf-16 beyond ucs-2 does not work on windows (yet).
*** wishes of programmers have not had much effect on getting
*** a choice beyond microsoft's products.
*** (notice that i work for someone else)

*** [any view expressed is my own and not that of
*** any employer or competitor :-]
*** [any new mistakes are my own, too...]

Perhaps we are in complete agreement after all. :-)

Adam



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:48 EDT