Re: Unicode in source code. WHY?

From: G. Adam Stanislav (adam@whizkidtech.net)
Date: Wed Jul 21 1999 - 22:04:51 EDT


On Wed, Jul 21, 1999 at 03:03:57PM -0700, Addison Phillips wrote:
> UTF-8 is a kludge.

UTF-8 is the only encoding besides ASCII that all Internet protocols are
required to understand (I do not recall the RFC number, but I can look
it up if you wish). And since ASCII is a subset of UTF-8, one may say
UTF-8 is the one and only required encoding.

> [snip]
> But it's still a kludge. At some point, "real" Unicode text files should
> become the norm, rather than having to transform everything. Let's prompt
> editor writers to create editors that read and write Unicode without blowing
> chunks. UTF-8 is merely a detour (albeit a very useful one).

The problem is there is no such thing as a "real" Unicode text file. Unicode
is 16-bit, ISO 10646 is 32-bit. Which one is real? Some systems are big-endian,
others little-endian. Which one is real?
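
To make that concrete, here is a minimal sketch in C (my own toy
illustration, nothing from the standard) of how one and the same code
point comes out as different bytes depending on the width and the byte
order one happens to pick:

    #include <stdio.h>

    /* The same Unicode code point written out under different
     * encoding widths and byte orders.  Purely a sketch for this
     * discussion, not production code. */
    int main(void)
    {
        unsigned long cp = 0x010D;  /* LATIN SMALL LETTER C WITH CARON */

        /* 16-bit, big-endian */
        printf("16-bit BE: %02lX %02lX\n", (cp >> 8) & 0xFF, cp & 0xFF);

        /* 16-bit, little-endian */
        printf("16-bit LE: %02lX %02lX\n", cp & 0xFF, (cp >> 8) & 0xFF);

        /* 32-bit ISO 10646 (UCS-4), big-endian */
        printf("32-bit BE: %02lX %02lX %02lX %02lX\n",
               (cp >> 24) & 0xFF, (cp >> 16) & 0xFF,
               (cp >> 8) & 0xFF, cp & 0xFF);

        return 0;
    }

That prints 01 0D, 0D 01, and 00 00 01 0D for the very same character.
Hand any one of those files to a program expecting one of the others
and you get garbage.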

IMHO, Unicode made a wise choice not to decide. The way I read it, Unicode
is not an encoding but a mapping, hence there is no "real" Unicode text
file. Or, perhaps, any encoding one can think up is real as long as it
can encode all of Unicode and decode it back.

UTF-8 solves all these incompatibilities in a nice way. I agree that it is
not perfect (ASCII-only text passes through unchanged at one byte per
character, Roman alphabets with diacritics grow to two bytes, Chinese and
other non-European scripts need three). But it is the best I have seen so far.
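
Just to show how simple the scheme is, here is a rough sketch of a
UTF-8 encoder in C (my own toy code for code points up to U+FFFF;
surrogates and error handling ignored, so do not take it as anything
official):

    #include <stdio.h>

    /* Minimal UTF-8 encoder for code points up to U+FFFF -- a sketch
     * of the scheme only. */
    static int utf8_encode(unsigned long cp, unsigned char *out)
    {
        if (cp < 0x80) {            /* ASCII: one byte, unchanged */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {    /* e.g. Latin with diacritics: two bytes */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else {                    /* e.g. Chinese: three bytes */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        unsigned long samples[] = { 0x0041, 0x00E9, 0x4E2D };
        int i, j, n;

        for (i = 0; i < 3; i++) {
            n = utf8_encode(samples[i], buf);
            printf("U+%04lX ->", samples[i]);
            for (j = 0; j < n; j++)
                printf(" %02X", buf[j]);
            printf("\n");
        }
        return 0;
    }

It prints 41 for U+0041, C3 A9 for U+00E9, and E4 B8 AD for U+4E2D.
The ASCII byte goes through untouched, which is exactly why all the
old tools keep working on UTF-8 files.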

> PS: <grinding axe>Yes, I know that the standard says UTF-8 is "real"
> Unicode. But UTF-8 should not, IMHO, be the encoding *of choice* for the
> future. It's the encoding of choice for supporting the past.</grinding axe>

The requirement that all Internet protocols understand UTF-8 only came
into effect several months ago. That makes it very much the encoding of
the future, at least on the Internet.

Adam


