RE: Unicode in source code. WHY?

From: Addison Phillips (AddisonP@simultrans.com)
Date: Wed Jul 21 1999 - 18:02:41 EDT

Next message: Kenneth Whistler: "Re: Off topic: English orthography"
Previous message: Rick McGowan: "Re: Off topic: English orthography"
Maybe in reply to: Addison Phillips: "RE: Unicode in source code. WHY?"
Next in thread: Jonathan Rosenne: "RE: Unicode in source code. WHY?"

UTF-8 is a kludge.

A useful and valuable kludge, mostly used to get Unicode support into
"legacy" applications, especially those that rely on parsers and expression
engines. In fact, UTF-8 was originally designed as a file-system-safe
encoding for UNIX-like systems. The specific design details of UTF-8 are
well tuned to these requirements (e.g. not having to rewrite OS utilities
and functions to get predictable behavior). In that respect UTF-8 is pretty
neat.
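
To make the "no rewriting of OS utilities" point concrete, here is a minimal Python sketch, not taken from the original thread. The path and the helper name split_path are hypothetical. It shows a byte-oriented splitter that knows nothing about UTF-8 still handling non-ASCII names correctly, because every byte of a multi-byte UTF-8 sequence has its high bit set.

```python
def split_path(raw):
    """Split a path on the ASCII slash byte, knowing nothing about UTF-8."""
    return [part for part in raw.split(b"/") if part]

# Hypothetical path with non-ASCII directory and file names.
path = "/home/žluťoučký/日本語.txt"
components = split_path(path.encode("utf-8"))
decoded = [c.decode("utf-8") for c in components]

# Collect the bytes that encode the non-ASCII characters only.
non_ascii_bytes = [b for ch in path if ord(ch) > 0x7F for b in ch.encode("utf-8")]

# None of them can be mistaken for "/" or NUL, since all are >= 0x80.
high_bit_only = all(b >= 0x80 for b in non_ascii_bytes)

# By contrast, a 16-bit encoding puts NUL bytes inside plain ASCII text,
# which is what breaks C-string based tools.
utf16_has_nul = b"\x00" in "home".encode("utf-16-le")
```

Here decoded recovers the three path components intact, while utf16_has_nul is True for the 16-bit form of the same kind of text.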

But it's still a kludge. At some point, "real" Unicode text files should
become the norm, rather than having to transform everything. Let's prompt
editor writers to create editors that read and write Unicode without blowing
chunks. UTF-8 is merely a detour (albeit a very useful one).

Addison

PS: <grinding axe>Yes, I know that the standard says UTF-8 is "real"
Unicode. But UTF-8 should not, IMHO, be the encoding *of choice* for the
future. It's the encoding of choice for supporting the past.</grinding axe>

-----Original Message-----
From: G. Adam Stanislav [mailto:adam@whizkidtech.net]
Sent: Wednesday, July 21, 1999 13:49
To: Unicode List
Cc: Unicode List; mohrin@sharmahd.com
Subject: Re: Unicode in source code. WHY?

On Tue, Jul 20, 1999 at 10:57:31PM -0700, Jonathan Rosenne wrote:
> We also know that those environments that do allow the use of Unicode are
> not all compatible. I see two main problems:
>
> 1. Should the full Unicode repertoire be allowed, or just a subset?

Why is that a problem? 8-bit bytes should be allowed without question.
That allows all of non-ASCII Unicode, including punctuation, quotation
marks, and everything else, UTF-8 encoded. We would finally be able to
use a non-breaking space instead of the underscore kludge.
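
As a rough illustration of what "allow 8-bit bytes" could mean in a lexer, here is a small Python sketch. The function scan_identifier and its rule (ASCII letters, digits, underscore, plus any byte >= 0x80) are assumptions made for this example, not the behavior of any particular compiler.

```python
import string

# ASSUMPTION: the ASCII part of an identifier is letters, digits, underscore.
ASCII_IDENT = set((string.ascii_letters + string.digits + "_").encode("ascii"))

def scan_identifier(src, start=0):
    """Return the identifier bytes beginning at `start` (hypothetical toy lexer).

    Any byte >= 0x80 is accepted, so every UTF-8 multi-byte sequence
    passes through without the lexer having to decode it.
    """
    end = start
    while end < len(src) and (src[end] in ASCII_IDENT or src[end] >= 0x80):
        end += 1
    return src[start:end]

# U+00A0 NO-BREAK SPACE encodes in UTF-8 as the two bytes C2 A0.
source = "total\u00a0count = 1".encode("utf-8")
ident = scan_identifier(source)
```

With this rule the identifier runs through the no-break space and stops at the ordinary ASCII space. Note that such a rule admits every non-ASCII character, visible or not, which is the subset question raised in point 1.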

UTF-8 encoding makes the various environments compatible.

> 2. When are two identifiers to be considered equivalent?

When they consist of the same sequence of bytes. That would certainly work
for case-sensitive languages. Case-insensitive languages would need to
rely on the operating system to make the comparison.
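
One caveat with pure byte comparison can be sketched in Python. The helper same_identifier below is hypothetical, and using NFC normalization plus case folding is one possible approach assumed for this example, not something proposed in the thread. The same visible identifier can be spelled with different code points, and therefore different bytes.

```python
import unicodedata

precomposed = "caf\u00e9"   # "é" as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Byte-for-byte comparison treats these as two different identifiers.
same_bytes = precomposed.encode("utf-8") == decomposed.encode("utf-8")

def same_identifier(a, b, case_sensitive=True):
    """Compare identifiers after NFC normalization (hypothetical helper)."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if not case_sensitive:
        # str.casefold() is Python's built-in Unicode case folding,
        # so no operating-system service is needed for this step.
        a, b = a.casefold(), b.casefold()
    return a == b

normalized_equal = same_identifier(precomposed, decomposed)
caseless_equal = same_identifier("Caf\u00e9", "CAFE\u0301", case_sensitive=False)
```

Here same_bytes is False while normalized_equal and caseless_equal are both True, so whether two identifiers "consist of the same sequence of bytes" depends on whether the source was normalized first.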

Adam

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:48 EDT