Re: Unicode end-users

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Fri Aug 01 1997 - 15:07:28 EDT


Graham Rhind wrote on 1997-08-01 07:39 UTC:
> Are there plans to enable Unicode to function as
> ASCII does, for example, so that it is application independent and is of
> direct use to the user rather than just to software developers?

In the Plan9 operating system (the current work of the guys who developed
Unix), ASCII was replaced entirely by UTF-8 several years ago.
On Plan9, UTF-8 is the *ONLY* character encoding. You can therefore
use Greek and Cyrillic characters everywhere, just like Latin ones: in
source code, file names, environment variables, user names, passwords,
printer names, etc. Plan9 (like Unix) does not have ANY notion of code
pages or of switching between character sets. Therefore, you have to
introduce UTF-8 everywhere at once, which is the simplest and
most practical solution. In systems built around code pages and
character set switching, like MIME and Microsoft's, Unicode will always
be just one of several possible encodings, and therefore it will always
be more a part of the end user's problem than a part of the end user's
solution. Unlike Windows, Plan9 does not have separate system calls and
library functions for Unicode and for 8-bit code pages. As an
application programmer, you cannot avoid using UTF-8 under Plan9.

For Unix, UTF-8 as the exclusive way of representing characters
is clearly the way to go: Unix has no character set switching
mechanisms, and to minimize the changes necessary to existing software,
the encoding must be ASCII compatible.
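
To make "ASCII compatible" concrete, here is a small Python sketch. It
is only an illustration: the function names and the sample path are
made up for this example. It shows that ASCII text is unchanged by
UTF-8 and that byte-oriented code splitting on '/' keeps working,
because every byte of a multi-byte UTF-8 sequence is >= 0x80.

```python
def split_path_bytes(path: str) -> list:
    """Split a path the way byte-oriented, ASCII-era code would.

    The path is encoded as UTF-8, split on the single byte 0x2F ('/'),
    and each piece is decoded again. No multi-byte UTF-8 sequence
    contains a byte below 0x80, so 0x2F can only ever mean '/'.
    """
    raw = path.encode("utf-8")
    return [piece.decode("utf-8") for piece in raw.split(b"/")]


def non_ascii_bytes_stay_high(text: str) -> bool:
    """Check that non-ASCII characters never yield bytes below 0x80."""
    return all(
        byte >= 0x80
        for ch in text if ord(ch) >= 0x80
        for byte in ch.encode("utf-8")
    )


# Pure ASCII text is byte-for-byte identical in both encodings.
assert "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")

# Hypothetical path mixing Latin, Greek and Cyrillic names.
print(split_path_bytes("/home/Ελένη/файл.txt"))
# prints ['', 'home', 'Ελένη', 'файл.txt']
```

Existing software that treats strings as opaque bytes and only gives
special meaning to ASCII characters therefore needs little or no change.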

End users do not care about character set switching. They want a
single, simple-to-understand encoding that is used everywhere.
All this talk about the encoding inefficiency of UTF-8
or UCS-2 compared to 8-bit code pages is just complete academic nonsense:
a) storage prices halve every two years and they will continue to
do so over the next 10 years, and b) only a few percent of memory is
usually used to store uncompressed text. We live in a time when
application software no longer fits on a single CD-ROM, so
don't claim that 16 or 24 bits per character is an unbearable waste
of memory. Switching mechanisms, however, are an unbearable waste of
complexity.
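
The size argument can be checked with a rough Python sketch. The
helper names and the sample word are assumptions made for this
illustration, and UCS-2 is approximated with Python's "utf-16-be"
codec, which produces the same bytes as long as every character is
in the 16-bit range.

```python
def encoded_sizes(text: str, code_page: str) -> dict:
    """Return the byte length of text in a code page, UTF-8 and UCS-2."""
    if any(ord(ch) > 0xFFFF for ch in text):
        # UCS-2 has no representation for these characters.
        raise ValueError("text contains characters outside the 16-bit range")
    return {
        code_page: len(text.encode(code_page)),
        "utf-8": len(text.encode("utf-8")),
        # Assumption: UTF-16BE without a byte order mark stands in for UCS-2.
        "ucs-2": len(text.encode("utf-16-be")),
    }


def relative_price(years: float, halving_period: float = 2.0) -> float:
    """Price relative to today if it halves every halving_period years."""
    return 0.5 ** (years / halving_period)


# Five Greek letters: 5 bytes in ISO 8859-7, 10 in UTF-8, 10 in UCS-2.
print(encoded_sizes("Ελένη", "iso8859_7"))

# Halving every two years for ten years leaves 1/32 of today's price.
print(relative_price(10))  # prints 0.03125
```

Under the halving assumption, a factor of two in text size is paid
back by two years of falling storage prices.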

Systems like the MIME or ECMA registries, with their hundreds of
different encodings, are nothing the end user is interested in. The end
user wants to type any character, any time, anywhere, and this is only
possible with a single system-wide encoding. For Unix and most Internet
protocols, this must be UTF-8 because of the ASCII legacy; for other,
more modern environments, UCS-2 might be a better approach.
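
One concrete aspect of the ASCII legacy can be sketched in Python (the
helper name and the sample string are made up for this illustration;
UCS-2 is again approximated with the "utf-16-be" codec). In UCS-2,
every ASCII character carries a zero byte, which software that treats
NUL as a string terminator cannot pass through. UTF-8 never produces
a zero byte except for the NUL character itself.

```python
def contains_nul(data: bytes) -> bool:
    """Return True if the byte string contains a zero byte."""
    return b"\x00" in data


name = "/etc/passwd"  # hypothetical sample string

# Assumption: UTF-16BE without a byte order mark stands in for UCS-2.
as_ucs2 = name.encode("utf-16-be")
as_utf8 = name.encode("utf-8")

print(contains_nul(as_ucs2))  # prints True
print(contains_nul(as_utf8))  # prints False
```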

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu


