Re: KOI-8

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Mon Sep 02 1996 - 11:53:25 EDT


> Incidentally, I heard that Columbia has joined up and that you are the
> representative. Welcome!
>
Thanks. I have been interested in computer character sets ever since I read
Joe Becker's first Scientific American article, which I still regard as one of
the most influential articles ever published in computing -- or if it isn't,
it should be.

As you may know, the Kermit Project has made supporting the widest possible
variety of character sets for data interchange a primary objective since the
mid 1980s, and Kermit software fills a unique "low-tech" niche all over the
world largely for this reason.

Now that we are beginning to see Unicode used in systems like Windows NT, it
is time for Kermit software to start supporting Unicode too. Of course, for
our purposes, Unicode is just "one more character set" (just as POSIX is "one
more UNIX variation" :-). But there is a difference: Unicode will be
enormously useful as an intermediate/internal representation for text, because
most of the world's characters are in it.

As to the Kermit protocol, it sticks firmly to the idea that anything that
goes on the wire must follow ISO or other definitive standards. So when
transferring, say, Portuguese text files from a Macintosh to a PC, the
Mac translates from its own encoding to Latin-1, and the PC translates the
incoming Latin-1 to a PC code page, so the two computers speak ISO 8859-1 to
each other even though neither one uses it internally. Each system needs
translations only to and from the transfer character set, not to and from
every other system's encoding, which reduces an "O(n^2)" problem to "O(n)".
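The Macintosh-to-PC example can be sketched roughly as follows. This is only
an illustration, not Kermit's actual code: Python's built-in codecs
("mac_roman", "latin-1", "cp850") stand in for the translation tables, CP850
is assumed as the PC code page, and the function names are hypothetical.

```python
# Sketch of translating through a standard transfer character set.
# Assumptions: Python codecs stand in for Kermit's tables; CP850 is the
# PC code page; send/receive/tables_needed are hypothetical names.

def send(local_bytes, local_encoding, transfer_encoding="latin-1"):
    """Sender: translate from the local encoding to the transfer set."""
    return local_bytes.decode(local_encoding).encode(transfer_encoding)

def receive(wire_bytes, local_encoding, transfer_encoding="latin-1"):
    """Receiver: translate from the transfer set to the local encoding."""
    return wire_bytes.decode(transfer_encoding).encode(local_encoding)

def tables_needed(n):
    """Translation tables for n local encodings:
    (direct pairwise, via one transfer character set)."""
    return n * (n - 1), 2 * n

mac_file = "coração".encode("mac_roman")   # Portuguese text on the Mac
wire = send(mac_file, "mac_roman")         # Latin-1 goes on the wire
pc_file = receive(wire, "cp850")           # PC stores it in its code page

print(mac_file.hex(), wire.hex(), pc_file.hex())
print(tables_needed(10))
```

The three byte strings differ, yet each decodes to the same text in its own
encoding, and neither end needs to know what the other uses internally.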

But to support non-Western-European languages, we need multiple "transfer
character sets": Latin-2, Latin/Cyrillic, Latin/Hebrew, Japanese EUC (JIS X
0201 and JIS X 0208 combined via ISO 2022), etc. This scheme has worked out
quite well. But when we start adding Vietnamese, Greek, Chinese, Korean,
Arabic, and so on, this approach will become increasingly difficult to manage.

Thus Unicode or some form of ISO 10646 could also easily wind up as our
preferred transfer character set, but for the obvious concerns about
transmission efficiency -- remember Kermit is still the low-tech alternative
to all the hot stuff that most Americans tend to take for granted, and
efficiency is still a concern for many people.
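To put a rough number on the efficiency concern, the sketch below compares
the size of a short Cyrillic string in an 8-bit transfer set with two Unicode
encoding forms. The sample text and the choice of UTF-16 and UTF-8 are
assumptions made for illustration; no particular encoding form is proposed
here.

```python
# Bytes on the wire for the same text in different encodings.
# Assumption: the sample string and the encodings compared are illustrative.
text = "Привет, мир"   # 11 characters, 9 of them Cyrillic letters

sizes = {
    "iso8859_5": len(text.encode("iso8859_5")),   # Latin/Cyrillic, 8-bit
    "utf_16_be": len(text.encode("utf_16_be")),   # 16 bits per character
    "utf_8": len(text.encode("utf_8")),           # variable length
}

for name, size in sizes.items():
    print(f"{name:10} {size:3} bytes")
```

For text like this, either Unicode form roughly doubles what has to be
transmitted compared with the 8-bit set.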

I've followed the Unicode list with interest since the beginning and made a
couple of contributions along the way (you can thank me for U+2028 and
U+2029 :-). I'm a bit put off by the tenor of the recent discussion of whether
combining items should precede or follow the base character (a moot point
now), in which it was argued that the only place this would matter is in
text-oriented (as opposed to GUI) applications, and that nobody uses text-mode
software any more. I'm here to tell you that's not quite true, yet :-) In
addition to the diehard speed-typing curmudgeons who feel a mouse only slows
them down, there are also the many blind people who use speaking and Braille
devices that work only in text mode.

> PS: I don't think Unicode has a KOI-8 mapping table (even for the
> old and apparently well known version). Do you have or know of any
> that we could add to our ftp site..?
>
Not as such, but I do have a table of KOI-8. You can find all of the
tables we use in the Kermit project at our ftp site:

  ftp://kermit.columbia.edu/kermit/charsets/

The KOI-8 table is koi8.txt, and the C program that generates it is koi.c.
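For anyone who wants a quick byte-to-Unicode mapping without fetching those
files, something like the following produces one. Note the assumptions: it
uses Python's "koi8_r" codec (KOI8-R as in RFC 1489), which may differ from
the older KOI-8 asked about, and the output layout is made up for this sketch
and is not the format of koi8.txt.

```python
# Sketch: emit a byte -> Unicode mapping for the upper half of KOI8-R.
# Assumptions: Python's koi8_r codec, not necessarily the "old" KOI-8;
# the line layout is hypothetical, not that of koi8.txt.
import unicodedata

def koi8_mapping_lines(codec="koi8_r"):
    lines = []
    for byte in range(0x80, 0x100):
        ch = bytes([byte]).decode(codec)
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"0x{byte:02X}\t0x{ord(ch):04X}\t# {name}")
    return lines

table = koi8_mapping_lines()
print("\n".join(table))
```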

- Frank


