Unicode and Kermit

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Sun Aug 08 1999 - 16:13:33 EDT

I've begun to add Unicode support to Kermit file transfer. First, a bit of
background, then some questions...

When Kermit transfers a text file, it converts from the local encoding (the
"file character set", such as CP850, KOI-8, etc) to a standard encoding on the
wire (the "transfer character set", such as Latin-1, Latin/Cyrillic, etc), and
then the receiver converts from the transfer encoding to its own local
encoding (e.g. an Apple character set):

  Source FCS --> TCS --> Destination FCS
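
A minimal sketch of that three-stage pipeline, using Python's standard
codecs rather than Kermit's own translation tables. The function name
`transfer` is hypothetical, and the choice of CP850 as the source set,
Latin-1 as the transfer set, and Mac Roman as the destination set is only
an illustration:

```python
def transfer(data: bytes, src_fcs: str, tcs: str, dst_fcs: str) -> bytes:
    """Convert src_fcs -> tcs (on the wire) -> dst_fcs.

    Characters with no equivalent in the next set are replaced with "?".
    """
    # Sender: local file character set to transfer character set.
    wire = data.decode(src_fcs).encode(tcs, errors="replace")
    # Receiver: transfer character set to its own file character set.
    return wire.decode(tcs).encode(dst_fcs, errors="replace")


cp850_bytes = "Grüße".encode("cp850")
mac_bytes = transfer(cp850_bytes, "cp850", "latin-1", "mac_roman")
print(mac_bytes.decode("mac_roman"))
```

The byte values differ at each stage, but the text survives because all
three sets contain the characters involved.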

The sender announces the Transfer character set (TCS) to the receiver using
ISO Registration numbers. This is written up in greater detail in the (now
rather dated) character-set related papers at:


As in any other protocol, only well-defined international standard formats,
not proprietary ones, go on the wire.

Kermit now supports both UTF-8 and UCS-2 as File character sets and Transfer
character sets (at Level 1 -- one step at a time). Both UCS-2 and UTF-8 are
allowed as Transfer character sets because each one is more (or less) efficient for
certain language groups (roughly CJK vs all others) and we don't want to
favor any group.
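
The size difference is easy to see with a rough sketch. The sample strings
and the helper name `wire_sizes` are made up for illustration, and Python's
"utf-16-be" codec stands in for big-endian UCS-2 (the two coincide for the
characters used here, which are all in the Basic Multilingual Plane):

```python
def wire_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, big-endian UCS-2 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


samples = {"Latin": "Kermit file transfer", "CJK": "文字コード変換"}
for label, text in samples.items():
    utf8_len, ucs2_len = wire_sizes(text)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UCS-2 = {ucs2_len} bytes")
```

ASCII text doubles in size as UCS-2, while the CJK sample takes three bytes
per character in UTF-8 but only two in UCS-2.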

Addition of Unicode support to Kermit has obvious benefits:

 1. We can now convert between UCS-2 or UTF-8 and many other character
    sets as part of the data transfer process.

 2. Even when Unicode is not involved at either end, transfers from one
    8-bit set to another can be less lossy: e.g. from DEC Multinational
    to Apple: when we go through Unicode, we don't lose the OE digraph
    as we might when going through (say) Latin-1.
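
A small sketch of the second point. Python's standard library does not, as
far as I know, include a DEC Multinational codec, so the source text is
given directly as a Python string; Mac Roman stands in for the Apple set,
and "?" replacement stands in for whatever fallback a real converter would
use:

```python
text = "c\u0153ur"  # "cœur": U+0153 is the lowercase oe digraph

# Through Latin-1: the transfer set has no code for the digraph, so it is
# lost before the receiver ever sees it.
via_latin1 = text.encode("latin-1", errors="replace").decode("latin-1")

# Through Unicode (UTF-8 here): the digraph survives the wire, and the
# destination set happens to contain it.
wire = text.encode("utf-8")
via_unicode = wire.decode("utf-8").encode("mac_roman").decode("mac_roman")

print(via_latin1)
print(via_unicode)
```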

UTF-8 support is relatively straightforward since its byte order is well
defined. UCS-2 is a bit of a mess, though. First, for sanity, we put UCS-2
on the wire in only one form: Big Endian (most significant byte first). Now,
suppose I want to send a UCS-2 encoded file; I still have several
things to consider:

 1. My computer is either Big Endian or Little Endian.

 2. The UCS-2 file has a BOM or it doesn't.

 3. If the UCS-2 file does not have a BOM, it might have its bytes swapped
    relative to the endianness of my computer (e.g. because the file arrived
    via FTP from an other-endian machine), or it might not.

I am assuming that if a BOM is present, I should believe it (is that a good
idea?). Furthermore I should strip the BOM prior to sending, since it serves
no purpose on the wire. The file sender's main task is to ensure the bytes
go out in the right order.

So when reading a UCS-2 file, the file sender should:

 1. Use the BOM if found.

 2. If there is no BOM, assume the local machine's endianness unless
    instructed otherwise.

 3. An override mechanism is needed to "instruct otherwise".
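
The three rules above can be sketched as follows. This is only an
illustration, not Kermit's code: the function name `ucs2_to_wire` and the
`override` parameter are hypothetical, and letting the BOM take precedence
over the override is an assumption (it is one of the open questions here):

```python
import sys

BOM_BE = b"\xfe\xff"  # U+FEFF stored most significant byte first
BOM_LE = b"\xff\xfe"  # U+FEFF stored least significant byte first


def ucs2_to_wire(data: bytes, override=None) -> bytes:
    """Return UCS-2 file data as big-endian bytes with any BOM stripped.

    override may be "big", "little", or None (use the machine's byte order).
    It is consulted only when the file has no BOM.
    """
    if data[:2] == BOM_BE:
        order, body = "big", data[2:]
    elif data[:2] == BOM_LE:
        order, body = "little", data[2:]
    else:
        order, body = (override or sys.byteorder), data
    if len(body) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    if order == "little":
        # Swap each byte pair so the most significant byte goes out first.
        swapped = bytearray(len(body))
        swapped[0::2] = body[1::2]
        swapped[1::2] = body[0::2]
        body = bytes(swapped)
    return body


print(ucs2_to_wire(b"\xff\xfeA\x00"))            # little-endian file with BOM
print(ucs2_to_wire(b"A\x00", override="little"))  # no BOM, forced little-endian
```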

Once we know how to read the source file, we can use some well-known table or
algorithm to convert it to the Transfer character set. (Clearly we can lose
information when using an 8-bit TCS.)

Now if UCS-2 itself is the TCS and therefore its byte order is well defined,
it would seem the receiver has several options:

 1. Convert to some other character set (including UTF-8) -- this is

 2. Store as UCS-2 in the local byte order with or without a BOM.

 3. Perhaps even store in swapped byte order (with or without a BOM).

So it looks like any data transfer system involving UCS-2 needs controls to
force byte swapping at either end, and to write or not write a BOM to the
destination file.
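
On the receiving side, the two controls might look something like this.
Again a sketch only: `wire_to_ucs2`, `byteorder`, and `write_bom` are
hypothetical names, and defaulting to the local byte order with no BOM is
an assumption, not a recommendation:

```python
import sys


def wire_to_ucs2(wire: bytes, byteorder=None, write_bom=False) -> bytes:
    """Turn big-endian UCS-2 from the wire into bytes for the local file.

    byteorder may be "big", "little", or None (use the machine's byte order).
    If write_bom is true, U+FEFF is prepended in the chosen byte order.
    """
    if len(wire) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    order = byteorder or sys.byteorder
    # Build the output big-endian first, BOM included, then swap if needed.
    data = (b"\xfe\xff" if write_bom else b"") + wire
    if order == "little":
        swapped = bytearray(len(data))
        swapped[0::2] = data[1::2]
        swapped[1::2] = data[0::2]
        data = bytes(swapped)
    return data


print(wire_to_ucs2(b"\x00A", byteorder="little", write_bom=True))
```

Passing the non-native value for `byteorder` covers the "store in swapped
byte order" case.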

Does all this sound reasonable?

What about defaults and precedence? When reading a UCS-2 or UTF-8 file,
should the BOM always override any global settings or preferences?

When writing out a UCS-2 file, should we write a BOM by default or only on
request? What about UTF-8?

Finally, about UTF-8 -- there has been some talk recently about "shortest
sequences". Do the words at the top of page A-8 of The Unicode Standard 2.0
still apply? It would seem they are consonant with the well-known dictum
"Be conservative in what you send, liberal in what you accept".

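For readers unfamiliar with the "shortest sequences" issue, here is a small
illustration. It does not reproduce the wording of page A-8; it only shows
what a non-shortest sequence is. The helper `decode_two_byte` is
hypothetical and deliberately omits any shortest-form check; Python's own
strict UTF-8 decoder is shown for contrast:

```python
def decode_two_byte(seq: bytes) -> int:
    """Decode a two-byte UTF-8 sequence by bit arithmetic alone.

    No check is made that the result actually needed two bytes.
    """
    lead, trail = seq
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


overlong = b"\xc0\xaf"                      # two bytes carrying a 7-bit value
code_point = decode_two_byte(overlong)      # the "liberal" reading
shortest = chr(code_point).encode("utf-8")  # what a conservative sender emits

try:
    overlong.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

print(hex(code_point), shortest, strict_accepts)
```

A liberal receiver would accept both forms of the same character, while a
conservative sender would only ever produce the one-byte form.
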
- Frank
