Jony Rosenne, who has been a great contributor since or before the
beginning, wrote in an off moment:
> UTF-8 is a biased transformation format designed to save American and
> Western Europeans storage space and to give some people a warm feeling by
> keeping Unicode in the familiar 8 bit world.
FYI, below are the design goals of UTF-8 as specified by its originators,
Ken Thompson et al @ ATT.
Date: Tue, 8 Sep 92 03:22:07 EDT
Subject: (XoJIG 620) <Subject missing>
Here is our modified FSS-UTF proposal. The words are the same as on the
previous proposal. My apologies to the author. The code has been tested to
some degree and should be in pretty good shape. We have converted Plan 9 to
use this encoding and are about to issue a distribution to an initial set of
File System Safe Universal Character Set Transformation Format (FSS-UTF)
With the approval of ISO/IEC 10646 (Unicode) as an international standard
and the anticipated widespread use of this universal coded character set
(UCS), it is necessary for historically ASCII based operating systems to
devise ways to cope with representation and handling of the large number of
characters that are possible to be encoded by this new standard.
There are several challenges presented by UCS which must be dealt with by
historical operating systems and the C-language programming environment.
The most significant of these challenges is the encoding scheme used by UCS.
More precisely, the challenge is the marrying of the UCS standard with
existing programming languages and existing operating systems and utilities.
The challenges of the programming languages and the UCS standard are being
dealt with by other activities in the industry. However, we are still faced
with the handling of UCS by historical operating systems and utilities.
Prominent among the operating system UCS handling concerns is the
representation of the data within the file system. An underlying assumption
is that there is an absolute requirement to maintain the existing operating
system software investment while at the same time taking advantage of the
large number of characters provided by the UCS.
UCS provides the capability to encode multi-lingual text within a single
coded character set. However, UCS and its UTF variant do not protect null
bytes and/or the ASCII slash ("/"), making these character encodings
incompatible with existing Unix implementations. The following proposal
provides a Unix compatible transformation format of UCS such that Unix
systems can support multi-lingual text in a single encoding. This
transformation format encoding is intended to be used as a file code. This
transformation format encoding of UCS is intended as an intermediate step
towards full UCS support. However, since nearly all Unix implementations
face the same obstacles in supporting UCS, this proposal is intended to
provide a common and compatible encoding during this transition stage.
With the assumption that most, if not all, of the issues surrounding the
handling and storing of UCS in historical operating system file systems are
understood, the objective is to define a UCS transformation format which
also meets the requirement of being usable on a historical operating system
file system in a non-disruptive manner. The intent is that UCS will be the
process code for the transformation format, which is usable as a file code.
Criteria for the Transformation Format
Below are the guidelines that were used in defining the UCS transformation
format:
1) Compatibility with historical file systems:
Historical file systems disallow the null byte and the ASCII slash
character as a part of the file name.
2) Compatibility with existing programs:
The existing model for multibyte processing is that ASCII does not
occur anywhere in a multibyte encoding. There should be no ASCII code
values for any part of a transformation format representation of a character
that was not in the ASCII character set in the UCS representation of the
character.
3) Ease of conversion from/to UCS.
4) The first byte should indicate the number of bytes to follow in a
multibyte sequence.
5) The transformation format should not be extravagant in terms of
number of bytes used for encoding.
6) It should be possible to find the start of a character
efficiently starting from an arbitrary location in a byte stream.
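Criteria 1 and 2 together mean that byte-oriented code which looks for "/" or a null byte keeps working unchanged on encoded text. As a minimal sketch of that idea (the function name count_components is hypothetical and not part of the proposal), a path can be split by scanning raw bytes, on the assumption that no byte of a multibyte character ever equals an ASCII value:

```c
#include <stddef.h>

/* Hypothetical helper: count the '/'-separated components of a
 * NUL-terminated path by looking at raw bytes only.
 * Assumption: the path is in the proposed transformation format, so
 * every byte of a multibyte character has its high-order bit set and
 * can never be mistaken for '/' (0x2F) or NUL (0x00). */
size_t count_components(const char *path)
{
    size_t n = 0;
    int in_component = 0;

    for (; *path != '\0'; path++) {
        if (*path == '/') {
            in_component = 0;      /* separator ends the current component */
        } else if (!in_component) {
            in_component = 1;      /* first byte of a new component */
            n++;
        }
    }
    return n;
}
```

The scan never decodes anything; it relies only on the guarantee that the separator byte cannot occur inside a multibyte sequence.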
The proposed UCS transformation format encodes UCS values in the range
[0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, 5, and 6
bytes. For all encodings of more than one byte, the initial byte determines
the number of bytes used and the high-order bit in each byte is set. Every
byte that does not start 10xxxxxx is the start of a UCS character sequence.
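The rule that only continuation bytes start with 10xxxxxx is what makes criterion 6 cheap to satisfy. A minimal sketch, with hypothetical helper names is_continuation and char_start, and assuming the buffer holds well-formed encoded text:

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if b has the form 10xxxxxx. */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: index of the first byte of the character that
 * contains buf[pos]. Steps backwards over continuation bytes until a
 * byte that is not 10xxxxxx is found (or the buffer start is reached).
 * Assumes buf holds well-formed text in the transformation format. */
size_t char_start(const unsigned char *buf, size_t pos)
{
    while (pos > 0 && is_continuation(buf[pos]))
        pos--;
    return pos;
}
```

Because a sequence is at most 6 bytes long, the loop backs up at most 5 bytes on well-formed input.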
An easy way to remember this transformation format is to note that the
number of high-order 1's in the first byte signifies the number of bytes in
the multibyte character:
Bytes  Bits  Hex Min   Hex Max   Byte Sequence in Binary
1       7    00000000  0000007F  0vvvvvvv
2      11    00000080  000007FF  110vvvvv 10vvvvvv
3      16    00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
4      21    00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
5      26    00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
6      31    04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
The UCS value is just the concatenation of the v bits in the multibyte
encoding. When there are multiple ways to encode a value, for example UCS
0, only the shortest encoding is legal.
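The two rules above (count the high-order 1's, and accept only the shortest form) can be sketched as follows. The names seq_len, min_for_len, and is_shortest are hypothetical; the minimum values are taken from the "Hex Min" column of the table:

```c
/* Hypothetical helper: number of bytes in the sequence introduced by
 * first byte b. Returns 0 if b cannot start a sequence, that is, if it
 * is a continuation byte (10xxxxxx) or one of the unused values
 * 0xFE and 0xFF. */
int seq_len(unsigned char b)
{
    int n = 0;

    if ((b & 0x80) == 0)
        return 1;                      /* 0vvvvvvv: single byte */
    while (b & 0x80) {                 /* count high-order 1 bits */
        n++;
        b = (unsigned char)(b << 1);
    }
    return (n >= 2 && n <= 6) ? n : 0;
}

/* Smallest UCS value that needs n bytes; index 0 is unused. */
const unsigned long min_for_len[7] = {
    0, 0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
};

/* Hypothetical helper: nonzero if an n-byte sequence is the shortest
 * legal encoding of value. */
int is_shortest(unsigned long value, int n)
{
    return n >= 1 && n <= 6 && value >= min_for_len[n];
}
```

For example, the two-byte pattern 11000000 10000000 carries the value 0, but is_shortest(0, 2) is false, so a decoder following the rule above would reject it in favor of the single byte 00000000.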
Below are sample implementations of the C standard wctomb() and mbtowc()
functions which demonstrate the algorithms for converting from UCS to the
transformation format and converting from the transformation format to UCS.
The sample implementations include error checks, some of which may not be
necessary for conformance:
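The following is not the authors' listing. It is a minimal sketch of the same two conversions, written from the table above. The names fss_encode and fss_decode and their signatures are hypothetical; they deliberately differ from the standard wctomb() and mbtowc() interfaces so that the sketch does not depend on wchar_t width or locale state.

```c
#include <stddef.h>

/* Encode UCS value c into s, which must have room for 6 bytes.
 * Returns the number of bytes written, or 0 if c is above 0x7FFFFFFF. */
size_t fss_encode(unsigned char *s, unsigned long c)
{
    static const unsigned long max_for_len[7] = {
        0, 0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF
    };
    static const unsigned char lead[7] = {
        0, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    size_t n, i;

    if (c > 0x7FFFFFFFUL)
        return 0;
    for (n = 1; c > max_for_len[n]; n++)
        ;                                   /* shortest length that fits */
    for (i = n - 1; i > 0; i--) {           /* continuation bytes, last first */
        s[i] = (unsigned char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    s[0] = (unsigned char)(lead[n] | c);    /* length marker plus top bits */
    return n;
}

/* Decode one character from s (len bytes available) into *out.
 * Returns the number of bytes consumed, or 0 if the input is
 * truncated, malformed, or not the shortest encoding. */
size_t fss_decode(unsigned long *out, const unsigned char *s, size_t len)
{
    static const unsigned long min_for_len[7] = {
        0, 0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
    };
    unsigned long c;
    unsigned char b;
    size_t n, i;

    if (len == 0)
        return 0;
    b = s[0];
    if (b < 0x80) {                         /* 0vvvvvvv */
        *out = b;
        return 1;
    }
    for (n = 0; b & 0x80; n++)              /* count high-order 1 bits */
        b = (unsigned char)(b << 1);
    if (n < 2 || n > 6 || n > len)
        return 0;
    c = (unsigned long)(b >> n);            /* v bits of the first byte */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)          /* must be 10vvvvvv */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_for_len[n])                 /* only the shortest form is legal */
        return 0;
    *out = c;
    return n;
}
```

As a usage example, fss_encode applied to the value 0x20AC writes the three bytes E2 82 AC, and fss_decode on those bytes returns 3 and stores 0x20AC.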