proposal: UTF-20

Date: Mon Jan 25 1999 - 08:56:05 EST


                (forgive me if you saw this before -
                 I have had some trouble sending email recently)

Proposal (informal) of the UCS/Unicode Transformation Format UTF-20

I propose the UTF-20 encoding for Unicode/ISO-10646 as a compromise between
compactness and the use of scalar integers for all characters in planes 0
to 15. Planes 16 and above are not accessible.

Motivation: A minimal format that allows indexing of Unicode characters.

More Details:

In this format, all characters are stored with 20 bits each, and a UTF-20
string contains those characters without padding bits, MSB first. Each two
adjacent characters occupy 5 octets.
The pair U-000jklmn U-000pqrst (each letter is one hexadecimal digit) is
transformed to and from the 5 octets jk lm np qr st.
If there is an odd number of characters in the stream, then the 4 bits
following the last character are padding and should be set to 0.
The signature is the UTF-20 form of U+feff: 0f ef fp, where p stands for
the bits 19 to 16 of the second UTF-20 character in the stream.
If a stream contains 1, 2, or 4 octets more than a multiple of 5, then the
last 1, 2, or 1 octets, respectively, do not form a valid character.
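
The packing rules above can be sketched in C as follows. This is only an
illustration: the function name utf20_encode and its signature are
hypothetical, and the input code points are assumed to be at most 0xFFFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the proposal: packs n code points
 * (each assumed <= 0xFFFFF) into out[], 20 bits each, MSB first.
 * Returns the number of octets written: 5 per pair of characters, plus 3
 * for a trailing odd character whose last 4 bits are zero padding.
 * out must have room for at least (5 * n + 1) / 2 octets. */
size_t utf20_encode(const uint32_t *cp, size_t n, uint8_t *out)
{
    size_t i = 0, j = 0;

    for (; i + 1 < n; i += 2) {              /* full pairs: 5 octets each */
        uint32_t a = cp[i] & 0xFFFFF;
        uint32_t b = cp[i + 1] & 0xFFFFF;
        out[j++] = (uint8_t)(a >> 12);                      /* jk */
        out[j++] = (uint8_t)(a >> 4);                       /* lm */
        out[j++] = (uint8_t)(((a & 0xF) << 4) | (b >> 16)); /* np */
        out[j++] = (uint8_t)(b >> 8);                       /* qr */
        out[j++] = (uint8_t)b;                              /* st */
    }
    if (i < n) {                             /* odd trailing character */
        uint32_t a = cp[i] & 0xFFFFF;
        out[j++] = (uint8_t)(a >> 12);
        out[j++] = (uint8_t)(a >> 4);
        out[j++] = (uint8_t)((a & 0xF) << 4); /* low 4 bits are padding 0 */
    }
    return j;
}
```

Encoding U+feff followed by a second character starts the output with the
octets 0f ef fp, matching the signature described above.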

Indexing of characters:

Assume an array of unsigned octets b[] with UTF-20 characters, and a
zero-based character index i. To get the scalar value of character i into
the 32b integer s:

int j=(i<<1) + (i>>1); // octet index j=floor(2.5*i), same for even and odd i
if(i&1) { // i is odd, char starts in mid-octet (low 4 bits of b[j])
    s=((int)(b[j]&0xf)<<16) | ((int)b[j+1]<<8) | ((int)b[j+2]);
} else { // i is even, char starts with full octet
    s=((int)b[j]<<12) | ((int)b[j+1]<<4) | ((int)(b[j+2]>>4));
}
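
For illustration, the lookup can be wrapped in a self-contained function
with a bounds check. The name utf20_get and the -1 error return are
assumptions made for this sketch only. The octet index is computed as
2*i + i/2 with integer division, i.e. floor(2.5*i).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper: returns the scalar value of character i in the
 * UTF-20 octet array b[] of length len, or -1 if that character would
 * extend past the end of the array. */
int32_t utf20_get(const uint8_t *b, size_t len, size_t i)
{
    size_t j = 2 * i + i / 2;   /* floor(2.5 * i) */

    if (j + 3 > len)            /* a character touches octets j..j+2 */
        return -1;
    if (i & 1)                  /* odd: starts at the low nibble of b[j] */
        return ((int32_t)(b[j] & 0xF) << 16) | ((int32_t)b[j + 1] << 8)
               | (int32_t)b[j + 2];
    /* even: starts on an octet boundary, ends at the high nibble of b[j+2] */
    return ((int32_t)b[j] << 12) | ((int32_t)b[j + 1] << 4)
           | (int32_t)(b[j + 2] >> 4);
}
```

With the octets 0f ef f1 23 45, character 0 is 0xfeff and character 1 is
0x12345.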

Treatment of plane 16:

Plane 16 is currently assigned as a private use plane. Its use should be
discouraged. If it is necessary to have more than 6400+64k private use
characters, then one of the planes in { 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
} could be assigned for this purpose.
Since many information technology systems right now ignore characters
beyond the BMP or even corrupt them in conversions (even to and from
UTF-8), the hope is that there are not many users of plane 16 yet.
A limitation to just using planes 0 to 15 also makes it somewhat easier to
check a UTF-8 stream for acceptable characters, because only the first
octet needs to be examined.
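
As a sketch of that first-octet check: in UTF-8, the four-octet lead bytes
f0 to f3 cover code points up to U+fffff, while f4 starts plane 16. The
function name below is hypothetical, and the check covers only the upper
range limit. It does not validate continuation octets or reject overlong
forms.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: true if `lead` is the first octet of a four-octet
 * UTF-8 sequence whose code point cannot exceed U+FFFFF (planes 0 to 15).
 * Lead bytes F0..F3 qualify; F4 and above encode plane 16 or beyond. */
bool utf8_lead4_within_20_bits(uint8_t lead)
{
    return lead >= 0xF0 && lead <= 0xF3;
}
```

With the current limit of U+10ffff, the lead byte f4 is also allowed, but
only when the second octet is in the range 80 to 8f, which is why a second
octet must be inspected there.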

I am eager to read comments about this proposal.



Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430

---------------------- Forwarded by Markus Scherer/Raleigh/Contr/IBM on
99-01-19 15:46 ---------------------------

Markus Scherer
99-01-15 11:09

From: Markus Scherer/Raleigh/Contr/IBM@IBMUS
Subject: suggest to limit scalar values to 20b


I have been mostly listening here and reading various Unicode/10646 web
sites. I am sorry if the following has been discussed, I did not check the

There is one little thing that I would like to bring into the discussion:
The range of available code points.

It seems that the original idea was to be able to code everything with 16b,
which turned out to be too little.
Now there are proposed characters in planes 1, 2, and 14, and planes 15 and
16 are for private use.
Effectively, code points are used and proposed from 0000 0000 to 0010 ffff,
using about 20.1 bits, or 21b of regular integers.
Full 32b integers are not very popular for storage etc., so surrogate
pairs/UTF-16 are used for the default encoding. In UTF-8, the
four-byte-codes are not used fully.
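
The "about 20.1 bits" figure can be checked by counting the code points in
the 17 planes 0 to 16, each of which holds 2^16 code points:

```latex
\log_2\left(17 \cdot 2^{16}\right) = 16 + \log_2 17 \approx 20.09
\qquad\text{whereas}\qquad
\log_2\left(16 \cdot 2^{16}\right) = 20
```

So 17 planes need 21 bits in an integer, while planes 0 to 15 alone fit in
exactly 20 bits.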

I like the limitation of the intended range to less than 31b.
However, I think it makes sense to limit it further to just use 20b, i.e.,
code points 0000 0000 to 000f ffff, and not use the private use plane 16.
This should make it easier to have simple codes with just scalar integers,
like a stream of 5 bytes per 2 characters, and it yields a nice, round
number of bits for the scalar values.

One disadvantage may be that there would be illegal codes possible in the
surrogate pairs/UTF-16 encoding (dbc0...dbff in the first 16b word).
With UTF-8, the valid code range could be checked by just looking at the
first byte of a character since exactly half of the 4B-codes would be used.
With the current range, the second byte needs to be examined, too.
If 6400+64k private use characters are not enough, then one of the 11
unused planes could be used instead of plane 16.
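
The UTF-16 range mentioned above can be verified with the standard
surrogate formula. The sketch below uses hypothetical function names and
assumes its input is a supplementary code point (U+10000 to U+10ffff).

```c
#include <stdbool.h>
#include <stdint.h>

/* Lead (first) surrogate for a supplementary code point cp,
 * assumed to be in the range 0x10000..0x10FFFF. */
uint16_t utf16_lead_surrogate(uint32_t cp)
{
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

/* Hypothetical check: true if this lead surrogate can only begin a
 * plane 16 character, which would be illegal under a 20-bit limit. */
bool utf16_lead_is_plane16(uint16_t lead)
{
    return lead >= 0xDBC0 && lead <= 0xDBFF;
}
```

The first and last code points of plane 16, U+100000 and U+10ffff, map to
the lead surrogates dbc0 and dbff, while U+fffff maps to dbbf.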




PS: If only the 9bit-byte had become the standard...

Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430
