Preliminary Proposal for a new Unicode Text Format

UTF-c is motivated by the inefficiency of UTF-8 for most of the world’s alphabetic scripts, in which each character typically takes two or three bytes. UTF-c allows one alphabet besides Latin/ASCII to be encoded in one byte per character, provided its letters fall within a 64-character Unicode block. For example, the Cyrillic alphabet can be encoded in one byte per character instead of the two bytes used by UTF-8, and the Indic alphabets in one byte per character instead of three. As an additional benefit, twice as many characters can be encoded in two bytes (12 payload bits versus 11 in UTF-8), and four times as many in three bytes (18 bits versus 16).
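The byte counts claimed above can be checked with a small C++ sketch. The UTF-c thresholds are taken from the encoding table in this proposal; the function names and the page_base parameter (assumed here to be the first code point of the selected 64-character block, 0xc0 by default) are hypothetical and only serve the illustration.

```cpp
#include <cstdint>

// Bytes used by standard UTF-8 for one code point.
constexpr int utf8_length(std::uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes used by UTF-c for one code point, following the proposal's table.
// page_base is an assumed parameter: first code point of the selected block.
constexpr int utfc_length(std::uint32_t cp, std::uint32_t page_base = 0xC0) {
    return cp < 0x80 ? 1
         : (cp >= page_base && cp < page_base + 64) ? 1
         : cp < 0x1080 ? 2
         : cp < 0x41080 ? 3
         : 4;
}

// Cyrillic Zhe (U+0416) with the block starting at U+0400 selected.
static_assert(utf8_length(0x416) == 2 && utfc_length(0x416, 0x400) == 1,
              "one byte instead of two");
// Devanagari Ka (U+0915) with the block starting at U+0900 selected.
static_assert(utf8_length(0x915) == 3 && utfc_length(0x915, 0x900) == 1,
              "one byte instead of three");
```

Characters outside the selected block still benefit from the wider ranges: U+1000, for instance, needs three bytes in UTF-8 but two in UTF-c.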

UTF-c also has a four-byte file prefix, which identifies the file as a UTF-c text file and encodes the page number of the selected alphabet. The file prefix consists of four zero-width control bytes {FS, GS, RS, US} (0x1c to 0x1f), so that existing browsers and text editors can handle the files correctly, provided the appropriate code page or font/script is selected.

Code point                 Bits  Binary value                              UTF-c bytes
-------------------------  ----  ----------------------------------------  -----------------------------------
U+00..U+7f                 7     0xxxxxxx                                  0xxxxxxx
U+c0..U+ff (default)       6     11xxxxxx                                  11xxxxxx
U+80..U+bf, U+100..U+107f  12    U – 0x80 = 0000yyyy xxxxxxxx              10yyyyxx 10xxxxxx
U+1080..U+04107f           18    U – 0x1080 = 000000zz yyyyyyyy xxxxxxxx   10zzyyyy 11yyyyxx 10xxxxxx
U+041080..U+10ffff         21    U – 0x41080 = 0000zzzz yyyyyyyy xxxxxxxx  10ººººzz 11zzyyyy 11yyyyxx 10xxxxxx
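
As a rough illustration of the table above, the following C++ sketch encodes and decodes single code points. It is not the accompanying program: the names utfc_encode, utfc_decode and kDefaultPageBase and the page_base parameter are hypothetical. The sketch assumes that the selected 64-character block is identified by its first code point, that a non-default block simply takes the place of U+c0..U+ff in the one-byte form, and that the four spare bits (ºººº) of a four-byte sequence are written as zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed parameter: first code point of the selected 64-character block.
constexpr std::uint32_t kDefaultPageBase = 0xC0;

// Append the UTF-c bytes for one code point to `out`.
inline void utfc_encode(std::uint32_t cp, std::vector<std::uint8_t>& out,
                        std::uint32_t page_base = kDefaultPageBase) {
    assert(cp <= 0x10FFFF);
    // 10xxxxxx opens and closes a multi-byte sequence; 11xxxxxx sits in between.
    auto edge = [](std::uint32_t v) { return static_cast<std::uint8_t>(0x80 | (v & 0x3F)); };
    auto mid  = [](std::uint32_t v) { return static_cast<std::uint8_t>(0xC0 | (v & 0x3F)); };
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));    // 0xxxxxxx
    } else if (cp >= page_base && cp < page_base + 64) {
        out.push_back(mid(cp - page_base));              // 11xxxxxx
    } else if (cp < 0x1080) {
        std::uint32_t v = cp - 0x80;                     // 12 bits
        out.push_back(edge(v >> 6));
        out.push_back(edge(v));
    } else if (cp < 0x41080) {
        std::uint32_t v = cp - 0x1080;                   // 18 bits
        out.push_back(edge(v >> 12));
        out.push_back(mid(v >> 6));
        out.push_back(edge(v));
    } else {
        std::uint32_t v = cp - 0x41080;                  // 20 bits
        out.push_back(edge(v >> 18));                    // spare bits left zero (assumption)
        out.push_back(mid(v >> 12));
        out.push_back(mid(v >> 6));
        out.push_back(edge(v));
    }
}

// Decode one code point starting at in[pos] and advance pos past it.
// Returns false on truncated or malformed input.
inline bool utfc_decode(const std::vector<std::uint8_t>& in, std::size_t& pos,
                        std::uint32_t& cp,
                        std::uint32_t page_base = kDefaultPageBase) {
    if (pos >= in.size()) return false;
    std::uint8_t b = in[pos++];
    if (b < 0x80) { cp = b; return true; }
    if (b >= 0xC0) { cp = page_base + (b & 0x3F); return true; }
    // Offsets from the table, indexed by sequence length in bytes.
    static const std::uint32_t offset[5] = {0, 0, 0x80, 0x1080, 0x41080};
    std::uint32_t v = b & 0x3F;
    for (int n = 2; n <= 4; ++n) {
        if (pos >= in.size()) return false;
        std::uint8_t c = in[pos++];
        if (c < 0x80) return false;                      // ASCII inside a sequence
        v = (v << 6) | (c & 0x3F);
        if ((c & 0xC0) == 0x80) {                        // closing byte reached
            cp = v + offset[n];
            return cp <= 0x10FFFF;
        }
    }
    return false;                                        // no closing byte within four bytes
}
```

With the default block, utfc_encode(0x416, out) appends the two bytes 0x8e 0x96 for the Cyrillic letter Ж; with page_base set to 0x400 the same letter becomes the single byte 0xd6. Because each offset starts where the previous range ends, no code point has more than one encoding of a given length class.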

Main Features:

·       no null bytes except for the null character

·       alphabetic scripts of common languages can be encoded in 1 byte per character

·       backward-compatible with ASCII and other code pages

·       full Unicode character set, but with no byte-order marks

·       may be quickly scanned in forward and backward directions

·       avoids over-long forms of characters
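
The forward and backward scanning mentioned in the list can be sketched in C++ as follows. The sketch relies only on the byte patterns in the table: a byte of the form 10xxxxxx opens and closes every multi-byte sequence, and 11xxxxxx bytes lie in between. The function names utfc_next and utfc_prev are hypothetical, and the code assumes well-formed input and a starting position that lies on a character boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Return the index just past the character that starts at s[pos].
inline std::size_t utfc_next(const std::vector<std::uint8_t>& s, std::size_t pos) {
    assert(pos < s.size());
    std::uint8_t b = s[pos++];
    if ((b & 0xC0) != 0x80) return pos;           // ASCII or one-byte page character
    while (pos < s.size()) {
        if ((s[pos++] & 0xC0) != 0xC0) break;     // closing 10xxxxxx byte ends the sequence
    }
    return pos;
}

// Return the index of the first byte of the character that ends just before pos.
inline std::size_t utfc_prev(const std::vector<std::uint8_t>& s, std::size_t pos) {
    assert(pos > 0 && pos <= s.size());
    --pos;
    if ((s[pos] & 0xC0) != 0x80) return pos;      // ASCII or one-byte page character
    while (pos > 0) {
        --pos;
        if ((s[pos] & 0xC0) != 0xC0) break;       // reached the opening 10xxxxxx byte
    }
    return pos;
}
```

Neither direction needs to decode the payload bits, so counting characters or stepping a cursor costs one mask and compare per byte. The backward scan does depend on starting from a character boundary, since an opening and a closing byte share the same 10xxxxxx pattern.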

The accompanying C++ program converts between UTF-c and UTF-8 text files, and is provided here to give an example of how UTF-c files may be processed.