New UTF-8 decoder stress test file

From: Markus Kuhn
Date: Sun Sep 26 1999 - 12:22:15 EDT

I have updated the UTF-8 decoder stress test file to also cover
overlong sequences, which a good UTF-8 decoder should reject just
like malformed sequences for security reasons.

One part of UTF-8's ASCII compatibility is the property:

  ASCII compatibility of the first kind:

  ASCII bytes (00-7f) will only represent ASCII characters and will not
  show up in other contexts.

Equally important is another property:

  ASCII compatibility of the second kind:

  ASCII characters can only be represented with a single ASCII
  byte (00-7f) and cannot be decoded from other multi-byte sequences.
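
To see why the second kind matters, here is a minimal sketch (not taken
from any real decoder; naive_decode2 is a hypothetical name) of a 2-byte
decode step that omits the overlong check. Fed the bytes C0 AF, it yields
U+002F, the ASCII slash, which a byte-level filter looking for 0x2f would
never have seen:

```c
#include <stdint.h>

/* Naive decode of a 2-byte sequence 110xxxxx 10xxxxxx.
   Deliberately has NO overlong check, to show the problem. */
static uint32_t naive_decode2(uint8_t b0, uint8_t b1)
{
    return ((uint32_t)(b0 & 0x1f) << 6) | (uint32_t)(b1 & 0x3f);
}
```

A legitimate pair such as C3 A4 decodes to U+00E4 as expected; the trouble
is only that C0 AF is accepted as well.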

Section 4 in the test file helps you to establish the robustness of your
decoder here. Testing for overlong UTF-8 sequences is very easy, once
you have fully understood that all overlong sequences fall into one of
the following patterns:

  1100000x 10xxxxxx
  11100000 100xxxxx 10xxxxxx
  11110000 1000xxxx 10xxxxxx 10xxxxxx
  11111000 10000xxx 10xxxxxx 10xxxxxx 10xxxxxx
  11111100 100000xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
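
The five patterns above only constrain the first two bytes, so they can be
tested directly. The following is a rough sketch under that reading;
starts_overlong is a hypothetical helper name, and the function assumes
the caller has already identified b0 as a start byte and b1 as the byte
that follows it:

```c
#include <stdint.h>

/* Returns 1 if start byte b0 followed by b1 matches one of the five
   overlong patterns, 0 otherwise. */
static int starts_overlong(uint8_t b0, uint8_t b1)
{
    if ((b1 & 0xc0) != 0x80)
        return 0;                         /* b1 is not 10xxxxxx: malformed, not overlong */
    if ((b0 & 0xfe) == 0xc0)
        return 1;                         /* 1100000x 10xxxxxx */
    if (b0 == 0xe0)
        return (b1 & 0xe0) == 0x80;       /* 11100000 100xxxxx */
    if (b0 == 0xf0)
        return (b1 & 0xf0) == 0x80;       /* 11110000 1000xxxx */
    if (b0 == 0xf8)
        return (b1 & 0xf8) == 0x80;       /* 11111000 10000xxx */
    if (b0 == 0xfc)
        return (b1 & 0xfc) == 0x80;       /* 11111100 100000xx */
    return 0;
}
```

For example, E0 9F is flagged while E0 A0 (the start of U+0800, the
smallest code point that needs three bytes) is not.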

For instance, in xterm, all I had to add were the two lines

 /* continuation byte 10xxxxxx found */
 if (screen->utf_char == 0 && ((c & 0x7f) >> (7 - screen->utf_count)) == 0) {
   screen->utf_char = UCS_REPL; /* overlong sequence */
 }

and

 /* start byte 110xxxxx found */
 if ((c & 0x1e) == 0)
   screen->utf_char = UCS_REPL; /* overlong sequence */

at the right place to catch all overlong UTF-8 sequences and replace
them with the REPLACEMENT CHARACTER. (The second "if" checks the start
byte of a 2-byte sequence; the first "if" checks the first continuation
byte of any sequence. Here c is the input byte, screen->utf_char is the
UCS-2 word accumulated so far, and screen->utf_count is the expected
number of remaining continuation bytes, including the current one.)
Summary: Adding a safety check to a UTF-8 decoder such that ASCII
compatibility of the second kind is ensured is really trivial, and the
example code on the Unicode ftp site should definitely be corrected.

Test your UTF-8 decoder with the attached file! It is very likely that
you will discover strange bugs this way. I haven't yet seen a UTF-8
decoder that was really correct the first time I tested it. Most I saw
treat malformed UTF-8 sequences very badly, xterm being a notable
exception. Netscape is one of the worst and does not even get past test

The test file is also available as

or in

in the examples/ directory. Both directories contain many more
interesting UTF-8 test files, especially for font proof-reading.

Happy decoder testing ...


Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at,  WWW: <>
