New UTF-8 decoder stress test file

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Sun Sep 26 1999 - 12:22:15 EDT


I have updated the UTF-8 decoder stress test file to also cover
overlong sequences, which a good UTF-8 decoder should, for security
reasons, reject just like malformed sequences.

One part of UTF-8's ASCII compatibility is the property:

  ASCII compatibility of the first kind:

  ASCII bytes (00-7f) will only represent ASCII characters and will not
  show up in other contexts.

Equally important is another property:

  ASCII compatibility of the second kind:

  ASCII characters can only be represented with a single ASCII
  byte (00-7f) and cannot be decoded from other multi-byte sequences.
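
  To see why the second property matters, here is a minimal sketch of a
  2-byte decode step with no overlong check. The name naive_decode2 is
  hypothetical and is not taken from any real decoder. It turns the
  overlong pair C0 AF into 0x2F ('/'), so a filter that only scans the
  raw bytes for 0x2F would miss it.

```c
#include <assert.h>

/* Naive 2-byte decode with no overlong check (hypothetical helper).
 * b0 must be a 110xxxxx start byte, b1 a 10xxxxxx continuation byte. */
unsigned naive_decode2(unsigned char b0, unsigned char b1)
{
    assert((b0 & 0xe0) == 0xc0 && (b1 & 0xc0) == 0x80);
    /* 5 payload bits from the start byte, 6 from the continuation byte */
    return ((unsigned)(b0 & 0x1f) << 6) | (unsigned)(b1 & 0x3f);
}
```

  Calling naive_decode2(0xc0, 0xaf) yields the ASCII slash even though
  no byte in the range 00-7f appeared in the input.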

Section 4 in the test file helps you to establish the robustness of your
decoder here. Testing for overlong UTF-8 sequences is very easy, once
you have fully understood that all overlong sequences fall into one of
the following patterns:

  1100000x 10xxxxxx
  11100000 100xxxxx 10xxxxxx
  11110000 1000xxxx 10xxxxxx 10xxxxxx
  11111000 10000xxx 10xxxxxx 10xxxxxx 10xxxxxx
  11111100 100000xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
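
  The five patterns can be tested directly on the first two bytes of a
  sequence. The following is only a sketch; the function name
  utf8_starts_overlong and its interface are assumptions made for
  illustration, and it does not validate the rest of the sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the n bytes at s begin with one of the five overlong
 * patterns listed above, 0 otherwise. Hypothetical helper. */
int utf8_starts_overlong(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    /* need a start byte followed by one continuation byte 10xxxxxx */
    if (n < 2 || (s[1] & 0xc0) != 0x80)
        return 0;
    if ((s[0] & 0xfe) == 0xc0)                   /* 1100000x 10xxxxxx */
        return 1;
    if (s[0] == 0xe0 && (s[1] & 0xe0) == 0x80)   /* 11100000 100xxxxx */
        return 1;
    if (s[0] == 0xf0 && (s[1] & 0xf0) == 0x80)   /* 11110000 1000xxxx */
        return 1;
    if (s[0] == 0xf8 && (s[1] & 0xf8) == 0x80)   /* 11111000 10000xxx */
        return 1;
    if (s[0] == 0xfc && (s[1] & 0xfc) == 0x80)   /* 11111100 100000xx */
        return 1;
    return 0;
}
```

  Only the start byte and the first continuation byte need to be
  inspected, because every pattern is fully determined by those two.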

For instance, in xterm, all I had to add were the two checks

 /* continuation byte 10xxxxxx found */
 if (screen->utf_char == 0 && ((c & 0x7f) >> (7 - screen->utf_count)) == 0) {
   screen->utf_char = UCS_REPL;
 }

and

 /* start byte 110xxxxx found */
 if ((c & 0x1e) == 0)
   screen->utf_char = UCS_REPL; /* overlong sequence */

at the right place to catch all overlong UTF-8 sequences and replace
them with the REPLACEMENT CHARACTER. (The second "if" checks the start
byte of a 2-byte sequence; the first "if" checks the first continuation
byte of any sequence. Here c is the input byte, screen->utf_char is the
UCS-2 word accumulated so far, and screen->utf_count is the expected
number of remaining continuation bytes, including the current one.)
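
To show where the two checks could sit, here is a self-contained sketch
of a small decoder loop. It is not the xterm source: the function name
utf8_decode, the "bad" flag, and the value 0xFFFD for UCS_REPL are
assumptions for illustration. It handles only 1- to 4-byte sequences and
does not reject surrogates or values above U+10FFFF. The output buffer
must have room for n entries.

```c
#include <assert.h>
#include <stddef.h>

#define UCS_REPL 0xFFFDul   /* U+FFFD REPLACEMENT CHARACTER (assumed value) */

/* Decodes n input bytes into out[] and returns the number of code
 * points written (never more than n). Hypothetical sketch. */
size_t utf8_decode(const unsigned char *in, size_t n, unsigned long *out)
{
    unsigned long utf_char = 0;  /* value accumulated so far */
    int utf_count = 0;           /* continuation bytes still expected */
    int bad = 0;                 /* set once the sequence is known overlong */
    size_t i, m = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        unsigned char c = in[i];

        if ((c & 0xc0) == 0x80) {            /* continuation byte 10xxxxxx */
            if (utf_count == 0) {
                out[m++] = UCS_REPL;         /* unexpected continuation byte */
                continue;
            }
            /* first check from the message: overlong sequence */
            if (utf_char == 0 && ((c & 0x7f) >> (7 - utf_count)) == 0)
                bad = 1;
            utf_char = (utf_char << 6) | (unsigned long)(c & 0x3f);
            if (--utf_count == 0)
                out[m++] = bad ? UCS_REPL : utf_char;
            continue;
        }

        if (utf_count > 0)
            out[m++] = UCS_REPL;             /* previous sequence was cut short */
        utf_count = 0;
        utf_char = 0;
        bad = 0;

        if (c < 0x80) {
            out[m++] = c;                    /* ASCII byte */
        } else if ((c & 0xe0) == 0xc0) {     /* start byte 110xxxxx */
            utf_count = 1;
            utf_char = c & 0x1f;
            /* second check from the message: overlong 2-byte sequence */
            if ((c & 0x1e) == 0)
                bad = 1;
        } else if ((c & 0xf0) == 0xe0) {     /* start byte 1110xxxx */
            utf_count = 2;
            utf_char = c & 0x0f;
        } else if ((c & 0xf8) == 0xf0) {     /* start byte 11110xxx */
            utf_count = 3;
            utf_char = c & 0x07;
        } else {
            out[m++] = UCS_REPL;             /* longer forms, 0xfe, 0xff: not handled here */
        }
    }
    if (utf_count > 0)
        out[m++] = UCS_REPL;                 /* input ended in mid-sequence */
    return m;
}
```

With this sketch, the bytes C0 AF, E0 80 AF and F0 80 80 AF each decode
to a single U+FFFD, while E2 82 AC still decodes to U+20AC.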

Summary: Adding a safety check to a UTF-8 decoder such that ASCII
compatibility of the second kind is ensured is really trivial, and the
example code on the Unicode ftp site should definitely be corrected
accordingly.

Test your UTF-8 decoder with the attached file! It is very likely that
you will discover strange bugs this way. I haven't yet seen a UTF-8
decoder that was really correct the first time I tested it. Most of
those I saw treat malformed UTF-8 sequences very badly, xterm being a
notable exception. Netscape is one of the worst and does not even get
past test 2.1.1.

The test file is also available as

  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

or in

  http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz

in the examples/ directory. Both directories contain many more
interesting UTF-8 test files, especially for font proof-reading.

Happy decoder testing ...

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>



