UTF-8 stress test

From: Markus Kuhn ([email protected])
Date: Wed Apr 14 1999 - 17:29:34 EDT


Maybe you'll find this useful:

UTF-8 decoder capability and stress test
----------------------------------------

Markus Kuhn <[email protected]> - 1999-04-14

This test text examines how UTF-8 decoders handle various types of
corrupted or otherwise interesting UTF-8 sequences. According to ISO
10646-1, sections R.7 and 2.3c, a device receiving UTF-8 shall
interpret a "malformed sequence in the same way that it interprets a
character that is outside the adopted subset".
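
As a minimal sketch of the behaviour described above, the following uses
Python 3's built-in UTF-8 codec as the decoder under test. That choice is
an assumption (the text names no implementation), and the helper names
decode_lenient and is_well_formed are hypothetical. With errors="replace"
this codec substitutes U+FFFD for malformed input instead of raising.

```python
# Sketch: how one particular decoder (CPython's built-in codec) treats
# well-formed and malformed input. The sample bytes are illustrative.
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for anything malformed."""
    return data.decode("utf-8", errors="replace")


def is_well_formed(data: bytes) -> bool:
    """Return True if strict decoding accepts the bytes."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


good = "κόσμε".encode("utf-8")   # correct UTF-8 text
lonely_continuation = b"\x80"    # continuation byte with no lead byte
truncated = b"\xe0\xa0"          # 3-byte sequence with last byte missing

for label, data in [("good", good),
                    ("lonely continuation", lonely_continuation),
                    ("truncated", truncated)]:
    print(label, is_well_formed(data), repr(decode_lenient(data)))
```

Other decoders may substitute a different character or a different number
of substitutes per malformed sequence, so the loop prints the result
rather than asserting one fixed rendering.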

Test sequences (all enclosed in ""):

Correct UTF-8 text (Greek word 'kosme'): "κόσμε"
Correct 2-byte sequence (U+00000080): "€"
Correct 3-byte sequence (U+00000800): "ࠀ"
Correct 4-byte sequence (U+00010000): "𐀀"
Correct 5-byte sequence (U+00200000): "�����"
Correct 6-byte sequence (U+04000000): "������"
Correct 2-byte sequence (U+000007ff): "߿"
Correct 3-byte sequence (U+0000ffff): "￿"
Correct 4-byte sequence (U+001fffff): "����"
Correct 5-byte sequence (U+03ffffff): "�����"
Correct 6-byte sequence (U+7fffffff): "������"
Correct 2-byte sequence (U+0000): "��"
Correct 3-byte sequence (U+0000): "���"
Correct 4-byte sequence (U+0000): "����"
Correct 5-byte sequence (U+0000): "�����"
Correct 6-byte sequence (U+0000): "������"
Unexpected continuation byte (10000000): "�"
Another lonely continuation byte (10111111): "�"
Sequence of 2 unexpected continuation bytes: "�"
Sequence of 3 unexpected continuation bytes: "��"
Sequence of 4 unexpected continuation bytes: "���"
Sequence of 5 unexpected continuation bytes: "����"
Sequence of 6 unexpected continuation bytes: "�����"
Sequence of 7 unexpected continuation bytes: "�������"
Sequence of all 64 possible continuation bytes (10000000-10111111):
"����������������
 ����������������
 ����������������
 ����������������"
Sequence of all 32 first bytes of 2-byte sequences (11000000-11011111),
each followed by a space character:
"� � � � � � � � � � � � � � � �
 � � � � � � � � � � � � � � � � "
Sequence of all 16 first bytes of 3-byte sequences (11100000-11101111),
each followed by a space character: "� � � � � � � � � � � � � � � � "
Sequence of all 8 first bytes of 4-byte sequences (11110000-11110111),
each followed by a space character: "� � � � � � � � "
Sequence of all 4 first bytes of 5-byte sequences (11111000-11111011),
each followed by a space character: "� � � � "
Sequence of all 2 first bytes of 6-byte sequences (11111100-11111101),
each followed by a space character: "� � "
Impossible byte (11111110): "�"
Impossible byte (11111111): "�"
2-byte sequence with last byte missing: "�"
3-byte sequence with last byte missing: "��"
4-byte sequence with last byte missing: "���"
5-byte sequence with last byte missing: "����"
6-byte sequence with last byte missing: "�����"
All these 5 sequences with last byte missing concatenated:
"���������������"

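The raw bytes behind many of the test sequences above can be regenerated
programmatically, which helps when a copy of this text has had its
non-ASCII bytes altered in transit. The sketch below assumes the original
1- to 6-byte UTF-8 scheme; encode_legacy, LIMITS, LEAD and tests are
hypothetical names, and the code point used for the "last byte missing"
cases (U+0000) is an assumption. Note that current UTF-8 (RFC 3629) stops
at 4 bytes and U+10FFFF, and that multi-byte encodings of U+0000 are
overlong forms which current decoders reject.

```python
# Largest code point representable at each sequence length (index = length).
LIMITS = [0, 0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
# Marker bits of the lead byte for each sequence length (index = length).
LEAD = [0, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode_legacy(cp, length=0):
    """Encode cp in the 1- to 6-byte scheme.

    length=0 picks the shortest form; a larger explicit length forces
    a longer (overlong) form of the same code point.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if length == 0:
        length = next(n for n in range(1, 7) if cp <= LIMITS[n])
    elif not 1 <= length <= 6 or cp > LIMITS[length]:
        raise ValueError("code point does not fit in that length")
    if length == 1:
        return bytes([cp])
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # 10xxxxxx continuation byte
        cp >>= 6
    return bytes([LEAD[length] | cp] + tail[::-1])


tests = {
    "first 2-byte (U+0080)": encode_legacy(0x80),
    "first 6-byte (U+04000000)": encode_legacy(0x4000000),
    "last 6-byte (U+7FFFFFFF)": encode_legacy(0x7FFFFFFF),
    "2-byte form of U+0000": encode_legacy(0, 2),
    "all 64 continuation bytes": bytes(range(0x80, 0xC0)),
    "2-byte lead bytes, each + space":
        b"".join(bytes([b, 0x20]) for b in range(0xC0, 0xE0)),
    "impossible bytes": b"\xfe\xff",
    # Assumption: truncate the 3-byte form of U+0000.
    "3-byte, last byte missing": encode_legacy(0, 3)[:-1],
}

for label, data in tests.items():
    print(f"{label}: {data.hex(' ')}")
```

Feeding each value of tests to the decoder under test, rather than pasting
text, keeps the exact byte values intact.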
-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


