DRAFT - L2/02-149

Proposed Draft
Unicode Technical Report #XX

Unicode Compliance Testing

Version	1
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	2001-04-11
This Version	n/a
Previous Version	n/a
Latest Version	n/a
Tracking Number	1

Summary

This document describes guidelines for testing programs and systems to see if they support Unicode, and the level of support that they offer.

Status

This document is a proposed draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A list of current Unicode Technical Reports is found on [Reports]. For more information about versions of the Unicode Standard and how to reference this document, see [Versions].

1 Introduction
2 Basic_Conformance
3 Character_Conversion
4 Protocols
5 Programming_Support
6 Analysis
7 Comparison
8 Transformations
9 Keyboard_Input
10 Rendering
References
Acknowledgements
Modifications

1 Introduction

This document describes guidelines for testing programs and systems to see if they support Unicode, and if so, to determine the level of support that they offer. These guidelines explicitly do not test for general internationalization or localization capabilities; those are out of scope for this document.

[TBD: add reasons for testing: e.g. companies wish to assemble systems that handle Unicode correctly.]

Unicode is a very fundamental technology and will appear in many different products: from operating systems to databases, from digital cameras to online games. Thus any tests for Unicode capabilities must be tailored to the specific type of product. Moreover, many of the requirements for Unicode compliance are only applicable to particular products. BIDI conformance, for example, may not be applicable if the product never displays text, but only processes it. Thus all of the following guidelines can only be applied to products that support or require the relevant kinds of processing.

In some cases below, tests are provided for features that are not required for conformance to the Unicode Standard, but are in practice part of what would be expected of a Unicode-capable program or system.

2 Basic Conformance

The most fundamental requirements for Unicode conformance are the following:

Roundtrip: Data is not corrupted or lost
- preserves unassigned characters
- may or may not preserve noncharacters, surrogate code points, canonical forms
Repertoire: General operations are not restricted in repertoire
- e.g. a database Select works on supplementary characters
Canonical Equivalence
- programs gets same results with canonically equivalent text. There are two strategies for accomplishing this:
  - prenormalize all text (and normalize text that results from programmatic manipulations
  - normalize internal to operations: comparing text, rendering it, etc.

Tests

Compliance tests for the first of these are fairly straightforward, with programs that store and retrieve data, such as databases. Here is one example:

Build a small table, insert Unicode data, select the data from the table and compare the results. For instance, use the following SQL statements to create a table named "langs", insert data, select all data and search for one record:

SQL Statements	Results
`drop table langs;`	`The SQL command completed successfully.`
`create table langs (L1 character(10), L2 varchar(18));`	`The SQL command completed successfully.`
`insert into langs values ('Russian', '`русский`');`	`The SQL command completed successfully.`
`insert into langs values ('Spanish', '`Español`');`	`The SQL command completed successfully.`
`insert into langs values ('Czech', '`čeština`');`	`The SQL command completed successfully.`
`insert into langs values ('Greek', '`ελληνικά`');`	`The SQL command completed successfully.`
`insert into langs values ('Japanese', '`日本語`');`	`The SQL command completed successfully.`
`insert into langs values ('Vietnamese', '`Tiểng Việt`');`	`The SQL command completed successfully.`
`select * from langs;`	`L1 L2 ---------- ------------------ Russian` русский`Spanish` Español`Czech` čeština`Greek` ελληνικά`Japanese` 日本語`Vietnamese`Tiểng Việt `6 record(s) selected.`
`select * from langs where L2 like '%`λη`%';`	`L1 L2 ---------- ------------------ Greek` ελληνικά`1 record(s) selected.`

3 Character Conversion

This section is optional, since not every product does--or needs to do--conversion. However, if a product does do conversion, here are the areas to test for:

For all supported character conversions, does it handle illegal and unassigned sequences correctly.
UTF Converters
- If UTF-16, UTF-8, or UTF32 are supported, do they correctly handle BOM.
- If UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE are supported
ISO/IEC 8859 conversions
- If products do provide conversion outside of UTFs, then normally they should at least do the 8859 series (or at least Latin-1 = 8859-1). Those can be tested with the Unicode tables as discussed below.
Others
- This is optional, since most other encodings do not have unique defintions (see UTR #22). Where there is a well-accepted unique definition of an encoding, that can also be tested for. The ICU character mapping tables can serve as a guide.
- Compressions (such as SCSU UTS #6) can also be tested.

Tests

For ISO/IEC 8859 tests, download the files in http://www.unicode.org/Public/MAPPINGS/ISO8859/. The files are of the following format, with two significant fields: the first is a byte, and the second is a code point.

0x00 0x0000 # NULL
...
0xFF 0x00FF # LATIN SMALL LETTER Y WITH DIAERESIS

Conform that the converter converts each of the bytes in the first field to the code point in the second field, and back.
Generate a random selection of code points that are outside of the values in the second field, and convert them. This should at least include U+0212, U+FFFD, U+FFFF, U+10FFFD, and U+10FFFF. Confirm that these generate one of the following (depending on converter options):
- a substitution character (e.g. 0x1A)
- an escape (e.g. Ȓ)
- an error
If the converter distinguishes between illegal (source) values and unassigned values (in the target set), verify that the appropriate responses are generated:
- unassigned: U+0212, U+FFFD, U+10FFFD
- illegal: U+FFFF, U+10FFFF

For compression tests and UTF tests (and if CESU-8 is supported), for each converter:

Verify that every code point from U+0000 to U+10FFFF converts to the UTF and back, returning the same results (e.g. round-trips).
For real text samples, download the text of the files in http://www.unicode.org/standard/WhatIsUnicode.html. Verify that they also round-trip.
Verify that 'code point' values from 0x110000 to 0xFFFFFFFF (incrementing by 0x12345) are treated as illegal.
Verify that illegal code unit sequences treated as illegal.
- UTF-32: the numeric values in #3
- UTF-8: (based on Table 3.1B in UAX #28)
  - 1st bytes 80..C1, F5..FF
  - 2nd-4th bytes outside of the ranges given in 3.1B, according to each 1st byte.
Verify correct conversion
- UTF-8: for each range in the left side of 3.1B, the lowest range in the subsequent table is produced.
  - e.g. U+1000 => E1, 80
- UTF-16*: U+10000 => D800, DC00; U+10FFFF => DBFF, DFFF
BOM:
- In the following table, convert the Bytes column according to the Encoding. The result should match the Code Points column.

Bytes	Encoding	Code Points
EF BB BF E1 88 B4	UTF-8	1234
EF BB BF E1 88 B4	UTF-16/LE/BE	EFBB BFE1 88B4
EF BB BF E1 88 B4	UTF-32/LE/BE	error
FE FF 12 34	UTF-16	1234
FE FF 12 34	UTF-16BE	FEFF 1234
FE FF 12 34	UTF-16LE/UTF-32*	error
FF FE 34 12	UTF-16	1234
FF FE 34 12	UTF-16LE	FEFF 1234
FF FE FF FE 34 12	UTF-16	FEFF 1234
FE FF FE FF 12 34	UTF-16	FEFF 1234

Testing other encodings

Proposal: For major encodings (charset in the sense of IANA) we provide a file with:

Mappings between Unicode code points and byte sequences from that encoding.
Byte sequences from the encoding that are legal (but unassigned)
Byte sequences from the encoding that are illegal

A sample is at sjistest.zip. The mappings are filtered so that they are neutral regarding variation between different encodings with the same name. For example, this sample excludes Private-Use (aka User-Defined codes), "special" characters not defined in the references, and mappings that are known to be ambiguous such as for U+005C

A test would read such a file and verify that the roundtrip-mappings work as specified. Legal (but unassigned) sequences are converted to some Unicode code point: either the replacement character (U+FFFD), the SUB character (U+001A), or a private use character. Illegal sequences are treated as an error condition.

4 Protocols

Protocols should follow these guidelines:

Allow transmission of Unicode (at least one UTF) in any text field.
- Exceptions are items like part numbers or codes.
Never restrict the repertoire of any text field.
- This implies that a field is never restricted to a single UTF-8 or UTF-16 code unit.
Either fully specify the UTF (including endianness) in the protocol (recommended) or use BOM
Specify whether text must be normalized or not; and if so, which normalization form (NFD, NFC, NFKD, NFKC).
Except in very limited circumstances, never restrict a field to single code point (e.g. restricting a currency symbol to a single code point is an error)

Tests

SMTP (with/without MIME) is given as a simple example. For SMTP, there are sending and receiving clients easily available: email applications like Outlook Express and Netscape Messenger.

Proposed test scenario using such a client program:

Create a new document/email.
Set the format to either plain text to test SMTP just by itself or to rich text to test SMTP+MIME.
Set the encoding of the email to UTF-8.
Include in the email body characters from multiple scripts, e.g., Latin, Cyrillic, Arabic, Hindi, Hiragana, Han ideographs, Deseret.
Send the email so that it is processed by the test object.

Sample text for the email body:

Latin: U+00FE ð
Cyrillic: U+0436 ж
Arabic: U+0628 ب
Hindi: U+0905 अ
Hiragana: U+3042 あ
Han Ideograph: U+4E0A 上
Deseret (plane 1): U+1040C 𐐌
Han Ideograph (plane 2): U+20021 𠀡

Test of an SMTP server:

Verify that the email contents is preserved when stored+forwarded through this server.
Requires proper configuration of the email client/network.
In this case, as with some other protocols, the server will almost always just pass the contents through. The test will thus just verify that the server is 8-bit clean, which is almost always the case.

Test of an SMTP client:

Send the email to an address that is handled by a particular client program. Make sure that the text is fully preserved and displayed in a reasonable way (given available fonts etc.).
Example for email clients that are expected to have problems in this area: Eudora, Netscape 4.x (do not use Unicode internally, so must convert to subset-charsets).

Test of an email gateway:

Some email systems (Lotus Notes, X.500, VM) use other protocols than SMTP and transform emails between SMTP and their own formats.
Send emails into such systems and forward/reply them back to a globalization-capable client and verify full roundtrip of the text.
Example for gateways/systems that are expected to have problems: VM (EBCDIC encodings are subsets of Unicode)

Note: Lotus Notes is globalization-capable (should pass the test) because LMBCS can encapsulate Unicode; it will not fully roundtrip arbitrary MIME/HTML formatting, but this is out of scope for G11N certification. All of the characters should roundtrip.

Test of a non-SMTP email client:

A non-SMTP email client would have to get the test email through such a gateway. It is possible that the client may show an email with higher or lower fidelity compared with the roundtrip test into and out of the gateway. Higher if only the second part of the roundtrip were to lose information. Lower if the roundtrip can preserve some or all of the original contents in a form that is not displayed in the non-SMTP client.

Other ways to test protocols

With many IETF (Internet) protocols it is possible to test at least some of the protocol elements using a telnet client or a special-purpose client (e.g., Java application reading/writing to sockets) by reading and writing plain text streams directly, and using UTF-8 text for the contents.

Generally, it may be necessary to write custom test clients/servers to perform meaningful tests of a protocol at all or to automate such tests.

Some protocols (like HTTP) allow many more charsets in direct use than SMTP.
"Direct use" means that UTF-16 is possible in SMTP emails only after a base64-transformation (or quoted-printable), while HTTP allows the contents to be encoded in UTF-16 directly in the byte stream.

SOAP Tests

SOAP is an XML vocabulary being defined by W3C. A SOAP message consists of SOAP envelope, SOAP header and SOAP body. The SOAP body contains user data which is used for RPC function.

<SOAP-ENV:Envelope>
 <SOAP-ENV:Header>
  Additional Information for SOAP message transmission
 </SOAP-ENV:Header>
 <SOAP-ENV:Body>
  Body data of SOAP-RPC message transmission
 </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

Sample 1.

SOAP request between service requester and UDDI service provider

All XML char MUST be encoded in UTF-8 in SOAP envelope because UDDI SOAP interface MUST use UTF-8.
MIME MUST be "Content-type : text/xml; charset=UTF-8" as http header.
Max name data is FIVE according to the UDDI V2.0 specification.

POST /uddisoap/publishapi HTTP/1.1
Host: abc.def.com
Content-Type: text/xml; charset=utf-8
Content-Length: nnn
SOAPAction: ""

<?xml version="1.0" encoding="UTF-8" ?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>  
    <save_business generic="2.0" xmlns="urn:uddi-org:api_v2">    
      <authInfo>uddiUser</authInfo>
      <businessEntity businessKey="">
         <name xml:lang="ru">русский</name>
         <name xml:lang="cs">čeština</name>
         <name xml:lang="el">ελληνικά</name>
         <name xml:lang="ja">日本語</name>
         <name xml:lang="vi">Tiểng Việt</name>
      </businessEntity>  
    </save_business>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

Sample 2.

The expected SOAP message from UDDI service provider for Example 1 above.

All XML characters MUST be encoded in UTF-8 in SOAP envelope because the UDDI SOAP interface MUST use UTF-8.
MIME MUST be "Content-type : text/xml; charset=UTF-8" as http header.

HTTP/1.1 200 OK
Server: ABC
Content-Type: text/xml; charset="utf-8"
Content-Length: nnnn
Connection: close

<?xml version="1.0" encoding="UTF-8" ?>
  <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
    <SOAP-ENV:Body>
     <businessDetail generic="2.0" xmlns="urn:uddi-org:api_v2" operator="operator">
       <businessEntity businessKey="14821BDD-00EA-4398-8003-24BC35F0394A" operator="operator" authorizedName="uddiUser">
        <discoveryURLs>
          <discoveryURL useType="businessEntity">http://abc.def.com:9080/uddisoap/get?businessKey=14821BDD-00EA-4398-8003-24BC35F0394A
          </discoveryURL>
         </discoveryURLs>
         <name xml:lang="ru">русский</name>
         <name xml:lang="cs">čeština</name>
         <name xml:lang="el">ελληνικά</name>
         <name xml:lang="ja">日本語</name>
         <name xml:lang="vi">Tiểng Việt</name>
      </businessEntity>
    </businessDetail>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

Sample 3.

Java program (SaveBusinessExample.java) using SOAP interface in UDDI4J to generate sample 1.

5 Programming Support

Programming Language support includes both the basic programming language, and libraries that supplement the basic support with additional functionality. Thus, for example, even though the basic support in C for Unicode is fairly rudimentary, there are supplementary libraries that provide full-featured Unicode support.

Testing for full internationalization support is beyond the scope of this document, but the language (supplemented by libraries) can be tested for the following.

Are the fundamental string functions supported, for at least one UTF?
- concatenation
- substring
- iteration by code point*
- extraction of code point*
- binary comparison (in code point* order)
Are string literals supported for UTF-8, UTF-16, UTF-32 encoding forms?
Is the support for Unicode at least as good as for other encodings?
Is conversion supported? (see Character Conversion)
In the debugger, are Unicode characters displayed reasonably?
In the program editor, are Unicode characters displayed? Can you search for them?

* This does not imply that there cannot be code unit functions as well.

Tests

The StringTest.txt file contains machine-readable tests for code point operations.

6 Analysis

Analysis includes character properties, regular expressions, and boundaries (grapheme cluster, word, line sentence breaks). In this area, typically the tests will check against the UCD properties, plus the guidelines for how those properties are used. The exact formulation of the test will depend on the API and language involved. The main features to test for are:

If property APIs are available, do they match the UCD definitions? (see UCD)
- In particular, this includes the properties in PropertyAliases.txt
- Decimal values UCD
Regular Expressions: do these offer at least Level 1 support.
Char/Word/Sentence Breaks: do these match the guidelines?
Well-Formed Tests
- Test any API that purports to detect valid IDS with IDSTest.txt
- Test any API that purports to detect valid Annotation sequences with AnnotationTest.txt

Tests

For testing Unicode properties, a small test program should be written that for each property:

parses the relevant file(s) in the UCD
collects the property values for each code point
for each code point from U+0000 to U+10FFFF
- calls the programmatic API,
- verifies the property value against the UCD value.

For regular expressions, UTR #18 provides 3 levels for regular expressions. The feature sets in these levels can be tested for explicitly. Note: the TR does not require any particular syntax, so any tests have to be adapted to the syntax of the regular expression engine.

[TBD: Char/Word/Sentence Breaksplaceholder, once UTR #29 gets further along.]

7 Comparison

Comparison includes both binary comparison, and comparison based on UCA (UTS #10). In the latter case, it includes string comparison, string search, and sortkey generation.

Tests

Binary Comparison: There are two common binary comparison orders: UTF-16 order and code point order. The test file [TBD] has the following format:

0061; 0062; LESS; LESS;
FFFF; 10FFFF; LESS; GREATER;
FFFF; 10FFFF; LESS; GREATER;

Collation (UCA UTS #10): Verify that the default collation sequence follows the UCA, using the test files in http://www.unicode.org/Public/BETA/UCA/

String Search: Verify that the locale-sensitive string search functions follow the UCA, according to StringSearchTest.txt.

8 Transformations

Transformations include case mapping/folding and normalization,

Case Mapping/Folding UAX #21
- [TBD]
Normalization UAX #15
- Test any supported forms according to Conformance_Testing

9 Keyboard Input and Editing

The main goals of keyboard input and editing tests are to verify that:

characters are not corrupted with editing (e.g. half-chars)
all the characters for documented repertoire are allowed
selection/mouse actions match characters that will be affected by operations

Test

Do the following for each C in the following list. If C is a sequence, it should still behave in editing as if it is a single letter.
- Input C.
- Backspace. The display should now be empty.
- Input C.
- Right Arrow. Cursor should now be before C
- Delete. The display should be empty
- Input C multiple times, until the line wraps. Check that the character at the end of the line and at the start of the next line is not damaged.
C:
- U+3041 HIRAGANA LETTER SMALL A
- U+3094 HIRAGANA LETTER VU
- U+3095 HIRAGANA LETTER SMALL KA
- U+30A1 KATAKANA LEETER SMALL A
- U+30FA KATAKANA LETTER VO
- U+4E00
- U+3400
- U+20000
- U+FF01
- <U+30AB U+3099>: should behave the same as U+30AC (ガ) KATAKANA LETTER GA
- <U+30CF U+309A>: should behave the same as U+30D1 (パ) KATAKANA LETTER PA
- <U+304B U+309A> (HIRAGANA KA with ATAKANA-HIRAGANA SEMI-VOICED SOUND)

10 Rendering

The goals of rendering tests are to verify that for the repertoire supported by the product:

Combining marks placed appropriately with respect to their base character.
Canonically-equivalent text displayed identically?
Lines break in accordance with LineBreak UAX #14, unless documented otherwise.
- (Certain characters, such as ZWNBSP, are marked as normative in UAX #14. They must always exhibit the documented breaking behavior.)
The BIDI algorithm is followed UAX #9
The EA Width algorithm is followed UAX #11
Shaping (Arabic, Devanagari, Tamil, etc., with examples of Joiner/Non-joiner behavior) is minimally acceptable.
For all of the above, printing should match screen display.

Tests

Testing rendering behavior is not generally possible programmatically. There is simply too much variation in the possible acceptable behavior. Moreover, if a system is not documented as supporting a given repertoire of characters (such as Hebrew), then tests of that repertoire are not applicable. The following, however, does provide some guidelines in assessing correct behavior for a supported repertoire.

Non-spacing mark placement

Code Point Sequence	Unacceptable Rendering	Acceptable Rendering
Code Point Sequence	Unacceptable Rendering	Preferred	Fallback
`U+006C, U+006C, U+0323`
`U+006C, U+006C, U+0302`
`U+006F, U+0302, U+0301`

[TBD: Add Thai example]

Linebreak

[TBD: list code points, with break opportunities marked by a vertical bar. Example:

0061 0020 | 0061 # break after spaces]

BIDI

[TBD: Give acceptable and unacceptable Arabic and Hebrew examples]

East Asian Width

[TBD: Give examples of unabashedly wide or narrow. Probably to difficult to test the ambiguous cases.]

Shaping

[TBD: Give acceptable and unacceptable examples of Devanagari, Arabic, Tamil]

References

[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/unicode/faq/ For answers to common questions on technical issues.
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[Reports]	Unicode Technical Reports http://www.unicode.org/unicode/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[U3.1]	Unicode Standard Annex #27: Unicode 3.1 http://www.unicode.org/unicode/reports/tr27/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/unicode/standard/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Acknowledgements

Thanks to Helena Shih Chapman, Julius Griffith, Kentaroh Noji, Markus Scherer, Baldev Soor, and Israel Gidali for sample tests and feedback.

Modifications

The following summarizes modifications from the previous version of this document.

6	Initial Version

Copyright © 2002 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

DRAFT - L2/02-149

Proposed Draft Unicode Technical Report #XX

Unicode Compliance Testing

Summary

Status

Contents

1 Introduction

2 Basic Conformance

Tests

3 Character Conversion

Tests

Testing other encodings

4 Protocols

Tests

Proposed test scenario using such a client program:

Test of an SMTP server:

Test of an SMTP client:

Test of an email gateway:

Test of a non-SMTP email client:

Other ways to test protocols

SOAP Tests

Sample 1.

Sample 2.

Sample 3.

5 Programming Support

Tests

6 Analysis

Tests

7 Comparison

Tests

8 Transformations

9 Keyboard Input and Editing

Test

10 Rendering

Tests

Non-spacing mark placement

Linebreak

BIDI

East Asian Width

Shaping

References

Acknowledgements

Modifications

Proposed Draft
Unicode Technical Report #XX