Working DRAFT - L2/03-196

Proposed Draft
Unicode Technical Report #XX

Assessing Unicode Support

Version	6
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	2003-06-09
This Version	n/a
Previous Version	n/a
Latest Version	n/a
Tracking Number	1

Summary

This document describes guidelines for testing programs and systems to see if they support Unicode, and the level of support that they offer. It is anticipated that subsequent versions of this document will be expanded, especially to reflect progress in the development of UTR #23: Character Properties and UTR #30: Character Foldings.

Status

This document is a proposed draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1 Introduction
- 1.1 Conformance Issues
- 1.2 Guidelines
2 Basic Requirements
- 2.1 Canonical Equivalence
- 2.2 Basic Requirements Tests
3 Character Conversion
- 3.1 Character Conversion Tests
4 Protocol Guidelines
- 4.1 Protocol Guideline Sample Tests
5 Programming Support
6 Analysis
- 6.1 Analysis Tests
7 Comparison
- 7.1 Comparison Tests
8 Transformations
- 8.1 Transformation Tests
References
Acknowledgements
Modifications

1 Introduction

In today's world, software components must interact with a wide variety of other components. Systems often consist of components running on different machines and different platforms, all communicating with one another in complex ways. Unicode is fundamental to providing seamless support of all world languages, and it will appear in many different products: from operating systems to databases, from digital cameras to online games. When assembling systems, it is crucial to be able to ensure that all the different components of a system support Unicode; otherwise weak links may degrade or disable the internationalization support offered by the system as a whole.

This document describes guidelines for testing programs and systems to see if they support Unicode, and if so, to determine the level of support that they offer. These guidelines explicitly do not test for general internationalization or localization capabilities; those are out of scope for this document.

Because Unicode is such a fundamental technology, any tests for Unicode capabilities must be tailored to the specific type of product. Moreover, many of the requirements for Unicode support are only applicable to particular products. BIDI conformance, for example, may not be applicable if the product never displays text, but only processes it. Thus all of the following guidelines can only be applied to products that support or require the relevant kinds of processing described in each section.

1.1 Conformance Issues

Any assessment of Unicode support must start with conformance to the Unicode Standard itself. The Unicode Standard is a very large and many-faceted standard. Because of this, and because of the nature of the material in the standard, it may not be clear how to test for conformance to the Unicode Standard.

A conformance test for the Unicode Standard is a list of data certified by the UTC to be "correct" in regard to some particular requirement for conformance to the standard. In some instances, for example, in the implementation of the bidirectional algorithm, producing a definitive list of correct results is difficult or impossible, and in such cases, a conformance test may itself consist of an implemented algorithm certified by the UTC to produce correct results for any pertinent input data. Conformance tests for the Unicode Standard are essentially benchmarks that someone can use to determine if their algorithm, API, etc., claiming to conform to some requirement of the standard, does in fact match the data that the UTC claims defines such conformance.

Some formal standards are developed once and then are essentially frozen and stable forever. For such standards, stability of content and the corresponding stability of conformance claims is not an issue. For a large, complex standard aimed at the universal encoding of characters, such as the Unicode Standard, such stability is not possible. The standard is necessarily evolving and expanding over time, to extend its coverage of all the writing systems of the world. And as experience in its implementation accumulates, further aspects of character processing also accrue to the formal content of the standard. This fundamentally dynamic quality of the Unicode Standard complicates issues of conformance, since the content to which conformance requirements pertain continually expands, both horizontally to more characters and scripts, and vertically to more aspects of character processing.

The Unicode Standard is regularly versioned, as new characters are added. A formal system of versioning is in place, involving major, minor, and update versions, all with carefully controlled rules for the type of documentation required, handling of the associated data files, and allowable types of change between versions. For more information about the details of Unicode versioning see [Versions]. Conformance claims clearly must be specific to versions of the Unicode Standard, but the level of specificity needed for a claim may vary according to the nature of the particular conformance claim being made.

Systems may also claim conformance to specifications that are outside of the Unicode Standard proper, such as to the Unicode Collation Algorithm, or to Unicode Regular Expressions, Level 1. Unicode-conformant software does not need to implement these features: however, if a system purports to support one of these features then it can be tested for that feature. This document does include tests for such features, but they are marked as "optional".

1.2 Guidelines

Because the criteria for conformance to the Unicode Standard apply to a wide range of possible systems, sometimes conformance does not require the same level of quality in behavior or display in each system. For example, a badly-drawn low-resolution depiction of an 'a' is conformant, but would not be acceptable in practice.

Since the types of processes are so varied, in many cases below, precise tests cannot be formulated. In such cases, for example for Protocols, examples are given of the types of tests that can be formulated.

In a number of cases, examples are formulated using code snippets. These snippets are only illustrations; they imply nothing about the use of any particular programming language, nor that any particular syntax is better or worse than any other.

2 Basic Requirements

The most fundamental requirements for Unicode support are the following:

Roundtrip: Data is not corrupted or lost
- preserves unassigned code points
- may or may not preserve noncharacters, surrogate code points, canonically-equivalent forms
Repertoire: General operations are not restricted in repertoire
- e.g. a database Select works on supplementary characters
Canonical Equivalence: Systems respect canonical equivalence
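
As a minimal sketch of what respecting canonical equivalence can mean for a comparison, the following Python fragment treats two strings as equal when their NFC forms match. The helper name canonically_equal is hypothetical and chosen only for illustration; the fragment relies on the standard unicodedata module and is not the only way to meet the requirement.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# A plain code point comparison sees two different strings ...
print(precomposed == decomposed)
# ... while the equivalence-aware comparison treats them as the same text.
print(canonically_equal(precomposed, decomposed))
```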

2.2 Basic Requirements Tests

Roundtrip tests are fairly straightforward for components that store and retrieve data, such as databases. Here is one example:

Build a small table, insert Unicode data, select the data from the table and compare the results. For instance, use the following SQL statements to create a table named "langs", insert data, select all data and search for one record:

SQL Statements	Results
`drop table langs;`	`The SQL command completed successfully.`
`create table langs (L1 character(10), L2 varchar(18));`	`The SQL command completed successfully.`
`insert into langs values ('Russian', 'русский');`	`The SQL command completed successfully.`
`insert into langs values ('Spanish', 'Español');`	`The SQL command completed successfully.`
`insert into langs values ('Czech', 'čeština');`	`The SQL command completed successfully.`
`insert into langs values ('Greek', 'ελληνικά');`	`The SQL command completed successfully.`
`insert into langs values ('Japanese', '日本語');`	`The SQL command completed successfully.`
`insert into langs values ('Vietnamese', 'Tiểng Việt');`	`The SQL command completed successfully.`
`select * from langs;`	`L1 L2 ---------- ------------------ Russian русский Spanish Español Czech čeština Greek ελληνικά Japanese 日本語 Vietnamese Tiểng Việt 6 record(s) selected.`
`select * from langs where L2 like '%λη%';`	`L1 L2 ---------- ------------------ Greek ελληνικά 1 record(s) selected.`
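
The same roundtrip can be automated. The following Python sketch runs the statements above against an in-memory SQLite database, which is only a stand-in for the component under test (an assumption for illustration); the function name run_roundtrip is hypothetical. A real test would connect to the actual database through its own driver.

```python
import sqlite3

ROWS = [
    ("Russian", "русский"),
    ("Spanish", "Español"),
    ("Czech", "čeština"),
    ("Greek", "ελληνικά"),
    ("Japanese", "日本語"),
    ("Vietnamese", "Tiểng Việt"),
]

def run_roundtrip():
    """Insert the sample rows, read them back, and run the LIKE search."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table langs (L1 character(10), L2 varchar(18))")
        conn.executemany("insert into langs values (?, ?)", ROWS)
        stored = conn.execute(
            "select L1, L2 from langs order by rowid"
        ).fetchall()
        found = conn.execute(
            "select L1 from langs where L2 like ?", ("%λη%",)
        ).fetchall()
    finally:
        conn.close()
    return stored, found

stored, found = run_roundtrip()
print(stored == ROWS)  # True only if every string came back unchanged
print(found)           # rows whose L2 value contains the Greek substring
```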

3 Character Conversion

This section is optional, since not every product does--or needs to do--conversion. However, if a product does do conversion, here are the areas to test for:

Does each supported character conversion handle illegal and unassigned sequences according to UTR #22? Such sequences can be handled in a number of different ways; the only requirement is that illegal and unassigned sequences not be present in the output of the character conversion.
Conversions for UTF-8, UTF-16, and UTF-32 must conform to their specifications in the Unicode Standard. In particular, shortest forms must be tested for.
ISO/IEC 8859 conversions (optional)
- If products do provide conversion outside of UTFs, then normally they should at least do the 8859 series (or at least Latin-1 = 8859-1). Those can be tested with the Unicode tables as discussed below.
Others (optional)
- Testing other conversions is optional, since most other encodings do not have unique definitions (see UTR #22). Where there is a well-accepted unique definition of an encoding, that can also be tested for. The ICU character mapping tables can serve as a guide.
- Compressions (such as UTS #6: SCSU or UTN #5: BOCU-1) can also be tested.

3.1 Character Conversion Tests

For ISO/IEC 8859 tests, download the files in http://www.unicode.org/Public/MAPPINGS/ISO8859/. The files are of the following format, with two significant fields: the first is a byte, and the second is a code point.

0x00 0x0000 # NULL
...
0xFF 0x00FF # LATIN SMALL LETTER Y WITH DIAERESIS

Confirm that the converter converts each of the bytes in the first field to the code point in the second field, and back.
Generate a random selection of code points that are outside of the values in the second field, and convert them. This should at least include U+0212, U+FFFD, U+FFFF, U+10FFFD, and U+10FFFF. Confirm that these generate one of the following (depending on converter options):
- a substitution character (e.g. 0x1A)
- an escape (e.g. &#x212;)
- an error
If the converter distinguishes between illegal (source) values and unassigned values (in the target set), verify that the appropriate responses are generated:
- unassigned: U+0212, U+FFFD, U+10FFFD
- illegal: U+FFFF, U+10FFFF
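
The steps above can be sketched in Python as follows. The mapping lines are a short hand-copied sample in the format shown earlier, not the downloaded file, and Python's built-in iso8859-1 codec stands in for the converter under test; both are assumptions for illustration, as are the function names. Python's codec does not distinguish illegal from unassigned values, so that last check is not covered here.

```python
SAMPLE = """\
0x00\t0x0000\t# NULL
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0xE9\t0x00E9\t# LATIN SMALL LETTER E WITH ACUTE
0xFF\t0x00FF\t# LATIN SMALL LETTER Y WITH DIAERESIS
"""

def parse_mapping(text: str) -> dict:
    """Parse 'byte <tab> code point <tab> # name' lines into {byte: code point}."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        byte_field, cp_field = data.split()[:2]
        table[int(byte_field, 16)] = int(cp_field, 16)
    return table

def round_trip_failures(table: dict, codec: str) -> list:
    """Return the bytes that do not convert to the listed code point and back."""
    failures = []
    for byte, cp in table.items():
        decoded = bytes([byte]).decode(codec)
        if decoded != chr(cp) or decoded.encode(codec) != bytes([byte]):
            failures.append(byte)
    return failures

def convert_unmappable(cp: int, codec: str, errors: str):
    """Convert one code point; return the bytes produced, or 'error'."""
    try:
        return chr(cp).encode(codec, errors)
    except UnicodeEncodeError:
        return "error"

OUTSIDE = [0x0212, 0xFFFD, 0xFFFF, 0x10FFFD, 0x10FFFF]

print(round_trip_failures(parse_mapping(SAMPLE), "iso8859-1"))
for cp in OUTSIDE:
    # One converter option per column: error, substitution, escape.
    print("U+%04X" % cp,
          convert_unmappable(cp, "iso8859-1", "strict"),
          convert_unmappable(cp, "iso8859-1", "replace"),
          convert_unmappable(cp, "iso8859-1", "xmlcharrefreplace"))
```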

For compression tests and UTF tests (and if CESU-8 [CESU8] is supported), for each converter:

Verify that every code point from U+0000 to U+10FFFF converts to the UTF and back, returning the same results (i.e., it round-trips).
For real text samples, download the text of the files in http://www.unicode.org/standard/WhatIsUnicode.html. Verify that they also round-trip.
Verify that 'code point' values from 0x110000 to 0xFFFFFFFF (incrementing by 0x12345) are treated as illegal.
Verify that illegal code unit sequences are treated as illegal.
- UTF-32: the numeric values from the previous step (0x110000 to 0xFFFFFFFF)
- UTF-8: (based on Table 3-6. Well Formed UTF-8 Byte Sequences in [Unicode])
  - 1st bytes 80..C1, F5..FF
  - 2nd-4th bytes outside of the ranges given in Table 3-6, according to each 1st byte.
  - For example, <C0 AF> in a UTF-8 input string cannot be interpreted as if it were U+002F SOLIDUS. Acceptable behavior is any of the following: delete the sequence; throw an error; or substitute a replacement character (such as U+FFFD REPLACEMENT CHARACTER, U+001A SUBSTITUTE, or U+003F QUESTION MARK).
Verify correct conversion
- UTF-8: for each range in the left side of Table 3-6, the lowest range in the subsequent table is produced.
  - e.g. U+1000 => E1, 80, 80
- UTF-16*: U+10000 => D800, DC00; U+10FFFF => DBFF, DFFF
BOM:
- In the following table, convert the Bytes column according to the Encoding. The result should match the Code Points column. Note: There is no mechanism in Unicode to determine whether an EF BB BF at the start of UTF-8 represents a BOM or not; there is no different charset name to indicate that. The following test assumes that it does represent a BOM.

Bytes	Encoding	Code Points
EF BB BF E1 88 B4	UTF-8	1234
EF BB BF E1 88 B4	UTF-16/UTF-16BE	EFBB BFE1 88B4
EF BB BF E1 88 B4	UTF-16LE	BBEF E1BF B488
EF BB BF E1 88 B4	UTF-32/LE/BE	error
FE FF 12 34	UTF-16	1234
FE FF 12 34	UTF-16BE	FEFF 1234
FE FF 12 34	UTF-16LE/UTF-32*	error
FF FE 34 12	UTF-16	1234
FF FE 34 12	UTF-16LE	FEFF 1234
FF FE FF FE 34 12	UTF-16	FEFF 1234
FE FF FE FF 12 34	UTF-16	FEFF 1234
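
Several of the UTF checks above can be sketched with Python's built-in codecs standing in for the converter under test (an assumption for illustration; the helper names are hypothetical). Some rows of the table are deliberately left out because of how Python behaves: its utf-16 codec assumes the platform byte order when no BOM is present, and its UTF-16 decoders do not reject U+FFFE. The utf-8-sig codec is used for the UTF-8 row because it treats a leading EF BB BF as a BOM.

```python
import struct

def decode_or_none(data: bytes, codec: str):
    """Return the decoded text, or None when the codec reports an error."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

# Every Unicode scalar value. Surrogate code points are left out because
# Python's strict codecs refuse to encode them.
ALL_SCALARS = "".join(
    chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF
)
UTF_CODECS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

def all_round_trip() -> bool:
    """Convert every scalar value to each UTF and back."""
    return all(
        ALL_SCALARS.encode(codec).decode(codec) == ALL_SCALARS
        for codec in UTF_CODECS
    )

def out_of_range_rejected() -> bool:
    """UTF-32 values from 0x110000 to 0xFFFFFFFF, stepping by 0x12345."""
    return all(
        decode_or_none(struct.pack(">I", value), "utf-32-be") is None
        for value in range(0x110000, 0x100000000, 0x12345)
    )

# (bytes, Python codec, expected text, or None for "error")
BOM_CASES = [
    ("EF BB BF E1 88 B4", "utf-8-sig", "\u1234"),
    ("EF BB BF E1 88 B4", "utf-16-be", "\uEFBB\uBFE1\u88B4"),
    ("EF BB BF E1 88 B4", "utf-32-be", None),
    ("EF BB BF E1 88 B4", "utf-32-le", None),
    ("FE FF 12 34", "utf-16", "\u1234"),
    ("FE FF 12 34", "utf-16-be", "\uFEFF\u1234"),
    ("FE FF 12 34", "utf-32-be", None),
    ("FF FE 34 12", "utf-16", "\u1234"),
    ("FF FE 34 12", "utf-16-le", "\uFEFF\u1234"),
    ("FF FE FF FE 34 12", "utf-16", "\uFEFF\u1234"),
    ("FE FF FE FF 12 34", "utf-16", "\uFEFF\u1234"),
]

def bom_failures() -> list:
    """Return the (bytes, codec) pairs whose result differs from the table."""
    return [
        (hex_bytes, codec)
        for hex_bytes, codec, expected in BOM_CASES
        if decode_or_none(bytes.fromhex(hex_bytes), codec) != expected
    ]

# The non-shortest form <C0 AF> must never be read as U+002F SOLIDUS.
strict_result = decode_or_none(b"\xC0\xAF", "utf-8")
replaced = b"\xC0\xAF".decode("utf-8", errors="replace")

print(all_round_trip(), out_of_range_rejected(), bom_failures())
print(strict_result is None, "/" not in replaced)
```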

4 Protocol Guidelines

Unlike some of the other sections, no specific tests are available for this section. Moreover, only general guidelines can be described for protocols, including the following:

Allow transmission of Unicode (at least one Unicode Encoding Form) in any text field.
- Exceptions are items like part numbers or codes.
Never restrict the repertoire of any text field.
- This implies that a field is never restricted to a single UTF-8 or UTF-16 code unit.
Either fully specify the UTF (including endianness) in the protocol (recommended) or use a BOM.
Specify whether text that is transmitted using the protocol must be normalized on input, or whether it need not be normalized. If text must be normalized, then either the protocol must determine the normalization form (NFD, NFC, NFKD, NFKC), or it must carry a variable that indicates the normalization form.
Except in very limited circumstances, never restrict a field to a single code point (e.g. restricting a currency symbol to a single code point is an error).
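
To illustrate the normalization guideline, the following Python sketch shows a receiver for a purely hypothetical protocol in which each text field is accompanied by a variable naming its normalization form. The function name accept_field and the error handling are assumptions made for this example, not part of any real protocol.

```python
import unicodedata

ALLOWED_FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def accept_field(text: str, declared_form: str) -> str:
    """Accept a text field only if it really is in the declared form.

    Raises ValueError when the form is unknown or the text is not
    normalized as declared (hypothetical protocol rule).
    """
    if declared_form not in ALLOWED_FORMS:
        raise ValueError("unknown normalization form: %r" % declared_form)
    if unicodedata.normalize(declared_form, text) != text:
        raise ValueError("field is not in %s" % declared_form)
    return text

# A multi-code-point value is accepted; the field is not limited to one code point.
print(accept_field("kr", "NFC"))

# The same text is valid under one declared form and rejected under another.
print(accept_field("e\u0301", "NFD") == "e\u0301")
try:
    accept_field("e\u0301", "NFC")
except ValueError as exc:
    print("rejected:", exc)
```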

4.1 Protocol Guideline Sample Tests

SMTP (with/without MIME) is given as a simple example. For SMTP, there are sending and receiving clients easily available: email applications like Outlook Express and Netscape Messenger.

Proposed test scenario using such a client program:

Create a new document/email.
Set the format to either plain text to test SMTP just by itself or to rich text to test SMTP+MIME.
Set the encoding of the email to UTF-8.
Include in the email body characters from multiple scripts, e.g., Latin, Cyrillic, Arabic, Devanagari, Hiragana, Han ideographs, Deseret.
Send the email so that it is processed by the test object.
Receive the email and verify the content.

Sample text for the email body:

Latin: U+00F0 ð
Cyrillic: U+0436 ж
Arabic: U+0628 ب
Devanagari (Hindi): U+0905 अ
Hiragana: U+3042 あ
Han Ideograph: U+4E0A 上
Deseret (plane 1): U+1040C 𐐌
Han Ideograph (plane 2): U+20021 𠀡
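
When a test message has to be generated programmatically rather than typed into a client, something like the following Python sketch can be used. It builds a UTF-8 MIME message with the standard email package, serializes it, parses it again, and compares the body. The addresses are placeholders, the base64 transfer encoding is just one possible choice, and no real SMTP server is contacted; sending the bytes through the component under test is left to the test harness.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# One line per script; written with escapes so the source stays ASCII.
BODY = (
    "Latin: \u00F0\n"
    "Cyrillic: \u0436\n"
    "Arabic: \u0628\n"
    "Devanagari: \u0905\n"
    "Hiragana: \u3042\n"
    "Han: \u4E0A\n"
    "Deseret (plane 1): \U0001040C\n"
    "Han (plane 2): \U00020021\n"
)

def build_message(body: str) -> bytes:
    """Create a UTF-8 text/plain message and return its wire form."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"      # placeholder address
    msg["To"] = "receiver@example.com"      # placeholder address
    msg["Subject"] = "Unicode round-trip test"
    msg.set_content(body, charset="utf-8", cte="base64")
    return msg.as_bytes()

def read_body(raw: bytes) -> str:
    """Parse a received message and return its decoded text body."""
    parsed = message_from_bytes(raw, policy=policy.default)
    return parsed.get_content()

raw = build_message(BODY)
print(read_body(raw) == BODY)  # True when every character survived
```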

Test of an SMTP server:

Verify that the email contents are preserved when stored and forwarded through this server.
Requires proper configuration of the email client/network.
In this case, as with some other protocols, the server will almost always just pass the contents through. The test will thus just verify that the server is 8-bit clean, which is almost always the case.

Test of an SMTP client:

Send the email to an address that is handled by a particular client program. Make sure that the text is fully preserved and displayed in a reasonable way (given available fonts etc.).

Examples of email clients that are expected to have problems in this area: Eudora and Netscape 4.x (they do not use Unicode internally, so they must convert to subset charsets).

Test of an email gateway:

Some email systems (Lotus Notes, X.500, VM) use protocols other than SMTP and transform emails between SMTP and their own formats. Send emails into such systems, forward/reply them back to a globalization-capable client and verify full roundtrip of the text. The following are examples of gateways/systems that are expected to have problems: VM (EBCDIC encodings have a subset of the Unicode repertoire).

Note: Lotus Notes is globalization-capable (should pass the test) because LMBCS can encapsulate Unicode; it will not fully roundtrip arbitrary MIME/HTML formatting, but this is out of scope for G11N certification. All of the characters should roundtrip.

Test of a non-SMTP email client:

A non-SMTP email client would have to receive the test email through such a gateway. The client may show the email with higher or lower fidelity than the roundtrip test into and out of the gateway: higher if only the second part of the roundtrip loses information; lower if the roundtrip preserves some or all of the original contents in a form that is not displayed in the non-SMTP client.

Other ways to test protocols

With many IETF (Internet) protocols it is possible to test at least some of the protocol elements using a telnet client or a special-purpose client (e.g., Java application reading/writing to sockets) by reading and writing plain text streams directly, and using UTF-8 text for the contents.

Generally, it may be necessary to write custom test clients/servers to perform meaningful tests of a protocol at all or to automate such tests.
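
A custom test client of this kind can be very small. The Python sketch below writes UTF-8 text to a socket and reads back whatever the peer returns; a local echo thread over a socket pair stands in for the server under test, which is an assumption made so that the example is self-contained. The function names are hypothetical.

```python
import socket
import threading

def echo(conn: socket.socket) -> None:
    """Stand-in for the component under test: return every byte unchanged."""
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)

def round_trip(text: str) -> str:
    """Send UTF-8 text through the peer and decode what comes back."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=echo, args=(server,))
    worker.start()
    chunks = []
    with client:
        client.sendall(text.encode("utf-8"))
        client.shutdown(socket.SHUT_WR)   # signal end of request
        while True:
            data = client.recv(4096)
            if not data:
                break
            chunks.append(data)
    worker.join()
    return b"".join(chunks).decode("utf-8")

SAMPLE = "\u0436 \u0628 \u0905 \u3042 \u4E0A \U0001040C \U00020021"
print(round_trip(SAMPLE) == SAMPLE)  # True when the stream is 8-bit clean
```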

Some protocols (like HTTP) allow many more charsets in direct use than SMTP. "Direct use" means that UTF-16 is possible in SMTP emails only after a base64-transformation (or quoted-printable), while HTTP allows the contents to be encoded in UTF-16 directly in the byte stream.

SOAP Tests

SOAP is an XML vocabulary being defined by the W3C. A SOAP message consists of a SOAP envelope, a SOAP header, and a SOAP body. The SOAP body contains the user data that is used for the RPC function.

<SOAP-ENV:Envelope>
 <SOAP-ENV:Header>
  Additional Information for SOAP message transmission
 </SOAP-ENV:Header>
 <SOAP-ENV:Body>
  Body data of SOAP-RPC message transmission
 </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

Sample 1.

SOAP request between service requester and UDDI service provider

All XML characters MUST be encoded in UTF-8 in the SOAP envelope because the UDDI SOAP interface MUST use UTF-8.
The HTTP Content-Type header MUST be "text/xml; charset=UTF-8".
The maximum number of names is five, according to the UDDI V2.0 specification.

POST /uddisoap/publishapi HTTP/1.1
Host: abc.def.com
Content-Type: text/xml; charset=utf-8
Content-Length: nnn
SOAPAction: ""

<?xml version="1.0" encoding="UTF-8" ?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>  
    <save_business generic="2.0" xmlns="urn:uddi-org:api_v2">    
      <authInfo>uddiUser</authInfo>
      <businessEntity businessKey="">
         <name xml:lang="ru">русский</name>
         <name xml:lang="cs">čeština</name>
         <name xml:lang="el">ελληνικά</name>
         <name xml:lang="ja">日本語</name>
         <name xml:lang="vi">Tiểng Việt</name>
      </businessEntity>  
    </save_business>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

Sample 2.

The expected SOAP message from the UDDI service provider for Sample 1 above.

All XML characters MUST be encoded in UTF-8 in SOAP envelope because the UDDI SOAP interface MUST use UTF-8.
The HTTP Content-Type header MUST be "text/xml; charset=UTF-8".

HTTP/1.1 200 OK
Server: ABC
Content-Type: text/xml; charset="utf-8"
Content-Length: nnnn
Connection: close

<?xml version="1.0" encoding="UTF-8" ?>
  <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
    <SOAP-ENV:Body>
     <businessDetail generic="2.0" xmlns="urn:uddi-org:api_v2" operator="operator">
       <businessEntity businessKey="14821BDD-00EA-4398-8003-24BC35F0394A" operator="operator" authorizedName="uddiUser">
        <discoveryURLs>
          <discoveryURL useType="businessEntity">http://abc.def.com:9080/uddisoap/get?businessKey=14821BDD-00EA-4398-8003-24BC35F0394A
          </discoveryURL>
         </discoveryURLs>
         <name xml:lang="ru">русский</name>
         <name xml:lang="cs">čeština</name>
         <name xml:lang="el">ελληνικά</name>
         <name xml:lang="ja">日本語</name>
         <name xml:lang="vi">Tiểng Việt</name>
      </businessEntity>
    </businessDetail>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
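
Checking the response against the request can be automated. The Python sketch below parses abbreviated versions of the two envelopes with the standard xml.etree.ElementTree module and compares the name elements by language tag. The envelopes are shortened copies of the samples above, and the helper name extract_names is hypothetical; the HTTP exchange itself is not performed here.

```python
import xml.etree.ElementTree as ET

UDDI_NS = "urn:uddi-org:api_v2"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

NAMES = """
        <name xml:lang="ru">русский</name>
        <name xml:lang="cs">čeština</name>
        <name xml:lang="el">ελληνικά</name>
        <name xml:lang="ja">日本語</name>
        <name xml:lang="vi">Tiểng Việt</name>"""

ENVELOPE = """<?xml version="1.0" encoding="UTF-8" ?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <{element} generic="2.0" xmlns="urn:uddi-org:api_v2">
      <businessEntity businessKey="{key}">{names}
      </businessEntity>
    </{element}>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

# Both messages travel as UTF-8 bytes, as the samples require.
REQUEST = ENVELOPE.format(
    element="save_business", key="", names=NAMES
).encode("utf-8")
RESPONSE = ENVELOPE.format(
    element="businessDetail",
    key="14821BDD-00EA-4398-8003-24BC35F0394A",
    names=NAMES,
).encode("utf-8")

def extract_names(envelope: bytes) -> dict:
    """Map each xml:lang value to the text of its UDDI name element."""
    root = ET.fromstring(envelope)
    return {
        element.get(XML_LANG): element.text
        for element in root.iter("{%s}name" % UDDI_NS)
    }

sent = extract_names(REQUEST)
received = extract_names(RESPONSE)
print(sent == received)  # True when every name came back unchanged
```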

Sample 3.

Java program (SaveBusinessExample.java) using SOAP interface in UDDI4J to generate sample 1.

5 Programming Support

Programming Language support includes both the basic programming language, and libraries that supplement the basic support with additional functionality. Thus, for example, even though the basic support in C for Unicode is fairly rudimentary, there are supplementary libraries that provide full-featured Unicode support. While it would be more efficient and more interoperable if C had the capabilities discussed in section 5.1 below, it is certainly possible to work around those limitations in providing Unicode support based on C. Thus this section is optional.

Typically, code units are treated as primitives (8/16/32-bit unsigned integers); the values are verified at certain defined points in processing. There are two broad strategies for handling this:

If there is a string object with an invariant that the contents are always a valid Unicode encoding form (UTF-8, UTF-16, or UTF-32), then any insertion of an invalid code unit into that sequence would be rejected (e.g. cause an exception).
In the more typical case (at least with 16-bit strings), a string object or character array does not have such an invariant, because it would introduce a significant performance cost and complicate user code. Instead, it is treated as simply a sequence of primitives during manipulation, and invalid values or subsequences are treated as noncharacters. Only when it is explicitly serialized out into a Unicode encoding scheme (UTF-8, UTF-16x, UTF-32x) is it checked for validity.

For more information on the distinction between Unicode strings and Unicode encoding forms, see Chapters 2 and 3 of [Unicode].
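
Python's built-in str type happens to follow the second strategy, which makes it a convenient illustration (this is an observation about Python, not a recommendation of the document): an unpaired surrogate can sit in a string during manipulation, and it is rejected only when the string is serialized into an encoding scheme. The function name try_serialize is hypothetical.

```python
def try_serialize(text: str):
    """Serialize to UTF-8; return the bytes, or None if the text is invalid."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None

# An unpaired lead surrogate is tolerated while the string is manipulated ...
fragment = "ab" + "\ud800"
print(len(fragment))

# ... but validity is checked at the point of serialization.
print(try_serialize(fragment) is None)
print(try_serialize("ab\U00010000"))
```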

5.1 Language Support

The fundamental requirements for good programming language support include at least one datatype (character or string) that can contain the entire repertoire of Unicode characters from U+0000 to U+10FFFF. The names of the datatypes are not important; the ones given below are only examples.

UTF-32 datatype: A Unicode code point datatype whose value space encompasses the entire repertoire of Unicode code points, from U+0000 to U+10FFFF inclusive. For example:

UTF32_t  cp1 = '\U0001D434';
UTF32_t  cp2 = '\u1234';
UTF32_t* sp1 = "ABC \u5678 \U0002F884";

UTF-16 datatype: With the wide variety of Unicode libraries and operating system functions using 16-bit Unicode strings, for interoperability it is incumbent upon a programming language to also supply a UTF-16 datatype, one that contains 16 unsigned bits encompassing a range of integral values from 0x0000 to 0xFFFF inclusive. For example:

utf16_t  cu1 = '\u1234';
utf16_t* su2 = "ABC \u5678 \U0002F884";

String: In some languages there may be no distinct datatype for Unicode code points; it is sufficient that there be the ability to store Unicode code points in strings (see below).

Literals: Although not formally necessary (since any localizable text will be in resources), for full support a language will provide character literal and string literal representations with UTF32 and UTF16 literals, as in the above examples.

5.2 Standard and Supplemental Libraries

Testing for full internationalization support is beyond the scope of this document, but the language (supplemented by libraries) can be tested for the following.

Are the fundamental string operations supported, for at least one Unicode string type?
- concatenation
- substring
- binary comparison (in code point order: see 7 Comparison)
- if the code unit for the standard string type is not UTF32_t:
  - extraction of both code unit and code point
  - iteration both by code unit and code point
Wherever an implementation provides an API with a character parameter (input or output), it should be possible to perform the same function with a datatype which can contain code points from the entire repertoire of Unicode (U+0000 to U+10FFFF).
- Example: If there is a C API isDigit(char x) that tests whether a character is a digit or not, then there must also be some way to test whether characters above 255 are digits. For example, isDigit2(uchar32 x) or isDigit3(char* x), or isDigit4(uchar16* x).
Can the program source be in Unicode?
- This condition is not strictly necessary; however, it makes commenting source code much easier for non-English programmers, and allows for real string literals such as x = "αβγ", using the actual characters instead of hex codes.
Can program variables (e.g. identifiers) be in Unicode?
- This condition is not strictly necessary; however, it makes meaningful variable names much easier for non-English programmers.
  - For English-speaking programmers: imagine how it would be for you if all your variable names had to use Cyrillic characters.
- It is especially important in environments used by non-programmers, such as macro languages in spreadsheets.
- If they are available, test to see that they include all Unicode identifiers.
- Note: it is not required for all Unicode White_Space characters to be supported as equivalent to SPACE in terms of delimiting program elements.
- Note: The Unicode Standard guidelines for identifiers allow Format characters (i.e., General_Category=Cf) in identifiers, but filter them out in comparison with other identifiers. Thus the identifier a<RLM>b would be allowed, but compare as equal to the identifier ab. However, if the programming language does not filter out Format characters in comparison, then they should be disallowed in identifiers.
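
The difference between extraction by code unit and by code point in a 16-bit string can be sketched as follows. Python is used only for illustration; the helper names are hypothetical, and a language with a native UTF-16 string type would offer these operations directly.

```python
import struct

def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of a string as integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def code_points_from_utf16(units: list) -> list:
    """Iterate over code units, combining surrogate pairs into code points."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_lead = 0xD800 <= unit <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            result.append(
                0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            )
            i += 2
        else:
            result.append(unit)   # BMP code point or unpaired surrogate
            i += 1
    return result

sample = "ABC \u5678 \U0002F884"
units = utf16_code_units(sample)
points = code_points_from_utf16(units)
print(["%04X" % u for u in units])
print(["U+%04X" % p for p in points])
```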

5.3 Development Environments

Integrated Development Environments (IDEs), or the tools that can be part of a development environment, are subject to the general requirements for Unicode GUIs. Particular features to watch for are:

In the program editor, are Unicode characters displayed? Can they be entered into all normal dialog boxes, e.g. can you search for them?
In the debugger, are Unicode characters displayed legibly? If a character is not visible (such as a control character), is a hex representation easily accessible?

5.4 Programming Support Tests

The StringTest.txt file contains machine-readable tests for code point operations. These can be used for iteration and extraction.

To test to see if all Unicode identifiers are supported, access the DerivedCoreProperties.txt file. Write a file that has each of the following strings in the context of an identifier. Verify that the resulting file (or files: they may need to be broken up to get past compiler memory limitations) can be successfully compiled and linked.

Each ID_Start character
The letter "a" followed by each ID_Continue character
- If the programming language does not filter Format characters, skip them.

This is also a good test of string literals. For example:

utf16_t* a = "a"; // U+0061 ( a )
...
utf16_t* α = "α"; // U+03B1 ( α ) 
...
utf16_t* 串 = "串"; // U+4E32 ( 串 )
...
utf16_t* 𐐀 = "𐐀"; // U+10400 ( 𐐀 )
...
utf16_t* a̖ = "a̖"; // U+0061 ( a ) U+0316 ( ◌̖ )
...

6 Analysis

Analysis includes character properties, regular expressions, and boundaries (grapheme cluster, word, line, and sentence breaks). In this area, typically the tests will check against the UCD properties, plus the guidelines for how those properties are used. The exact formulation of the test will depend on the API and language involved.

Caution: For the English language, functions such as isLetter() are sufficient for a variety of tasks, such as word-break. With the wide variety of languages, scripts, and types of characters supported by the Unicode Standard, this is not true. The presence of non-spacing marks in Arabic, for example, will cause any naïve use of such functions to give incorrect results. More sophisticated mechanisms must be used for determining such tasks.

Note: It is perfectly conformant to supply additional, tailored behavior (such as the results of property APIs, or different word breaks) that is different than the Unicode default behavior, as long as such behavior does not purport to follow the Unicode default specifications.

The main features to test for are the following.

Properties: If property APIs are available and purport to match the UCD definitions, then test whether they match by comparing the results of calling those APIs with the actual UCD data tables.
- A list of properties is found in PropertyAliases.txt
- In particular,
  - if isLetter() is supported, it should use the property from the UCD: Alphabetic = true
  - if isDigit() is supported, it should use the property from the UCD: Numeric_Type = Decimal_Digit
Regular Expressions (optional): test whether these offer at least Level 1 support as specified in UTR #18: Unicode Regular Expression Guidelines.
Char/Word/Line/Sentence Breaks (optional): test any API that purports to implement the default Unicode behavior (described in UAX #14: Line Breaking Properties and UAX #29: Text Boundaries).
- Note: these algorithms may be tailored to be different than the default, as long as it is documented.
Case Detection: If APIs are available for case determination, test that they follow the guidelines in Chapter 3 of [Unicode]. Test all supported APIs using the CaseTest.txt file.
- isUppercase
- isLowercase
- isTitlecase
- isCaseFold
- Note: these algorithms may be tailored to be different than the default, as long as it is documented.
Normalization Detection: Test any supported forms according to the section on Conformance Testing.
- isNFC
- isNFD
- isNFKC
- isNFKD
Well-Formed Sequences: Ideographic Description Sequences and Annotation Sequences have syntactic restrictions. If these are supported, they can be tested with IDSTest.txt and AnnotationTest.txt.
- However, these particular features are generally not core to the functioning of any systems (annotation characters are also discouraged for interchange), so these tests can be omitted.

6.1 Analysis Tests

For testing Unicode properties, a small test program should be written that for each property:

parses the relevant file(s) in the UCD
collects the property values for each code point
for each code point from U+0000 to U+10FFFF
- calls the programmatic API,
- verifies the property value against the UCD value.
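
The structure of such a test program can be sketched as follows. Python's unicodedata.category stands in for the API under test, and three hand-written lines in the usual UCD range format stand in for the data file; both are assumptions for illustration. Because the sample covers only a few ranges, the check runs over the listed code points only, whereas a real test with the complete file would loop from U+0000 to U+10FFFF.

```python
import unicodedata

SAMPLE_UCD = """\
0030..0039    ; Nd #  [10] DIGIT ZERO..DIGIT NINE
0041..005A    ; Lu #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Ll #  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
"""

def load_ranges(text: str) -> dict:
    """Parse 'range ; value # comment' lines into {code point: value}."""
    values = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        first, _, last = fields[0].partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            values[cp] = fields[1]
    return values

def mismatches(expected: dict, api) -> list:
    """Return the code points where the API disagrees with the data file."""
    return [
        cp for cp, value in sorted(expected.items())
        if api(chr(cp)) != value
    ]

expected = load_ranges(SAMPLE_UCD)
problems = mismatches(expected, unicodedata.category)
print(len(expected), problems)
```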

For regular expressions, UTR #18 provides 3 levels for regular expressions. The feature sets in these levels can be tested for explicitly. Note: the TR does not require any particular syntax, so any tests have to be adapted to the syntax of the regular expression engine.

For case detection, test with the following file [TBD]. In addition, verify that the functions respect canonical equivalence by applying all functions to each field in NormalizationTest.txt, and verifying that the same answer is produced.

For default grapheme-cluster, word, line and sentence boundaries, the following tests can be used.

Common APIs will test a particular offset to see if it is a boundary, and also iterate (e.g. find the next boundary). Verify that both APIs provide the same results on all of the test cases, by iterating over each test case, and independently determining the boundaries one at a time, then comparing the two sets of results.

The Well-Formed tests, as mentioned above, are rarely worth testing for. However, if those features are important for a particular application, the following can be used.

Test any API that purports to detect valid IDS with IDSTest.txt
Test any API that purports to detect valid Annotation sequences with AnnotationTest.txt

7 Comparison

Comparison includes both binary comparison and culturally-sensitive comparison based on the UCA (UTS #10). The latter includes string comparison, string search, and sortkey generation. If a globalized implementation sorts, compares, or searches text, and the results of such an operation are apparent to end users, the implementation must use a culturally-sensitive comparison algorithm.

In the case of Collation (and the related StringSearch), only the default collation ordering can be tested, since there is currently no accepted repository of machine-readable tailorings for different languages.

Note: It is perfectly conformant to supply additional, tailored behavior (such as the results of collation ordering) that is different than the Unicode default behavior, as long as such behavior does not purport to follow the Unicode default specifications.

Binary comparison works by lexically comparing strings: the first string unit that differs "wins". Unicode has three encoding forms for processing: UTF-8, UTF-16, and UTF-32. These give rise to two common binary orders, code point order and UTF-16 order. Software needs to be able to perform either comparison, regardless of its native Unicode encoding form, so that its sorted data structures (lists, trees, etc.) have the same binary order as those of other software in a connected system. An example is a Java servlet, which processes UTF-16, working with a UTF-8 database.

In UTF-32, string units are directly code points, so its binary order is the same as code point order.
UTF-8 has the same binary order because of its encoding mechanism.
UTF-16 has a slightly different order: a standard strcmp-style binary comparison on well-formed UTF-16 strings sorts U+0000..U+D7FF, then U+10000..U+10FFFF, then U+E000..U+FFFF.
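The difference between the orders can be observed directly. The sketch below relies on two assumptions that hold in Python 3: strings compare by code point, and comparing big-endian UTF-16 bytes is equivalent to comparing 16-bit code units. The function names are illustrative.

```python
def compare(a, b):
    """Three-way comparison returning -1, 0, or 1."""
    return (a > b) - (a < b)


def code_point_compare(s1, s2):
    # Python 3 str comparison is by code point (the same order as UTF-32).
    return compare(s1, s2)


def utf8_compare(s1, s2):
    # Byte-wise comparison of UTF-8 preserves code point order.
    return compare(s1.encode("utf-8"), s2.encode("utf-8"))


def utf16_compare(s1, s2):
    # Big-endian bytes compare the same way as the 16-bit code units.
    return compare(s1.encode("utf-16-be"), s2.encode("utf-16-be"))


if __name__ == "__main__":
    pairs = [("\uD7FF", "\U00010000"), ("\uE000", "\U00010000"),
             ("\uFFFF", "\U0010FFFF")]
    for s1, s2 in pairs:
        print(f"U+{ord(s1):04X} vs U+{ord(s2):04X}:",
              code_point_compare(s1, s2), utf8_compare(s1, s2),
              utf16_compare(s1, s2))
```

For the last two pairs the code point and UTF-8 comparisons report "less" while the UTF-16 comparison reports "greater", because the supplementary code points are encoded with lead surrogates (D800..DBFF) that sort below E000..FFFF.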

7.1 Comparison Tests

Binary Comparison: There are two common binary comparison orders: UTF-16 order and code point order. The test file BinaryComparisonTest.txt has the following format:

<string1> ; <string2> ; <code point relation> ; <UTF-16 relation>

For example:

0061; 0062; LESS; LESS;
FFFF; 10FFFF; LESS; GREATER;
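Lines in this format could be checked with a small driver such as the sketch below. It makes several assumptions beyond what the format line states: each string field is a space-separated list of hexadecimal code points, the relation keywords are LESS, EQUAL, and GREATER (only LESS and GREATER appear in the examples), and Python's own comparisons stand in for the implementation under test.

```python
# Assumed keyword set; EQUAL does not appear in the examples above.
RELATIONS = {"LESS": -1, "EQUAL": 0, "GREATER": 1}


def compare(a, b):
    """Three-way comparison returning -1, 0, or 1."""
    return (a > b) - (a < b)


def parse_string(field):
    """Turn a field such as '0061' or '0061 0301' into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_line(line):
    """Return (string1, string2, code point relation, UTF-16 relation)."""
    fields = [field.strip() for field in line.split(";")]
    return (parse_string(fields[0]), parse_string(fields[1]),
            RELATIONS[fields[2]], RELATIONS[fields[3]])


def check_line(line):
    """Return True if both comparison orders match the expected relations."""
    s1, s2, code_point_expected, utf16_expected = parse_line(line)
    code_point_actual = compare(s1, s2)  # str compares by code point
    utf16_actual = compare(s1.encode("utf-16-be"), s2.encode("utf-16-be"))
    return (code_point_actual == code_point_expected
            and utf16_actual == utf16_expected)


if __name__ == "__main__":
    for line in ["0061; 0062; LESS; LESS;", "FFFF; 10FFFF; LESS; GREATER;"]:
        print(line, "ok" if check_line(line) else "FAILED")
```

To assess a real implementation, the two comparisons in check_line would be replaced by calls to its code point order and UTF-16 order comparison functions.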

Collation (UCA UTS #10): If the process purports to support the UCA, verify the default collation sequence using the test files in http://www.unicode.org/unicode/reports/tr10/#Test. If both sortkey generation and

String Search: Verify that the locale-sensitive string search functions follow the UCA, according to StringSearchTest.txt. Note: This needs to be fleshed out more.

Case-Insensitive Compare: Verify that the results follow the guidelines in Chapter 3 of [Unicode]. In particular, verify that caseInsensitiveCompare(x, y) = compare(toCaseFold(x), toCaseFold(y)).

Test all supported APIs using the CaseTest.txt file.
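The identity above can serve as a reference against which a case-insensitive comparison API is checked. In the sketch below, Python's str.casefold stands in for toCaseFold, and lower_compare is a hypothetical, deliberately naive candidate that compares lowercased strings. The string pairs are illustrative and are not taken from CaseTest.txt.

```python
def compare(x, y):
    """Three-way binary comparison returning -1, 0, or 1."""
    return (x > y) - (x < y)


def reference_compare(x, y):
    """compare(toCaseFold(x), toCaseFold(y)), with str.casefold as toCaseFold."""
    return compare(x.casefold(), y.casefold())


def lower_compare(x, y):
    """Hypothetical candidate under test: compares lowercased strings."""
    return compare(x.lower(), y.lower())


def violations(candidate, pairs):
    """Return the pairs on which the candidate departs from the reference."""
    return [(x, y) for x, y in pairs
            if candidate(x, y) != reference_compare(x, y)]


# Illustrative pairs only (assumed for this sketch).
PAIRS = [("abc", "ABC"), ("a", "B"), ("Stra\u00DFe", "STRASSE")]

if __name__ == "__main__":
    print(violations(lower_compare, PAIRS))
```

The naive candidate fails on the last pair, since lowercasing leaves U+00DF unchanged while case folding maps it to "ss". This is the kind of discrepancy the test is meant to expose.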

8 Transformations

Transformations are functions that take a string as input, and produce a (perhaps) modified string as output. They include case conversion and normalization. These are described in detail in Chapter 3 of [Unicode] and in UAX #15.

Note: It is perfectly conformant to supply additional, tailored behavior (such as the results of case folding) that is different than the Unicode default behavior, as long as such behavior does not purport to follow the Unicode default specifications. However, it must be clear to programmers and end users that the default Unicode behavior is not being followed.

8.1 Transformation Tests

The following tests can be used.

Case Mapping/Folding:
- Test all supported APIs using the CaseTest.txt file.
  - toUppercase
  - toLowercase
  - toTitlecase
  - toCaseFold
- Note: these algorithms may be tailored to differ from the default, as long as the tailoring is documented.
Normalization
- Test any supported forms according to the section on Conformance_Testing.
  - toNFC
  - toNFD
  - toNFKC
  - toNFKD
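For the normalization transformations, a test driver in the style of the UAX #15 conformance test might look like the sketch below. It assumes the five-column layout of NormalizationTest.txt (source; NFC; NFD; NFKC; NFKD, each a space-separated list of hexadecimal code points) and the invariants stated in that file's header. Python's unicodedata.normalize stands in for the API under test, and the three sample lines are written in the file's format for illustration.

```python
import unicodedata

# Sample lines in the NormalizationTest.txt column layout (illustrative).
SAMPLE_LINES = [
    "1E0A;1E0A;0044 0307;1E0A;0044 0307;",
    "212B;00C5;0041 030A;00C5;0041 030A;",
    "FB01;FB01;FB01;0066 0069;0066 0069;",
]


def parse_fields(line):
    """Return the five columns c1..c5 as strings."""
    data = line.split("#", 1)[0]  # drop any trailing comment
    columns = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in columns]


def check_line(line):
    """Check the conformance invariants for one test line."""
    c1, c2, c3, c4, c5 = parse_fields(line)

    def norm(form, strings):
        return [unicodedata.normalize(form, s) for s in strings]

    return (
        norm("NFC", (c1, c2, c3)) == [c2] * 3       # c2 == toNFC(c1..c3)
        and norm("NFC", (c4, c5)) == [c4] * 2       # c4 == toNFC(c4..c5)
        and norm("NFD", (c1, c2, c3)) == [c3] * 3   # c3 == toNFD(c1..c3)
        and norm("NFD", (c4, c5)) == [c5] * 2       # c5 == toNFD(c4..c5)
        and norm("NFKC", (c1, c2, c3, c4, c5)) == [c4] * 5  # c4 == toNFKC(all)
        and norm("NFKD", (c1, c2, c3, c4, c5)) == [c5] * 5  # c5 == toNFKD(all)
    )


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(line, "ok" if check_line(line) else "FAILED")
```

A full run would read every data line of the test file that matches the Unicode version of the implementation, and would skip the part headers that begin with "@".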

References

[CESU8]	UTR #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) http://www.unicode.org/reports/tr26/
[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/unicode/faq/ For answers to common questions on technical issues.
[Feedback]	Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[Reports]	Unicode Technical Reports http://www.unicode.org/unicode/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[Unicode]	The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1. For an online version, see: http://www.unicode.org/versions/Unicode4.0.0/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/unicode/standard/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Acknowledgements

Thanks to Helena Shih Chapman, Julius Griffith, Markus Scherer, Baldev Soor, Akio Kido, Kentaroh Noji, Takaaki Shiratori, Xiao Hu Zhu, Geng Zheng, CP Chang, Matitiahu Allouche, Tarek Abou Aly, Ranat Thopunya, and Israel Gidali for their many contributions to this document.

Modifications

The following summarizes modifications from the previous version of this document.

Incorporated feedback from UTC
Updated to Unicode 4.0.
Misc. edits

Previous Modifications

Incorporated feedback from UTC
Reworded conformance sections
Dropped keyboard input and rendering section
Changed focus of document to assessing Unicode support, and added more qualifying language about tailorings and defaults.
Added substantial new sections on:
- 1.1 Conformance_Issues (with text from Ken's document)
- 1.2 Guidelines
- 2.1 Canonical Equivalence
Added new programmatic tests, in particular:
Added new draft rendering/input tests for Arabic, Hebrew, and Thai.
Substantial rework of the text in general.
Removed all conversions except UTFs and 8859-* from 3.1 Character Conversion Tests
Broke 5 Programming_Support into separate sections, and added to the test section.

Copyright © 2002-2003 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.