Public Review Issue #121

L2/08-286

Recommended Practice for Replacement Characters

The Unicode Technical Committee has been asked to specify a recommended practice for the use of replacement characters when handling ill-formed subsequences.

When converting between Unicode encoding forms or when validating Unicode text, there is no requirement that an ill-formed subsequence be replaced with U+FFFD character(s); an application can, for example, throw an exception or substitute other characters such as "?". However, even when replacement with U+FFFD is done, several approaches are possible. The three principal approaches can be stated as policy options:

1. Replace the entire ill-formed subsequence by a single U+FFFD.
2. Replace each maximal subpart of the ill-formed subsequence by a single U+FFFD.
3. Replace each code unit of the ill-formed subsequence by a single U+FFFD.

In these policy statements, "entire ill-formed subsequence" refers to all code units in the ill-formed subsequence up to but not including the start of the next well-formed code unit sequence. The term "maximal subpart of the ill-formed subsequence" refers to the longest potentially valid initial subsequence or, if none, then to the next single code unit.

The following table illustrates the application of these alternative policies for an example of conversion of UTF-8 to UTF-16, the most common kind of conversion for which the differences are apparent and for which a recommended practice would be desirable for interoperability:

      61      F1      80      80      E1      80      C2      62
1   U+0061  U+FFFD                                          U+0062
2   U+0061  U+FFFD                  U+FFFD          U+FFFD  U+0062
3   U+0061  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+0062
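
The three rows of the table can be reproduced with a short decoder. The following Python is a minimal sketch, not a reference implementation: the names `scan` and `decode_with_policy` are hypothetical, and the lead-byte and second-byte ranges are assumed to follow the well-formed UTF-8 byte sequence table in the Unicode Standard. A run of adjacent maximal subparts is treated as one "entire ill-formed subsequence" for option 1.

```python
REPLACEMENT = "\uFFFD"

def _lead_info(b):
    """Return (sequence length, allowed range of the second byte) for a
    UTF-8 lead byte, or None if b cannot start a multi-byte sequence."""
    if 0xC2 <= b <= 0xDF: return 2, 0x80, 0xBF
    if b == 0xE0:         return 3, 0xA0, 0xBF  # excludes overlong forms
    if b == 0xED:         return 3, 0x80, 0x9F  # excludes surrogates
    if 0xE1 <= b <= 0xEF: return 3, 0x80, 0xBF
    if b == 0xF0:         return 4, 0x90, 0xBF  # excludes overlong forms
    if 0xF1 <= b <= 0xF3: return 4, 0x80, 0xBF
    if b == 0xF4:         return 4, 0x80, 0x8F  # excludes > U+10FFFF
    return None

def scan(data, i):
    """Return (n, ok). If ok, data[i:i+n] is a well-formed sequence.
    Otherwise n is the length of the maximal subpart starting at i."""
    b = data[i]
    if b < 0x80:
        return 1, True
    info = _lead_info(b)
    if info is None:
        return 1, False  # stray trail byte or invalid lead byte
    length, lo, hi = info
    n = 1
    while n < length:
        if i + n >= len(data) or not (lo <= data[i + n] <= hi):
            return n, False  # valid prefix ends here
        lo, hi = 0x80, 0xBF  # later trail bytes use the full range
        n += 1
    return n, True

def decode_with_policy(data, policy):
    """Decode UTF-8 bytes, applying replacement option 1, 2, or 3."""
    out = []
    in_bad_run = False  # True while inside an ill-formed subsequence
    i = 0
    while i < len(data):
        n, ok = scan(data, i)
        if ok:
            out.append(data[i:i + n].decode("utf-8"))
            in_bad_run = False
        else:
            if policy == 1:
                if not in_bad_run:
                    out.append(REPLACEMENT)  # one per ill-formed run
            elif policy == 2:
                out.append(REPLACEMENT)      # one per maximal subpart
            else:
                out.append(REPLACEMENT * n)  # one per code unit
            in_bad_run = True
        i += n
    return "".join(out)

example = bytes.fromhex("61F18080E180C262")
for policy in (1, 2, 3):
    decoded = decode_with_policy(example, policy)
    print(policy, " ".join(f"U+{ord(c):04X}" for c in decoded))
```

For the example bytes, `scan` finds three maximal subparts (F1 80 80, E1 80, and C2), so option 2 emits three U+FFFD characters, option 3 emits six, and option 1 emits one.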

The UTC has indicated a tentative preference for option #2, but is interested in feedback on what would be the best recommended practice, and reasons for that choice. The UTC also requests feedback about which products or libraries are known to follow these or other policies for replacement of ill-formed subsequences on conversion or validation.