| To | UTC |
| From | Mark Davis |
| Date | 2000-10-15 |
| Re | UTF-8 and "Shortest Form" |
The following is a proposal for tightening up the language for UTF-8 to close the "shortest form" issue.
The current C12 forbids the generation of non-shortest forms and forbids the interpretation of illegal sequences, but it does not forbid the interpretation of non-shortest forms. Here is the current C12:
| C12 | When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat illegal byte sequences as an error condition. |
We still want to allow fast processing without error checking when the data is guaranteed to be well-formed. Otherwise, we can extend the prohibition from illegal byte sequences alone to all ill-formed byte sequences. Modify it as follows:
| C12 | When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat ill-formed byte sequences as an error condition unless the data is guaranteed to be well-formed. |
| D36 | UTF-8 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of one to four bytes, as specified in Table 3.1. Such a byte sequence in UTF-8 is ill-formed if it does not meet the conditions in Table 3.1a. |
Add the following table as Table 3.1a:
| 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|
| 00-7F | | | |
| C2-DF | 80-BF | | |
| E0 | A0-BF | 80-BF | |
| E1-EF | 80-BF | 80-BF | |
| F0 | 90-BF | 80-BF | 80-BF |
| F1-F3 | 80-BF | 80-BF | 80-BF |
| F4 | 80-8F | 80-BF | 80-BF |
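
As an illustration only, the table can be read as a small table-driven check. The following Python sketch mirrors the rows exactly as written above; the names `WELL_FORMED_ROWS` and `is_well_formed` are hypothetical and are not part of the proposal. It assumes the input is a complete byte string, so a sequence cut off at the end is reported as ill-formed.

```python
# Rows of the proposed table: a first-byte range followed by the ranges
# allowed for each trailing byte. Every range is an inclusive (low, high) pair.
# Names here are illustrative assumptions, not part of the proposal.
WELL_FORMED_ROWS = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_well_formed(data: bytes) -> bool:
    """Return True if `data` splits entirely into sequences matching a table row."""
    i = 0
    while i < len(data):
        # Find the row whose first-byte range contains the current byte.
        for (lo, hi), trailing in WELL_FORMED_ROWS:
            if lo <= data[i] <= hi:
                break
        else:
            # No row matches: 80-BF, C0, C1, and F5-FF never start a sequence.
            return False
        seq = data[i + 1 : i + 1 + len(trailing)]
        if len(seq) != len(trailing):
            return False  # input ends in the middle of a sequence
        for byte, (t_lo, t_hi) in zip(seq, trailing):
            if not t_lo <= byte <= t_hi:
                return False  # trailing byte outside the range for this row
        i += 1 + len(trailing)
    return True


# Non-shortest encodings of U+002F are rejected at every length.
print(is_well_formed(b"\x2F"))              # True
print(is_well_formed(b"\xC0\xAF"))          # False
print(is_well_formed(b"\xE0\x80\xAF"))      # False
print(is_well_formed(b"\xF0\x80\x80\xAF"))  # False
```

Under the modified C12, a process that cannot guarantee its input is well-formed would run a check of this kind and treat a `False` result as an error condition; a process with that guarantee could skip the check entirely.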