To: UTC
From: Mark Davis
Date: 2000-10-15
Re: UTF-8 and "Non-Shortest Form" (R2)
The following is a proposal for tightening
up the language for UTF-8 to close the "non-shortest form" issue. It
is modified from the previous paper, taking in feedback from the mailing list.
I suggest that two additional steps be taken:
The current C12 forbids the generation of "non-shortest form" sequences and forbids the interpretation of illegal sequences, but it does not forbid the interpretation of non-shortest forms. We still need to allow for programs that do fast processing with no error checking where the data is guaranteed to be well-formed, but otherwise we can extend the prohibition on illegal byte sequences to cover all ill-formed byte sequences, not just illegal ones. We also need to change the definition of UTF-8 to make absolutely clear what is ill-formed. (The definition of UTF-8 is in Chapter 3, and is duplicated in http://www.unicode.org/unicode/faq/utf_bom.html.)
To do this, we make the following normative modifications:

Modify C12 as follows:
C12 | When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat ill-formed |
Modify D36 as follows:

D36 | UTF-8 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of one to four bytes, as specified in Table 3.1. A byte sequence is ill-formed UTF-8 if and only if some sequence of those bytes matches one (or more) of the lines in Table 3.1b. |
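To make the "one to four bytes" serialization and the non-shortest-form problem concrete, here is a minimal Python sketch. Table 3.1 is not reproduced in this paper, so the bit layout below is the commonly documented UTF-8 bit distribution; the function names `encode_scalar` and `naive_decode_two_bytes` are hypothetical and exist only for illustration.

```python
def encode_scalar(cp: int) -> bytes:
    """Serialize one Unicode scalar value as 1 to 4 UTF-8 bytes (shortest form)."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                      # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def naive_decode_two_bytes(b: bytes) -> int:
    """Decode a 2-byte sequence with NO error checking.

    This is the kind of fast path that silently accepts non-shortest forms.
    """
    return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F)
```

With this sketch, `encode_scalar(0x2F)` yields the single byte `2F`, yet `naive_decode_two_bytes(b"\xC0\xAF")` also returns `0x2F`. The two-byte sequence `C0 AF` is a non-shortest form of "/", which is why an interpreting process that performs no checking can be misled.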
Add the following table and text:
Table 3.1b. Ill-Formed UTF-8 Byte Sequences

| Line | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 |
|------|--------|--------|--------|--------|--------|
| 1 | C0-C1 | | | | |
| 2 | F5-FF | | | | |
| 3 | C2-DF | 00-7F, C0-FF | | | |
| 4 | E0 | 00-9F, C0-FF | | | |
| 5 | E1-EF | 00-7F, C0-FF | | | |
| 6 | F0 | 00-8F, C0-FF | | | |
| 7 | F4 | 00-7F, 90-FF | | | |
| 8 | E0-EF | XX | 00-7F, C0-FF | | |
| 9 | F0-F4 | XX | 00-7F, C0-FF | | |
| 10 | F0-F4 | XX | XX | 00-7F, C0-FF | |
| 11 | ED | A0-AF | XX | ED | B0-BF |
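The matching rule behind Table 3.1b can be sketched in Python as follows. This is a literal transcription of the eleven lines of the table, not a complete UTF-8 validator: it reports only what the table itself lists, so a truncated sequence at the end of the data, for example, matches no line. The names `TABLE_3_1B` and `first_ill_formed` are hypothetical. Line 6 is taken here as 00-8F, C0-FF for the second byte, on the assumption that F0 90 80 80 (the shortest form of U+10000) must remain well-formed.

```python
ANY = [(0x00, 0xFF)]  # "XX": matches any byte whatsoever

# Each table line is a list of cells; each cell is a list of inclusive byte ranges.
TABLE_3_1B = {
    1: [[(0xC0, 0xC1)]],
    2: [[(0xF5, 0xFF)]],
    3: [[(0xC2, 0xDF)], [(0x00, 0x7F), (0xC0, 0xFF)]],
    4: [[(0xE0, 0xE0)], [(0x00, 0x9F), (0xC0, 0xFF)]],
    5: [[(0xE1, 0xEF)], [(0x00, 0x7F), (0xC0, 0xFF)]],
    6: [[(0xF0, 0xF0)], [(0x00, 0x8F), (0xC0, 0xFF)]],  # assumption: 8F, see lead-in
    7: [[(0xF4, 0xF4)], [(0x00, 0x7F), (0x90, 0xFF)]],
    8: [[(0xE0, 0xEF)], ANY, [(0x00, 0x7F), (0xC0, 0xFF)]],
    9: [[(0xF0, 0xF4)], ANY, [(0x00, 0x7F), (0xC0, 0xFF)]],
    10: [[(0xF0, 0xF4)], ANY, ANY, [(0x00, 0x7F), (0xC0, 0xFF)]],
    11: [[(0xED, 0xED)], [(0xA0, 0xAF)], ANY, [(0xED, 0xED)], [(0xB0, 0xBF)]],
}


def first_ill_formed(data: bytes):
    """Return (offset, line) for the first table line matched in data, else None.

    Offsets are scanned left to right; at each offset the lines are tried in
    table order, and a line matches only if every one of its cells matches.
    """
    for offset in range(len(data)):
        for line, cells in TABLE_3_1B.items():
            window = data[offset:offset + len(cells)]
            if len(window) == len(cells) and all(
                any(lo <= byte <= hi for lo, hi in cell)
                for byte, cell in zip(window, cells)
            ):
                return offset, line
    return None
```

For example, `first_ill_formed(b"\xC0\xAF")` returns `(0, 1)`, `first_ill_formed(b"A\xE0\x80\x80")` returns `(1, 4)`, and the encoded surrogate pair `ED A0 80 ED B0 80` is caught by line 11. Well-formed input such as `E2 82 AC` matches no line, so the function returns `None`.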
Table 3.1b lists all of the byte sequences that are ill-formed in UTF-8. The "XX" in any cell matches any byte whatsoever; otherwise the specific byte range is listed. Thus