L2/00-373

Karlsson Kent - keka on 10/24/2000 09:18:22 AM
To: "Multiple Recipients of Unicore"
UTC action on malformed/illegal UTF-8 sequences?

The UTF-8 specification already mentions octet values FE and FF as "not used" (page 894 of 10646-1:2000). Why not start from that, which already disallows a subset of "overlong sequences" (namely those of lengths 7 and 8 octets)?

One should also avoid mentioning "overlong sequences" in a specification of this, since 1) the term is not precise, and 2) it leads implementers to try to figure out exactly what is being ruled out. What is ruled out should be said explicitly and precisely.

So, *in addition to the general UTF-8 rules*, I see the following alternatives for the further restrictions to be applied:

Alt. 0, current: Forbidding anything over 31 bits (special case of overlong sequences):
    The octets FE and FF shall not be used.

Alt. 1: Forbidding overlong sequences and anything over 31 bits:
    The octets C0, C1, FE, and FF shall not be used. After an E0 octet the next octet shall be at least A0, after an F0 octet the next octet shall be at least 90, after an F8 octet the next octet shall be at least 88, and after an FC octet the next octet shall be at least 84.

Alt. 2: Forbidding overlong sequences and anything over 21 bits:
    The octets C0, C1, F8, F9, FA, FB, FC, FD, FE, and FF shall not be used. After an E0 octet the next octet shall be at least A0, and after an F0 octet the next octet shall be at least 90.

Alt. 3: Forbidding overlong sequences and anything over 10FFFF:
    The octets C0, C1, F5, F6, F7, F8, F9, FA, FB, FC, FD, FE, and FF shall not be used. After an E0 octet the next octet shall be at least A0, after an F0 octet the next octet shall be at least 90, and after an F4 octet the next octet shall be at most 8F.

Suggestion: Forbid overlong sequences and anything over 10FFFF:
    The octets C0, C1, and F5-FF shall not be used. After an E0 octet the next octet (as an unsigned integer) shall be at least A0, after an F0 octet the next octet shall be at least 90, and after an F4 octet the next octet shall be at most 8F.

(All numerals above are in hexadecimal notation.)

If this is to be applied to the text in annex D of 10646-1:2000, there are some implied changes that I will not detail here.
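
As an illustration only, the suggested restrictions could be checked along the following lines. This is a minimal sketch and not part of the proposal: the names `FORBIDDEN_OCTETS`, `SECOND_OCTET_BOUNDS`, and `is_acceptable_utf8` are hypothetical, and the "general UTF-8 rules" are assumed to mean that each lead octet is followed by the right number of trail octets in the range 80-BF. Nothing beyond what the message states is enforced, so encoded surrogate code points, for example, are not rejected.

```python
# Octets that shall not be used at all: C0, C1, and F5-FF.
FORBIDDEN_OCTETS = {0xC0, 0xC1} | set(range(0xF5, 0x100))

# Extra bounds on the octet directly after certain lead octets.
# lead octet -> (minimum, maximum) allowed value of the next octet
SECOND_OCTET_BOUNDS = {
    0xE0: (0xA0, 0xBF),  # at least A0
    0xF0: (0x90, 0xBF),  # at least 90
    0xF4: (0x80, 0x8F),  # at most 8F
}


def is_acceptable_utf8(data: bytes) -> bool:
    """Check the general UTF-8 rules plus the suggested restrictions."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead in FORBIDDEN_OCTETS:
            return False
        if lead < 0x80:
            length = 1
        elif lead < 0xC0:
            return False  # trail octet where a lead octet is expected
        elif lead < 0xE0:
            length = 2
        elif lead < 0xF0:
            length = 3
        else:
            length = 4  # F0-F4; F5-FF were rejected above
        if i + length > n:
            return False  # sequence is cut short
        for k in range(1, length):
            low, high = 0x80, 0xBF  # general rule for trail octets
            if k == 1:
                low, high = SECOND_OCTET_BOUNDS.get(lead, (low, high))
            if not low <= data[i + k] <= high:
                return False
        i += length
    return True
```

With this sketch, F4 8F BF BF (the encoding of 10FFFF) is accepted, while F4 90 80 80, C0 80, and E0 9F BF are rejected, matching the bounds given in the suggestion.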