Re: HTML forms and UTF-8

Date: Mon Nov 08 1999 - 22:16:10 EST

From: François Yergeau
>> 2) Submits in UTF-8 if Accept-Charset="UTF-8" is given in the <form>
>> and input is found which does not fit into the form page's encoding.
>Why the latter condition? It seems to me that if the form author says
>Accept-Charset="UTF-8", that's what he wants. This behaviour also seems
>less deterministic, saying UTF-8 does not cause UTF-8 to be returned but
>MAY cause it, depending on what each user types.

The encoding of the returned data is fully deterministic if you serve your
page in UTF-8. No browser can submit in a charset it cannot also read.
The behavior is designed this way to resolve the ambiguity that arises
- if several charsets are named in the accept-charset
- and I didn't equip the browser with a general mechanism to
  determine the "best fit" charset among all charsets
  listed in the <form>.
So the deterministic, downlevel- and upward-compatible design is:
The encoding of the <form> document is always on top of the stack.
The <form accept-charset> entries get tried for "fit" in the stated order.
In other words, the page's charset is always first in the list of
accept-charsets.
For Internet Explorer 5 we stopped at the first entry and look only for
UTF-8 (because that is really the only one that makes practical sense).
If there is interest and reason, I have no problem looking further in
future releases.
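
To make the ordering concrete, here is a minimal sketch in Python of the general design described above (page charset first, then the accept-charset entries in the stated order). It is not Internet Explorer's code: the function name and parameters are hypothetical, "fit" is approximated as "the text can be encoded without error", and the fallback when nothing fits is an assumption. IE5 as described is narrower, since it only considers a leading UTF-8 entry.

```python
def choose_submit_charset(page_charset, accept_charsets, form_text):
    """Return the first charset that can encode form_text.

    Hypothetical helper: the page's own charset is always tried first,
    then the accept-charset entries in the order the author listed them.
    """
    candidates = [page_charset] + [
        c for c in accept_charsets if c.lower() != page_charset.lower()
    ]
    for charset in candidates:
        try:
            form_text.encode(charset)
        except (UnicodeEncodeError, LookupError):
            # Text does not fit, or Python does not know this charset name.
            continue
        return charset
    # Assumption: if nothing fits, fall back to the page's charset.
    return page_charset
```

With a Latin-1 page and accept-charset="UTF-8", input that fits Latin-1 is submitted in Latin-1, and only input outside Latin-1 causes a UTF-8 submission.
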
Your CGI can be deterministic regardless by looking at the value of
_charset_. Even if Internet Explorer 5 returned only UTF-8, as in your
example, you would still need to write your CGI to accept non-UTF-8
encodings for version 4 and older browsers. As implemented, you can satisfy
both with a single, simple CGI and no browser version check:
if (_charset_ == "UTF-8") { assume UTF-8 encoded data }
else { assume the page's charset, and forget about recovering data outside
the page's charset }
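
The check above could look roughly like this on the server side. This is only a sketch in Python, assuming the form body arrives as an application/x-www-form-urlencoded string and carries a `_charset_` field as described; `decode_form` and `page_charset` are hypothetical names, not part of any CGI library.

```python
from urllib.parse import parse_qs


def decode_form(body, page_charset):
    """Decode a urlencoded form body, trusting _charset_ only for UTF-8.

    body is the raw (still percent-encoded) form data as a str;
    page_charset is the charset the form page was served in.
    """
    # First pass: Latin-1 never fails, and the _charset_ value is ASCII.
    probe = parse_qs(body, keep_blank_values=True, encoding="latin-1")
    declared = probe.get("_charset_", [""])[0].strip().lower()

    # UTF-8 if the browser said so, otherwise assume the page's charset.
    charset = "utf-8" if declared == "utf-8" else page_charset

    # Second pass: decode the percent-escaped bytes in the chosen charset.
    fields = parse_qs(body, keep_blank_values=True,
                      encoding=charset, errors="replace")
    fields.pop("_charset_", None)
    return fields
```

For example, `decode_form("name=caf%E9", "iso-8859-1")` treats the data as Latin-1 because no UTF-8 declaration is present, which is the path older browsers would take.
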

