Re: HTML forms and UTF-8

From: Chris Wendt
Date: Sun Nov 07 1999 - 22:33:29 EST

François has a typo in his example: the attribute name should be
Accept-Charset, not Accept_Charset (note the dash instead of the underscore).

Internet Explorer 5:

1) Sets a hidden field named "_charset_" to the encoding the FORM data was
submitted in. ("_charset_" name includes the underscores).
2) Submits in UTF-8 if Accept-Charset="UTF-8" is given in the <form> and
input is found which does not fit into the form page's encoding.
3) If Accept-Charset is not specified or not set to UTF-8, Internet Explorer
5 will submit in the form page's encoding, given that support for that
encoding is present on the client machine.
4) Submits the form data which does not fit into the used encoding as HTML4
Numeric Character References.

Internet Explorer 4 shares feature 4) above and always submits in the form
page's encoding.

You could prepopulate the _charset_ field with the form page's encoding so
it always gets returned to your CGI.
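
On the receiving side, the behaviour listed above could be handled roughly as
in the Python sketch below. It is only an illustration under assumptions: the
helper names (expand_ncrs, split_pairs, decode_form) are made up for this
example, the body is assumed to be ordinary urlencoded form data, and
page_charset is whatever encoding you served the form page in. Note that a
user who literally types "&#1074;" cannot be told apart from a browser-made
reference, and that repeated field names collapse to the last value here.

```python
import codecs
import re
from urllib.parse import unquote_to_bytes

# Matches decimal (&#1074;) and hexadecimal (&#x432;) character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def expand_ncrs(text):
    """Replace numeric character references with the characters they name."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        try:
            code = int(hex_digits, 16) if hex_digits else int(dec_digits)
            return chr(code)
        except (ValueError, OverflowError):
            return match.group(0)  # not a valid code point: leave as typed
    return _NCR.sub(repl, text)


def split_pairs(body):
    """Split an urlencoded body into (field name, raw value bytes) pairs."""
    pairs = []
    for item in body.split("&"):
        if not item:
            continue
        name, _, value = item.partition("=")
        raw_name = unquote_to_bytes(name.replace("+", " "))
        raw_value = unquote_to_bytes(value.replace("+", " "))
        pairs.append((raw_name.decode("ascii", "replace"), raw_value))
    return pairs


def decode_form(body, page_charset):
    """Decode form data, trusting _charset_ if the browser filled it in."""
    pairs = split_pairs(body)
    charset = page_charset  # fallback for browsers that ignore _charset_
    for name, raw in pairs:
        if name == "_charset_" and raw:
            declared = raw.decode("ascii", "ignore")
            try:
                codecs.lookup(declared)  # only accept encodings Python knows
                charset = declared
            except (LookupError, ValueError):
                pass
    return {
        name: expand_ncrs(raw.decode(charset, "replace"))
        for name, raw in pairs
        if name != "_charset_"
    }
```

For example, decode_form("_charset_=utf-8&name=Fran%C3%A7ois", "iso-8859-1")
decodes the name field as UTF-8, while a body without a usable _charset_
value falls back to the page's encoding.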


----- Original Message -----
From: François Yergeau
To: Unicode List
Sent: Sunday, November 07, 1999 2:38 PM
Subject: RE: HTML forms and UTF-8

> From: Glen Perkins
> Date: Sunday, November 7, 1999 16:14
> What is the best approach for getting data submitted by an HTML form into
> Unicode (presumably UTF-8) encoding?

See the HTML 4 specification for the official way to do that. Warning: it
doesn't work.

> I'd like to be able to roll out forms in any number of languages/scripts
> and have the data returned to the same CGI program (perl_mod or whatever)
> in the same encoding, UTF-8, or else determine the encoding of the
> returning data and convert to UTF-8 immediately as the first step in the
> CGI/server side processing program.

Good idea, but be prepared for a rough ride :-(

The traditional way that forms work is that the data is returned in the same
encoding as the page containing the form. This kind of works when there is
a single page in a single encoding (no transcoding proxy, for instance)
handled by a single CGI script. But it breaks in many cases and does not
allow multilingual content (except when the page is in Unicode, of course).

RFC 2070 tried to improve that by introducing an Accept-Charset attribute on
the INPUT element within forms. HTML 4 (the reference above) adopted that
but moved the attribute to the FORM element. The wording accompanying it is
pretty bad: the fact that it is supposed to influence browser behaviour is
very unclear. Anyway, to tell the browser you want the data in UTF-8, you're
supposed to set Accept-Charset="UTF-8" on the FORM element.

The problem is that most browsers will not listen, and still send the data
in the page's encoding.
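
As an illustration only, a form carrying that attribute could be generated
like this in Python. The function name render_form, the action URL and the
field name "comment" are placeholders invented for the example; the one thing
taken from HTML 4 is the hyphenated attribute name accept-charset on the FORM
element.

```python
from html import escape


def render_form(action, field_name="comment"):
    """Return a minimal form that asks the browser to submit in UTF-8."""
    return (
        '<form action="%s" method="post" accept-charset="UTF-8">\n'
        '  <input type="text" name="%s">\n'
        '  <input type="submit">\n'
        "</form>"
    ) % (escape(action, quote=True), escape(field_name, quote=True))


if __name__ == "__main__":
    print(render_form("/cgi-bin/echo"))
```

Whether a given browser honours the attribute is another matter, so the
receiving script should not assume the data really arrives in UTF-8.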

People usually deal with that by using a hidden field in the form (<INPUT
TYPE="hidden">). You can put some text in there that will not be shown to
the user but that will be returned as part of the form data. By looking at
the bytes of that text, your CGI can determine its encoding (knowing the
characters in advance) and that is also the encoding of the rest of the
data. If that smells like a hack, looks like a hack and moves like a hack,
that's because it *is* a hack. But that's the best we have in the current
broken architecture of Web forms.
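
A rough Python sketch of that hack, under stated assumptions: the probe text
"éß", the candidate list and the function name detect_encoding are all
invented for the example. The probe has to be chosen so that its bytes differ
between the encodings you expect; these two characters cannot separate
ISO-8859-1 from windows-1252, for instance.

```python
# Hypothetical probe text placed in the hidden field ("éß").
PROBE_TEXT = "\u00e9\u00df"


def detect_encoding(probe_bytes, candidates=("utf-8", "iso-8859-1", "shift_jis")):
    """Return the first candidate whose encoding of the probe matches the bytes."""
    for name in candidates:
        try:
            if PROBE_TEXT.encode(name) == probe_bytes:
                return name
        except (UnicodeEncodeError, LookupError):
            # This candidate cannot represent the probe, or is unknown here.
            continue
    return None  # no match: the data cannot be decoded reliably
```

Here probe_bytes is the raw, percent-decoded value of the hidden field as it
came back from the browser. If no candidate matches, it is probably safer to
reject the submission than to guess.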

François Yergeau
