RE: HTML forms and UTF-8

From: François Yergeau (yergeau@alis.com)
Date: Sun Nov 07 1999 - 17:32:57 EST


> From: Glen Perkins [mailto:Glen.Perkins@nativeguide.com]
> Date: Sunday, November 7, 1999 16:14
>
> What is the best approach for getting data submitted by an HTML form into
> Unicode (presumably UTF-8) encoding?

See http://www.w3.org/TR/html40/interact/forms.html#h-17.3 for the official
way to do that. Warning: it doesn't work.

> I'd like to be able to roll out forms in any number of languages/scripts and
> have the data returned to the same CGI program (perl_mod or whatever) in the
> same encoding, UTF-8, or else determine the encoding of the returning data
> and convert to UTF-8 immediately as the first step in the CGI/server side
> processing program.

Good idea, but be prepared for a rough ride :-(

The traditional way that forms work is that the data is returned in the same
encoding as the page containing the form. This kind of works when there is
a single page in a single encoding (no transcoding proxy, for instance)
handled by a single CGI script. But it breaks in many cases and does not
allow multilingual content (except when the page is in Unicode, of course).
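
To make the ambiguity concrete, here is a small Python sketch of what the
server sees for the same typed text, depending on the encoding of the page.
The function name and the sample string are made up for illustration, and it
assumes the browser simply percent-encodes the bytes of the page's encoding:

```python
from urllib.parse import quote_plus

def encode_field(value: str, page_encoding: str) -> str:
    """Percent-encode a form value using the bytes of the given encoding,
    as a browser following the 'same encoding as the page' convention would."""
    return quote_plus(value, encoding=page_encoding)

# The same user input, submitted from two differently encoded pages.
latin1 = encode_field("caf\u00e9", "iso-8859-1")   # "caf%E9"
utf8 = encode_field("caf\u00e9", "utf-8")          # "caf%C3%A9"
```

Nothing in the submitted data says which of the two the CGI script received.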

RFC 2070 tried to improve that by introducing an ACCEPT-CHARSET attribute on
the INPUT and TEXTAREA elements within forms. HTML 4 (the reference above) adopted that
but moved the attribute to the FORM element. The wording accompanying it is
pretty bad: the fact that it is supposed to influence browser behaviour is
very unclear. Anyway, to tell the browser you want the data in UTF-8, you're
supposed to say this:

 <FORM ACTION="..." METHOD="..." ACCEPT-CHARSET="UTF-8">
 .
 .
 .
 </FORM>

The problem is that most browsers will not listen, and still send the data
in the page's encoding.

People usually deal with that by using a hidden field in the form (<INPUT
TYPE="hidden">). You can put some text in there that will not be shown to
the user but that will be returned as part of the form data. By looking at
the bytes of that text, your CGI can determine its encoding (knowing the
characters in advance) and that is also the encoding of the rest of the
data. If that smells like a hack, looks like a hack and moves like a hack,
that's because it *is* a hack. But that's the best we have in the current
broken architecture of Web forms.
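
A minimal sketch of that hack in Python, under some loud assumptions: the
hidden field is named "_charset_probe" and contains "é" (both are my
inventions for the example), field names are plain ASCII, the body arrives
URL-encoded, and the pages are only ever served in the encodings listed in
CANDIDATES. The marker must have different bytes in each candidate encoding,
otherwise the first match wins silently.

```python
from urllib.parse import parse_qsl

MARKER_FIELD = "_charset_probe"        # hypothetical hidden-field name
MARKER_TEXT = "\u00e9"                 # "é", the text known in advance
CANDIDATES = ("utf-8", "iso-8859-1")   # encodings the pages are served in

def decode_form(body: str) -> dict[str, str]:
    """Guess the encoding from the hidden field, then decode every value."""
    # Latin-1 maps bytes 0-255 one-to-one, so decoding the percent-escapes
    # with it and re-encoding recovers the raw bytes the browser sent.
    raw = {
        name: value.encode("latin-1")
        for name, value in parse_qsl(body, keep_blank_values=True,
                                     encoding="latin-1")
    }
    probe = raw.get(MARKER_FIELD)
    for enc in CANDIDATES:
        # Compare the received bytes with the marker encoded in each candidate.
        if probe == MARKER_TEXT.encode(enc):
            return {name: value.decode(enc)
                    for name, value in raw.items() if name != MARKER_FIELD}
    raise ValueError("could not determine form encoding from hidden field")
```

With this, decode_form("_charset_probe=%E9&name=Fran%E7ois") and
decode_form("_charset_probe=%C3%A9&name=Fran%C3%A7ois") both give the same
Unicode string for "name", which is the point of the exercise.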

--
François Yergeau


