Re: HTML forms and UTF-8

From: Glen Perkins
Date: Mon Nov 08 1999 - 04:12:17 EST

De: François Yergeau

> > De: Erik van der Poel
> > Multilingual content is rare. Glen, do you need multilingual content?
> I can't speak for Glen, but I have had the occasion in the past to fill in
> my name and address in Japanese forms, with less than optimal results.
> Unicode is designed to solve such problems and Glen is taking the right
> route, IMHO, even though a broken Web protocol gets in the way.

I'd really like to take the Right Path of encoding the form in UTF-8 and
having it return the form data in UTF-8, so I could have a generic solution
of any language(s) going out and any language(s) coming back. It really does
have to work, though, or else the people I do it for, who don't know much
about i18n and therefore hate and oppose it, will say "See! We told you it
was a bad idea!" Urrrgh.

Do you know under what circumstances this is likely to work? Would it work,
say, for both IE and Netscape, versions 3 or later, on Win & Macs? I'd
certainly prefer to be more generic than that (support for unix being
particularly near to my heart), but current browser stats indicate that
anything that works on the above (NS/IE 3+ on Win/Mac) would cover a large
enough percentage of the market to be worth doing. Requiring version 4
browsers might even be tolerable now in many cases. (And I'm talking about
the Internet at large, not an intranet.)

> > In theory, if you can reliably label the charset of the HTML document
> > containing the form (via HTTP charset and HTML META charset), then the
> > form submission should be in that charset too. You can then simply
> > insert that charset label in the hidden input field too, and look at
> > that when the form submission arrives.
> Doesn't work through transcoding (incl. translation) servers. I've also
> heard stories of old Japanese browsers that would transcode the input to
> platform encoding and then forget what the original was. So forms are
> submitted in the platform encoding, regardless. Certainly broken,
> mostly extinct by now, but still shows how a bad protocol can come and
> you.

Yes, I obviously need to add to the above IE/NS on Win/Mac specification
that it work on all major language versions of those browsers.

> > So it might be
> > better to send the form in a traditional charset (such as Shift_JIS,
> > Big5, etc), so that a more beautiful font is likely to be used on the
> > user's side. You can then convert the form submission to UTF-8 on the
> > server side.
> After having reliably determined the charset...
> --
> François

So, François, it sounds as though your hack -- returning known data from a
hidden field to determine the encoding -- might be needed as a data
integrity check at the very least.

Now I'm wondering what such data would look like and what could be learned
from it. If I just put a bunch of bytes up there and they're echoed back at
me verbatim, what would that tell me? I can imagine putting up a page
encoded in Shift-JIS with a hidden field also in Shift-JIS, using the
ACCEPT-CHARSET="UTF-8" technique, and then testing the result to see whether
it came back as UTF-8, unchanged, or other. If unchanged, though, would that
mean the returned data really was Shift-JIS? It seems to me it could also be
Big-5, Latin-1, or any of several other encodings, returned by a browser
that used the default system encoding to encode form data.
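
A rough sketch of that three-way test, in Python purely for illustration.
The sentinel text, the page charset, and the function name are all
assumptions made up for the example, and it presumes the server sees the
hidden field as a percent-encoded string:

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical sentinel text placed in the hidden field (assumption).
SENTINEL = "日本語"
# Encoding the page, and therefore the hidden field, was served in (assumption).
PAGE_CHARSET = "shift_jis"


def classify_sentinel(raw_value):
    """Classify the percent-encoded hidden-field value sent back by the browser.

    Returns "utf-8" if the bytes are the sentinel encoded as UTF-8,
    "unchanged" if they are the bytes originally served, and "other"
    for anything else.
    """
    received = unquote_to_bytes(raw_value)
    if received == SENTINEL.encode("utf-8"):
        return "utf-8"
    if received == SENTINEL.encode(PAGE_CHARSET):
        return "unchanged"
    return "other"


# Simulate what three differently behaved browsers might send back.
as_utf8 = quote(SENTINEL.encode("utf-8"))
as_served = quote(SENTINEL.encode(PAGE_CHARSET))
as_euc = quote(SENTINEL.encode("euc_jp"))

print(classify_sentinel(as_utf8))
print(classify_sentinel(as_served))
print(classify_sentinel(as_euc))
```

Note that "unchanged" only says the bytes survived the round trip; by
itself it says nothing about how the user's typed input was encoded.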

I could also just skip the ACCEPT-CHARSET, if it seldom works anyway, and
put the pages up in Shift-JIS, Latin-1, Big-5, etc., and then hope the
returned data comes back in the encoding of the page, where I could convert
it to Unicode on the server. How would I know for sure what encoding it came
back in, though? In that case, the returned hidden field would be tested to
make sure it came back unchanged but, again, Big-5 could probably come back
unchanged by a Mainland Chinese browser that is actually sending me GB. What
would the fact that it came back unchanged tell me?
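
To make the worry concrete, here is a small sketch (Python, for
illustration only) that checks which of a few candidate encodings can
decode a given byte string without error. The candidate list and the
function name are assumptions; the point is only that "decodes cleanly"
is a weak test, since several encodings may accept the same bytes:

```python
# Candidate encodings to try (an arbitrary, illustrative list).
CANDIDATES = ["big5", "gb2312", "shift_jis", "euc_kr", "latin-1"]


def plausible_encodings(data, candidates=CANDIDATES):
    """Return the candidate encodings under which `data` decodes without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


# Bytes of a hidden field that was served as Big5 and echoed back verbatim.
echoed = "中文".encode("big5")
print(plausible_encodings(echoed))
```

Latin-1 accepts every byte sequence, and other double-byte encodings
overlap heavily with Big5's byte ranges, so a verbatim echo does not
single out one encoding.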

It seems as though this hack really wouldn't help unless you are trying to
use ACCEPT-CHARSET to convert the return stream to something other than the
encoding of the page. If it converts to UTF-8, you can be pretty confident
that the whole stream is UTF-8. If it's just echoed back verbatim, then
perhaps you can assume that it's in the encoding of the form...but I'm not
sure how reliable that assumption is.
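
That reasoning could be sketched on the server side roughly as follows
(Python, illustrative only). The field name "_probe", the sentinel text,
and the function name are hypothetical, and the form fields are assumed
to arrive as raw bytes after percent-decoding:

```python
def decode_submission(fields, page_charset,
                      sentinel_name="_probe", sentinel_text="日本語"):
    """Decode form fields to Unicode, using the hidden sentinel as a hint.

    Returns (decoded_fields, charset, trusted). `trusted` is True only when
    the sentinel was converted to UTF-8, which is the one case where the
    encoding of the whole stream is reasonably certain.
    """
    probe = fields.get(sentinel_name, b"")
    if probe == sentinel_text.encode("utf-8"):
        charset, trusted = "utf-8", True
    elif probe == sentinel_text.encode(page_charset):
        # Echoed verbatim: assume the page encoding, but flag it as a guess.
        charset, trusted = page_charset, False
    else:
        raise ValueError("sentinel came back altered; encoding unknown")

    decoded = {
        name: value.decode(charset)
        for name, value in fields.items()
        if name != sentinel_name
    }
    return decoded, charset, trusted


# Example: a browser that ignored ACCEPT-CHARSET and echoed Shift_JIS bytes.
submitted = {
    "_probe": "日本語".encode("shift_jis"),
    "name": "山田".encode("shift_jis"),
}
decoded, charset, trusted = decode_submission(submitted, "shift_jis")
print(decoded, charset, trusted)
```

In the untrusted case the user-typed fields can still raise a decode
error or decode to the wrong characters, which is exactly the
unreliability in question.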

It's new to me, so I haven't thought through what would and wouldn't work.
Any ideas? ;-)


Glen Perkins
