RE: HTML forms and UTF-8

From: François Yergeau (yergeau@alis.com)
Date: Sun Nov 07 1999 - 21:32:41 EST


> From: Erik van der Poel [mailto:erik@netscape.com]
> Date: Sunday, November 7, 1999 19:40
>
> I'm just curious, but how often do people actually use transcoding
> proxies?

Not very often I guess, but there are cases. Think of an automatic
translation server such as AltaVista's. Translating a Latin-1 English form
to Japanese, the result is sure to be in Shift-JIS or some such (correctly
labelled, we can presume). It's not going to be pretty if the translated
form, once filled in, is submitted to the original CGI, which expects Latin-1.
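
As a rough illustration of that failure, here is a minimal Python sketch. The
sample name and the variable names are assumptions made up for the example;
nothing here is taken from AltaVista's service or from any real CGI.

```python
# Hypothetical value typed into the translated (Shift_JIS) form.
name = "山田太郎"

# Bytes a browser would submit for a page labelled Shift_JIS.
submitted = name.encode("shift_jis")

# The original CGI knows nothing about the translation and assumes Latin-1.
# Latin-1 maps every byte to some character, so this never raises an error;
# it just silently yields the wrong text.
misread = submitted.decode("latin-1")

# Decoding with the charset the page was actually sent in recovers the value.
recovered = submitted.decode("shift_jis")

print(repr(misread))
print(recovered)
```

The point of the sketch is that nothing fails loudly: the CGI gets a
well-formed string of Latin-1 characters that simply is not what the user
typed.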

Another case is DynaText and similar servers that generate content on the
fly, deciding on the charset (often based on heuristics) only after having
received the HTTP request. Strictly speaking, these are not proxies but the
problem is the same: since HTTP is stateless, the CGI (the right hand of the
server) has no idea what the generator (the left hand) has just sent out and
therefore what charset the form submission will come in.

> Do these proxies automatically update the HTTP and HTML META
> charsets?

In the cases above, yes.

> Multilingual content is rare. Glen, do you need multilingual content?

I can't speak for Glen, but I have had occasion in the past to fill in
my name and address in Japanese forms, with less than optimal results.
Unicode is designed to solve such problems and Glen is taking the right
route, IMHO, even though a broken Web protocol gets in the way.

> In theory, if you can reliably label the charset of the HTML document
> containing the form (via HTTP charset and HTML META charset), then the
> form submission should be in that charset too. You can then simply
> insert that charset label in the hidden input field too, and look at
> that when the form submission arrives.

Doesn't work through transcoding (incl. translation) servers. I've also
heard stories of old Japanese browsers that would transcode the input to the
platform encoding and then forget what the original was. So forms are
submitted in the platform encoding, regardless. Certainly broken, probably
mostly extinct by now, but still shows how a bad protocol can come and bite
you.
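
A small Python sketch of the hidden-field idea and of how a transcoding
server defeats it. The field name `form_charset` and the helper
`decode_submission` are hypothetical, and the sketch assumes the transcoder
passes the hidden field's ASCII value through unchanged while the browser
submits in the charset of the page it actually received.

```python
from urllib.parse import parse_qs, quote

CHARSET_FIELD = "form_charset"  # hypothetical hidden-field name


def decode_submission(query):
    """Decode a form submission using the charset named in its hidden field."""
    # Parse as Latin-1 first: it maps each byte to exactly one character, so
    # the submitted bytes can be recovered with .encode("latin-1").
    raw = parse_qs(query, encoding="latin-1")
    charset = raw[CHARSET_FIELD][0]
    return {
        key: [value.encode("latin-1").decode(charset) for value in values]
        for key, values in raw.items()
    }


# Direct case: page sent as ISO-8859-1, browser submits ISO-8859-1 bytes.
direct = decode_submission(
    "form_charset=iso-8859-1&name=" + quote("François".encode("iso-8859-1"))
)

# Transcoded case: the browser saw a Shift_JIS page and submits Shift_JIS
# bytes, but the hidden label still claims iso-8859-1.
proxied = decode_submission(
    "form_charset=iso-8859-1&name=" + quote("山田".encode("shift_jis"))
)

print(direct["name"])
print(proxied["name"])
```

In the direct case the label is truthful and the name decodes correctly. In
the transcoded case the label is stale, and the server has no way to notice.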

> Some versions of Netscape do not have a useful default font for use with
> documents in the Unicode-based charsets (utf-8, etc). Even if the user
> has set a font for Unicode, it could be an ugly font.

Which just shows that the concept of a single "Unicode font" to support
Unicode encodings is insufficient and wanting. Tango implemented what is
today called font linking way back in 1995, averting the need for such
unwieldy fonts and allowing language-sensitive font selection.

I sincerely hope to see Netscape catching up soon, so that UTF-8 pages can
be shipped around without fear of browsers not configured for it or ugly
Unicode fonts. It will take a while for sufficient deployment, but then we
can have a truly Unicode Web (which will also alleviate the brokenness of
the form submission mechanism).

> So it might be better to send the form in a traditional charset (such as
> Shift_JIS, Big5, etc), so that a more beautiful font is likely to be used
> on the user's side. You can then convert the form submission to UTF-8 on
> the server side.

After having reliably determined the charset...
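
Why "reliably" is the hard part can be sketched in a few lines of Python. The
candidate list and the helper name `plausible_charsets` are illustrative
assumptions; the sketch only shows that decoding without error is not proof
of the right charset.

```python
CANDIDATES = ["shift_jis", "euc_jp", "big5", "latin-1"]  # illustrative list


def plausible_charsets(data, candidates=CANDIDATES):
    """Return the candidate charsets under which data decodes without error."""
    accepted = []
    for charset in candidates:
        try:
            data.decode(charset)
        except UnicodeDecodeError:
            continue
        accepted.append(charset)
    return accepted


# Bytes as submitted from a Shift_JIS page, with no trustworthy label.
submitted = "山田".encode("shift_jis")
matches = plausible_charsets(submitted)
print(matches)
```

Latin-1 accepts any byte sequence at all, so it always appears among the
matches alongside the true charset. A server that merely tries decoders until
one succeeds is guessing, and converting a wrong guess to UTF-8 just
preserves the damage.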

--
François


