RE: HTML forms and UTF-8

From: Tim Greenwood (
Date: Mon Nov 08 1999 - 10:21:03 EST

Glenn wrote:

> I'd really like to take the Right Path of encoding the form in UTF-8 and
> having it return the form data in UTF-8, so I could have a generic solution
> of any language(s) going out and any language(s) coming back. It really does
> have to work, though, or else the people I do it for, who don't know much
> about i18n and therefore hate and oppose it, will say "See! We told you it
> was a bad idea!" Urrrgh.
> Do you know under what circumstances this is likely to work? Would it work,
> say, for both IE and Netscape, versions 3 or later, on Win & Macs? I'd
> certainly prefer to be more generic than that (support for unix being
> particularly near to my heart), but current browser stats indicate that
> anything that works on the above (NS/IE 3+ on Win/Mac) would cover a large
> enough percentage of the market to be worth doing. Requiring version 4
> browsers might even be tolerable now in many cases. (And I'm talking about
> the Internet at large, not an intranet.)

I faced the same problem 2-3 years ago when designing our product. My first
thought was to use UTF-8 everywhere, but it does not work across the range
of browsers that you require. At that time I tested the browsers then
available, using the header, a hidden field, and both together, with each of
EUC, SJIS, JIS and UTF-8, to see whether the form sent its data back in the
character set it was issued in. The results were:

Netscape Communicator V4
 Japanese Windows 95

Netscape Navigator v3.01 (ja)
 Japanese Windows 95
 OK (no UTF-8 support)

Microsoft Internet Explorer V4
 US Windows 95

Microsoft Internet Explorer V4 Platform preview
 Japanese Windows 95
 OK except that utf-8 gives bad results (a Hebrew font!)

Tango (Alis) v3.0
 Japanese Windows 95

Accent Multilingual Mosaic v1.0
 Japanese Windows 95

Netscape Navigator 2.02 (US)
 Japanese Solaris
 JIS and EUC are OK, SJIS and UTF-8 show as EUC

Microsoft Internet Explorer V4.0
 English NT with Japanese Language pack
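
For anyone wanting to repeat this kind of test, here is a minimal Python
sketch of one way to check which character set a form came back in. It is
not the harness I used. It assumes a known sentinel string is placed in a
hidden field and that the server compares the returned bytes against each
candidate encoding. The names SENTINEL, CANDIDATES, detect_returned_charset
and came_back_as_issued are all hypothetical.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical sentinel text placed in a hidden form field (an assumption,
# not part of the original test).
SENTINEL = "日本語"

# Python codec names for the four character sets tested.
CANDIDATES = {
    "EUC-JP": "euc_jp",
    "Shift_JIS": "shift_jis",
    "ISO-2022-JP": "iso2022_jp",
    "UTF-8": "utf_8",
}


def detect_returned_charset(raw: bytes):
    """Return the label whose encoding of SENTINEL equals raw, else None."""
    for label, codec in CANDIDATES.items():
        if SENTINEL.encode(codec) == raw:
            return label
    return None


def came_back_as_issued(issued: str, field_value: str) -> bool:
    """field_value is the percent-encoded hidden field as the browser sent it."""
    return detect_returned_charset(unquote_to_bytes(field_value)) == issued
```

A browser that silently re-encodes the form (as the Solaris Navigator 2.02
result above suggests can happen) would show up as a mismatch between the
issued label and the detected one.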

The later versions of the major browsers seem to work well, but the last
time I looked (a while ago) there was still a problem: Netscape required the
user to manually configure a font for use with UTF-8. On a Japanese system
it chose a Western font as the default for UTF-8, so sending Japanese in
UTF-8 gave unreadable text. IE chose a Japanese font as its default and was
OK. This flaw may be corrected now, but I still do not like the way that
Communicator 4.6 handles UTF-8. The Edit Preferences form has you set fonts
for 'Encodings'. Most 'Encodings' are given appropriately, in terms that
users will understand (Western, Central European, ...), but the list
includes Unicode. So, to cover a reasonable set of scripts you have to
choose a single font covering a reasonable subset of the glyphs. I use
Bitstream Cyberbit, which is OK across a lot of scripts but not optimal for
the Latin script (which is the only one that I can judge). IE5 has you
select fonts per language script and does not separate out Unicode.
Presumably it breaks UTF-8 text up into its different scripts and associates
the appropriate font with each, allowing the user to get optimal fonts for
all scripts. (And though this is an opening for the old Han
unification/appropriate font argument, let's not go there, OK?)

The design choice that we made was to send and receive forms in an
appropriate encoding defined in a locale and registry, converting to Unicode
in the CGI for processing. Full details have been presented at Unicode
conferences; I can send a paper to anyone interested.
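
As a rough illustration of that design (not our product's code), the idea
can be sketched in Python as below. The registry contents, the locale names,
the Latin-1 fallback and the function names are all assumptions made for the
example. The form is issued in the locale's legacy encoding, and the
submitted data is decoded to Unicode as soon as it reaches the server.

```python
from urllib.parse import parse_qsl

# Hypothetical registry: locale -> (charset label for the header, Python codec).
LOCALE_ENCODINGS = {
    "ja_JP": ("Shift_JIS", "shift_jis"),
    "en_US": ("ISO-8859-1", "latin_1"),
    "ru_RU": ("KOI8-R", "koi8_r"),
}
DEFAULT = ("ISO-8859-1", "latin_1")  # assumed fallback


def render_form(locale: str, html: str):
    """Return the Content-Type header and the page bytes for this locale."""
    label, codec = LOCALE_ENCODINGS.get(locale, DEFAULT)
    header = f"Content-Type: text/html; charset={label}"
    return header, html.encode(codec)


def decode_form(locale: str, body: str) -> dict:
    """Decode a urlencoded form body to Unicode strings for processing."""
    _, codec = LOCALE_ENCODINGS.get(locale, DEFAULT)
    return dict(parse_qsl(body, encoding=codec, errors="replace"))
```

This relies on the browser returning the data in the encoding the page was
issued in, which the test results above suggest was a safer bet than UTF-8
on the older browsers.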

