FW: HTML forms and UTF-8

From: Addison Phillips (AddisonP@simultrans.com)
Date: Tue Nov 09 1999 - 00:24:06 EST

-----Original Message-----
From: Addison Phillips [mailto:AddisonP@simultrans.com]
Sent: Monday, November 08, 1999 8:36 AM
To: 'Glen Perkins'; 'Unicode List'
Subject: RE: HTML forms and UTF-8

> Multilingual content is rare. Glen, do you need multilingual content?

Actually, we ALL need to be thinking about this.

The problem is that Web systems can be used across borders (in a
multilingual situation) much more easily than was common in the past.

For example, I've been working on a system that does driving directions. If
you're Italian and want directions from Paris to Berlin, you're going to see
data returned in no fewer than three languages. You WANT to see the actual,
real labels that are on street signs (even if you can't read them--for
example, the hypothetical non-Asian-language-speaking Italian in Japan or
Thailand, et cetera).

This is actually a fairly trivial example of what I've been seeing in
e-commerce and web systems lately. Multilingual is a fact of life.


Addison Phillips
SimulTrans LLC

"22 languages. One release date."
-----Original Message-----
From: Glen Perkins [mailto:Glen.Perkins@nativeguide.com]
Sent: Monday, November 08, 1999 12:02 AM
To: Unicode List
Subject: Re: HTML forms and UTF-8

Thus saith Erik van der Poel <erik@netscape.com>:

> > But it breaks in many cases and does not
> > allow multilingual content (except when the page is in Unicode, of
> Multilingual content is rare. Glen, do you need multilingual content?

Well, yes and no. Yes because I'm asking in part simply because I don't know
the answer, so this is partially just a generic question. A generic answer
should cover multilingual content to be maximally generic. Support for
multilingual content is one of the best parts of Unicode.

On the other hand, most real applications, including the one I have in mind
at the moment, only require multiple monolingual pages. With the ability to
specify "answer me in UTF-8" on all of those parallel pages, I can use the
same server side code (e.g. a CGI program) to process the submitted
datastream, regardless of which of the localized pages it came from.
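
As a rough sketch of the "same server-side code" idea, the snippet below decodes a form submission on the assumption that every localized page really does submit UTF-8. The function name `parse_utf8_form`, the field name `city`, and the sample bodies are hypothetical and only serve the illustration.

```python
from urllib.parse import parse_qs


def parse_utf8_form(body: bytes) -> dict:
    """Decode an application/x-www-form-urlencoded body, assuming UTF-8."""
    # The body on the wire is ASCII: non-ASCII bytes arrive as %XX escapes.
    # parse_qs undoes the escapes and interprets the bytes as UTF-8;
    # errors="strict" makes a non-UTF-8 submission fail loudly.
    fields = parse_qs(
        body.decode("ascii"),
        keep_blank_values=True,
        encoding="utf-8",
        errors="strict",
    )
    # Keep only the first value of each field to keep the sketch short.
    return {name: values[0] for name, values in fields.items()}


# One function serves the French page and the Japanese page alike.
french = parse_utf8_form(b"city=Orl%C3%A9ans")
japanese = parse_utf8_form(b"city=%E6%9D%B1%E4%BA%AC")
```

This only works if the browser honors the request to answer in UTF-8, which is exactly the point under discussion in this thread.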

So, an answer that can handle multilingual forms would be the most useful,
in general, but an answer that only works for multiple, single-language
forms would probably suffice for most real purposes.

> > People usually deal with that by using a hidden field in the form (<INPUT
> > TYPE="hidden">). You can put some text in there that will not be shown to
> > the user but that will be returned as part of the form data. By looking at
> > the bytes of that text, your CGI can determine its encoding (knowing the
> > characters in advance) and that is also the encoding of the rest of the
> > data.
> In theory, if you can reliably label the charset of the HTML document
> containing the form (via HTTP charset and HTML META charset), then the
> form submission should be in that charset too. You can then simply
> insert that charset label in the hidden input field too, and look at
> that when the form submission arrives.

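The known-bytes technique quoted above might look roughly like the following. This is a sketch under assumptions, not anyone's actual CGI: the hidden field name `check`, the sentinel text, and the candidate charset list are all hypothetical, and `simulate_submission` merely stands in for a browser.

```python
from urllib.parse import quote, unquote_to_bytes

SENTINEL_TEXT = "日本語"  # hypothetical known text placed in the hidden field
CANDIDATE_CHARSETS = ["utf-8", "shift_jis", "euc_jp"]  # assumed candidates


def parse_raw_form(body: bytes) -> dict:
    """Split a urlencoded body into raw byte fields without decoding text."""
    fields = {}
    for pair in body.split(b"&"):
        name, _, value = pair.partition(b"=")
        # '+' means space in form encoding; %XX escapes become raw bytes.
        key = unquote_to_bytes(name.replace(b"+", b" "))
        fields[key] = unquote_to_bytes(value.replace(b"+", b" "))
    return fields


def detect_charset(sentinel_bytes: bytes):
    """Return the candidate charset whose encoding of the sentinel matches."""
    for charset in CANDIDATE_CHARSETS:
        if sentinel_bytes == SENTINEL_TEXT.encode(charset):
            return charset
    return None  # unknown encoding, or the sentinel was mangled in transit


def simulate_submission(charset: str) -> bytes:
    """Stand-in for a browser submitting the form in the given charset."""
    pairs = [("check", SENTINEL_TEXT), ("city", "東京")]
    encoded = (f"{name}={quote(value.encode(charset))}" for name, value in pairs)
    return "&".join(encoded).encode("ascii")


fields = parse_raw_form(simulate_submission("shift_jis"))
detected = detect_charset(fields[b"check"])
city = fields[b"city"].decode(detected)
```

The sentinel has to be representable in every charset the page might be served in, and distinct enough that no two candidates encode it to the same bytes.
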
Is it necessary to double label it, i.e. use both the HTTP header *and* the
meta tag equivalent of the HTTP header, or might the meta tag alone suffice?
I'd be interested to know the answer both in theory and in practice.
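
For concreteness, "double labeling" could look like the sketch below: a CGI-style response that states the charset once in the HTTP header and once in a META tag. It does not settle whether the META tag alone suffices; it only shows the two labels side by side. The function name `render_form_page` and the page content are hypothetical.

```python
def render_form_page(charset: str, body_html: str) -> bytes:
    """Build a CGI-style response labeled with the charset in two places."""
    # Label 1: the META equivalent of the HTTP header, inside the document.
    meta = f'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset={charset}">'
    html = f"<HTML><HEAD>{meta}</HEAD><BODY>{body_html}</BODY></HTML>"
    # Label 2: the real HTTP header, followed by the blank line that ends
    # the header block.
    header = f"Content-Type: text/html; charset={charset}\r\n\r\n"
    # The header is plain ASCII; the document is encoded in the named charset.
    return header.encode("ascii") + html.encode(charset)


response = render_form_page("shift_jis", "東京")
```

Both labels are generated from the same `charset` argument here, so they cannot disagree with each other or with the actual encoding of the bytes.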

> However, perhaps people have found that HTTP charset and HTML META
> charset do not work with certain browser versions, and have therefore
> come up with the hack mentioned above(?).
> Some versions of Netscape do not have a useful default font for use with
> documents in the Unicode-based charsets (utf-8, etc). Even if the user
> has set a font for Unicode, it could be an ugly font. So it might be
> better to send the form in a traditional charset (such as Shift_JIS,
> Big5, etc), so that a more beautiful font is likely to be used on the
> user's side. You can then convert the form submission to UTF-8 on the
> server side.
> This doesn't sound good, I know...

Well, for the purposes of getting multiple monolingual pages to work, it
might be more practical. It sounds as though specifying the encoding as one
of the returned hidden fields might be a way to get a bit more real-world
reliability. You would have each of your forms encoded in the most popular
legacy charset for whatever language it contained, and every form would
submit to the same server-side processing code, submitting the encoding name
hidden among the data items. It doesn't meet the theoretical need for nice,
multilingual pages. That's too bad for all of us on this mailing list, but
does this sound like a reliable solution for the multiple monolingual
situation without turning to the technique of returning some known bytes and
then analyzing them to determine their encoding?
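
A minimal sketch of this proposal, assuming each localized form carries its charset name in a hidden field and the shared handler converts everything to UTF-8. The field name `charset`, the allowlist, and the sample submissions are hypothetical, and the sketch takes it on trust that the browser really submitted in the labeled charset.

```python
from urllib.parse import quote, unquote_to_bytes

# Assumed set of charsets the localized pages are served in.
ALLOWED_CHARSETS = {"utf-8", "shift_jis", "big5", "iso-8859-1"}


def form_to_utf8(body: bytes, label_field: bytes = b"charset") -> dict:
    """Decode a form using its own hidden charset label; return UTF-8 values."""
    raw = {}
    for pair in body.split(b"&"):
        name, _, value = pair.partition(b"=")
        key = unquote_to_bytes(name.replace(b"+", b" "))
        raw[key] = unquote_to_bytes(value.replace(b"+", b" "))

    # The label itself is plain ASCII, so it can be read before the
    # encoding of the other fields is known.
    label = raw.pop(label_field).decode("ascii").lower()
    if label not in ALLOWED_CHARSETS:
        raise ValueError(f"unexpected charset label: {label!r}")

    # Transcode every remaining value from the labeled charset to UTF-8.
    return {
        name.decode(label): value.decode(label).encode("utf-8")
        for name, value in raw.items()
    }


# Simulated submissions from a Japanese page and a French page.
body_ja = ("charset=Shift_JIS&city=" + quote("東京".encode("shift_jis"))).encode("ascii")
body_fr = ("charset=ISO-8859-1&city=" + quote("Orléans".encode("iso-8859-1"))).encode("ascii")

result_ja = form_to_utf8(body_ja)
result_fr = form_to_utf8(body_fr)
```

Checking the label against an allowlist matters because the value comes back from the client and should not be handed to a decoder unexamined.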

Glen Perkins
