From: Philippe Verdy (firstname.lastname@example.org)
Date: Tue Jan 13 2004 - 18:59:01 EST
From: "Peter Kirk" <email@example.com>
> On 13/01/2004 13:35, Philippe Verdy wrote:
> > ...
> >If your form page uses ISO-8859-1, then specify explicitly the ISO-8859-1
> >encoding as the one to use for submitting forms, as an explicit attribute
> >of your <form> element. But then visitors won't be able to send characters
> >other than ISO-8859-1 in their form data, whether the form method is GET
> >with URL-encoding, or POST in standard form-data format.
> Is this actually true? Other characters can be entered into an
> ISO-8859-1 form in the format "&#nnn;"; or at least Mozilla 1.5 uses
> this format. I suspect this is what happened to me recently when I typed
> a schwa into a message in the webmail interface of a Yahoo group, and
> this appeared in my mail received from the group as "&#601;" - because
> the message source contained "&amp;#601;". The problem seems to be that
> the process reading the form data was not expecting this format and so
> took the & as a literal rather than as an escape.
It's true that you can pre-fill the form data within your HTML page encoded
with ISO-8859-1, using numeric character references to specify
non-ISO-8859-1 characters. If you try to submit it with a form specifying
that it should be encoded with ISO-8859-1, the browser may not notice that
this pre-filled data (which still appeared correct in the rendered form) was
bogus and normally impossible to encode with ISO-8859-1.
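To make that concrete, here is a minimal Python sketch (Python is only my
choice for illustration; the field value and the helper name fits_in_latin1
are hypothetical). It shows that the page source itself is pure ISO-8859-1,
while the value the user sees in the rendered form is not.

```python
import html

# Pre-filled field value as written in the source of an ISO-8859-1 page
# (hypothetical example value).
prefilled_source = "schwa: &#601;"

# What the rendered form shows: the reference is resolved to U+0259.
rendered_value = html.unescape(prefilled_source)


def fits_in_latin1(text):
    """Return True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_latin1(prefilled_source))  # the page source is plain ASCII
print(fits_in_latin1(rendered_value))    # the displayed value is not encodable
```

So the page is valid ISO-8859-1, but the data the form would have to submit
is not.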
What browsers do when they find form data that cannot be encoded with the
specified charset is still unpredictable. Normally the data in the form
should be reencoded in the specified encoding. But the browser should signal
immediately to the user that some pre-filled data in the form is bogus, and
some characters will immediately appear as "?". If the browser does not do
so, because it prefers to render the form even with bogus data that is
impossible to submit as is, then the browser should check that the edited
form data can be safely encoded into the target encoding specified in the
form, or the encoding of the HTML page if it is not specified.
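The check I have in mind could look roughly like the following Python sketch.
This is only an assumption about how a user agent might do it, not what any
browser actually does; unencodable_chars and encode_field are hypothetical
names, and page_charset stands for the encoding of the HTML page.

```python
def unencodable_chars(text, charset):
    """List the distinct characters of text that charset cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


def encode_field(text, form_charset=None, page_charset="iso-8859-1"):
    """Encode a field value, refusing to lose characters silently."""
    # Use the charset given on the form, else fall back to the page encoding.
    target = form_charset or page_charset
    bad = unencodable_chars(text, target)
    if bad:
        # This is the point where the user should be told about the problem.
        raise ValueError("cannot encode %r in %s" % (bad, target))
    return text.encode(target)


# The silent alternative: unencodable characters degrade to "?".
lossy = "\u0259".encode("iso-8859-1", errors="replace")
print(lossy)
```

The ValueError stands in for whatever warning the browser would show; the
last two lines show the lossy "?" substitution that happens when nobody
checks.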
Most HTML forms I have seen almost never specify the encoding for submitting
form data. So most browsers assume that form data uses the same encoding as
the HTML page, even if there are numeric character references.
But your claim that a browser would send form data containing numeric
character references is wrong here: it violates the format needed for forms
submitted by the "GET" method (should be UTF-8 unless something else is
specified or the page is not encoded with UTF-8, and then URL-encoded), or
the "POST" method.
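For comparison, here is a small Python sketch of the two behaviours discussed
in this thread, using the standard urllib.parse module. The field name "q" is
hypothetical, and the "xmlcharrefreplace" error handler is only my stand-in
for the numeric-reference fallback Peter describes, not a description of
Mozilla's actual code.

```python
from urllib.parse import parse_qs, urlencode

fields = {"q": "\u0259"}  # a schwa typed into a form field

# UTF-8 submission: the character is percent-encoded byte by byte.
utf8_query = urlencode(fields, encoding="utf-8")

# Strict ISO-8859-1 submission: the character simply cannot be encoded.
try:
    urlencode(fields, encoding="iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Fallback to a numeric character reference, which is then URL-encoded.
ncr_query = urlencode(fields, encoding="iso-8859-1",
                      errors="xmlcharrefreplace")

# A receiver that only URL-decodes sees the reference as literal text.
received = parse_qs(ncr_query, encoding="iso-8859-1")

print(utf8_query)
print(latin1_ok)
print(ncr_query)
print(received)
```

The last step is the situation Peter reports: unless the receiving script
also resolves the reference, the "&" is taken as a literal.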
I don't know which formats other than these two are supported by
browsers, but I think that browsers should now adopt some XML format for
form data submitted by "POST". This way, browsers will be able to use
character references for characters not supported in the selected target
encoding.
As UTF-8 is also the default encoding for XML files, browsers would in fact
need to specify it in the XML declaration of their POST'ed form data.
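As a rough idea of what such a payload might look like, here is a Python
sketch using the standard xml.etree.ElementTree module. The element names
"form-data" and "field" and the helper build_form_xml are entirely
hypothetical; I am not aware of a registered schema, so treat this as an
assumption and not as an existing format.

```python
import xml.etree.ElementTree as ET


def build_form_xml(fields, charset):
    """Serialize form fields as XML in charset (hypothetical layout)."""
    root = ET.Element("form-data")
    for name, value in fields.items():
        field = ET.SubElement(root, "field", {"name": name})
        field.text = value
    # Characters outside charset are written as numeric character
    # references, and the XML declaration names the encoding.
    return ET.tostring(root, encoding=charset)


payload = build_form_xml({"message": "\u0259"}, "iso-8859-1")
print(payload)

# An XML parser on the receiving side resolves the reference again.
decoded = ET.fromstring(payload)
print(decoded.find("field").text == "\u0259")
```

Because the character reference is part of XML itself, the receiver needs no
private convention to recover the schwa.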
Is there now a defined schema for sending POST data with a registered
media-type supported by browsers and that could be specified as the
format attribute of the HTML form? Will Apache or script processors like PHP
support this new XML-formatted form data, instead of the legacy URL-formatted
data and the poor, INI-like, POST variable assignments?
Browsers that don't support the new format would still use the default
GET and POST, but there it would be impossible to encode all characters if
the target submission encoding is not UTF-8. Such impossibility to encode
these characters properly in the submitted form data should be signaled to
the user, instead of the data being sent unreliably and invisibly. I think
it's a deficiency, and something that the W3C has not specified with enough
precision for it to be corrected in Internet Explorer-based and Mozilla
Gecko-based browsers and in Opera (which together are now more than 98% of
the total browser market).