From: Philippe Verdy (email@example.com)
Date: Wed Jan 14 2004 - 08:42:11 EST
From: "Peter Kirk" <firstname.lastname@example.org>
> On 13/01/2004 15:59, Philippe Verdy wrote:
> >From: "Peter Kirk" <email@example.com>
> > ...
> >>Is this actually true? Other characters can be entered into an
> >>ISO-8859-1 form in the format "&#nnn;"; or at least Mozilla 1.5 uses
> >>this format. I suspect this is what happened to me recently when I typed
> >>a schwa into a message in the webmail interface of a Yahoo group, and
> >>this appeared in my mail received from the group as "&#601;" - because
> >>the message source contained "&amp;#601;". The problem seems to be that
> >>the process reading the form data was not expecting this format and so
> >>took the & as a literal rather than as an escape.
> >It's true that you can pre-fill the form data within your HTML page
> >with ISO-8859-1 using numeric character entities to specify
> >characters. If you try to submit it with a form specifying that it should
> >be encoded with ISO-8859-1, the browser may not notice that this pre-filled
> >data (which still appeared correct in the rendered form) was bogus and
> >normally impossible to encode with ISO-8859-1.
> Just to clarify: the data I was entering was not bogus, but was exactly
> what I wanted to enter and was legal content for the e-mail which I
> wanted to send to the list. The error was at Yahoo, or possibly in my
> browser, in not supporting the characters which I wanted to use. I was
> not informed of any restriction or problem.
Can you post the URL of your entry form or an HTML snapshot of your form
page? It may reveal whether the problem is in the HTML page itself, which does
allow pre-filling an entry form with characters that won't be mapped
correctly to the charset specified for submitted data.
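
The failure Peter describes can be reproduced in a few lines. This is only a sketch of the suspected mechanism, not the actual Yahoo or Mozilla code: the function names browser_encode and naive_server_decode are made up for illustration, and it assumes the browser falls back to numeric character references for characters the form charset cannot represent.

```python
from urllib.parse import quote_plus, unquote_plus

def browser_encode(value: str, charset: str = "iso-8859-1") -> str:
    """Hypothetical browser behaviour: replace characters the charset
    cannot represent with &#nnn; references, then URL-encode the bytes."""
    raw = value.encode(charset, errors="xmlcharrefreplace")
    return quote_plus(raw)

def naive_server_decode(wire: str, charset: str = "iso-8859-1") -> str:
    """Hypothetical form handler: URL-decodes the value but does not
    expand numeric character references."""
    return unquote_plus(wire, encoding=charset)

wire = browser_encode("schwa: ə")
print(wire)                       # the '&', '#' and ';' travel percent-encoded
print(naive_server_decode(wire))  # the handler sees a literal reference
```

The handler cannot tell this apart from a user who really typed the seven characters "&#601;", which is why expanding references on the server side is not a safe general fix.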
I have seen some references to the new XForms schema, but it is not usable
in HTML 4, because it requires recoding the entry form with a <model> section
(the <form> element is obsolete in XForms).
I would have preferred to see a formal W3C proposal specifying an XML
submission format usable in HTML 4 entry forms. For now, I think it's a
violation of the format defined for <form method="POST"> or of the URL
encoding for <form method="GET"> to use numeric character entities, as
neither submission format is XML. Browsers should inform their users that some
of their form data cannot be encoded safely in the target charset if this
specified or implied charset is not a Unicode encoding scheme (UTF-8,
UTF-16, UTF-32, SCSU) or a Unicode compatible encoding.
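
The check a browser would need before warning the user is small. A minimal sketch, assuming the target charset is known by a name Python's codec registry understands; unencodable_chars is a hypothetical helper, not an existing browser API:

```python
def unencodable_chars(value: str, charset: str) -> list[str]:
    """Return the distinct characters of `value` that cannot be
    represented in `charset`, in order of first appearance."""
    seen: list[str] = []
    for ch in value:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in seen:
                seen.append(ch)
    return seen

# A user agent could warn before submitting when this list is non-empty.
problem = unencodable_chars("a schwa: ə", "iso-8859-1")
if problem:
    print("Cannot be sent in ISO-8859-1:", ", ".join(problem))
```

With a Unicode encoding scheme such as UTF-8 as the target, the list is always empty for well-formed text, which matches the exception made above.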
Such a format could have deprecated the old format for POST data, if
it had used a well-defined and standardized XML schema, immediately
recognizable by web servers like Apache or script engines like PHP. Basically
it should consist of an unordered list of (form input id, form input value)
pairs, both elements of the pair being codable as text elements or element
attribute values and accepting numeric character entities for any character
that can't be encoded in the target charset. It would be compatible with
XForms by using an implicit model, associated with the specified format
registered and documented by the W3C.
So instead of using <form> with implicit method="GET" and
enctype="application/x-www-form-urlencoded", or <form method="POST"
enctype="multipart/form-data">, which both assume a default
accept-charset="UNKNOWN" (which may be taken to mean the charset used to
transmit the HTML document containing the form), I would have liked to see:
<form method="POST" enctype="text/xml-formdata">
which specifies that the server will be able to process form data encoded as
an XML document conforming to the XML schema specified by the registered MIME
type "text/xml-formdata", this XML document being preferably encoded with
the "ISO-8859-1" charset, using XML numeric character entities if needed to
represent characters that can't fit in ISO-8859-1...
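
To make the idea concrete, here is a rough sketch of what such a submission could look like. No schema exists, so the element names "formdata" and "field" and the "id" attribute are pure assumptions, as are the helper names; the only point is that an XML serializer targeting ISO-8859-1 emits numeric character references legitimately, and any XML parser restores them.

```python
import xml.etree.ElementTree as ET

def build_formdata(pairs: dict, charset: str = "iso-8859-1") -> bytes:
    """Serialize (input id, input value) pairs as an XML document.
    Element and attribute names are hypothetical, not a W3C schema."""
    root = ET.Element("formdata")
    for field_id, value in pairs.items():
        field = ET.SubElement(root, "field", {"id": field_id})
        field.text = value
    # Characters outside the charset are written as &#nnn; references.
    return ET.tostring(root, encoding=charset)

def parse_formdata(document: bytes) -> dict:
    """Recover the pairs; the parser expands character references."""
    root = ET.fromstring(document)
    return {f.get("id"): (f.text or "") for f in root.iter("field")}

doc = build_formdata({"subject": "schwa", "body": "café ə"})
print(doc.decode("iso-8859-1"))
print(parse_formdata(doc))
```

Unlike the urlencoded case, a literal "&#601;" typed by the user stays distinguishable here, because the serializer escapes its ampersand as "&amp;".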
I don't know if such an enctype value is supported in browsers, or if there's
any agreement about the (quite basic) XML schema to which it should
correspond. Without it, the only solution is for web servers and script
engines to be updated to decode the POST data correctly using the charset
indicated in its "Content-Type:" header (or the headers of each part in the
case of "multipart/form-data"). This is really a problem for HTML form pages
not encoded with a UTF encoding scheme: Do browsers have to use the
accept-charset attribute of <form> elements? Are they allowed to switch to
UTF-8 and specify this encoding in the submitted data in
"application/x-www-form-urlencoded" or "multipart/form-data" content-types?
If so, it seems logical that your form processor will see data encoded with
UTF-8 even though your HTML form page was encoded with ISO-8859-1 and has no
accept-charset attribute (whose default value is "UNKNOWN", but not
necessarily the same as the charset used in the HTML form page...).
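
A server-side decoder that honours the declared charset might look like the sketch below. It assumes the client actually sends a charset parameter in the "Content-Type:" header, which is not guaranteed; decode_form_body and its fallback argument are hypothetical names, and the fallback stands for whatever charset the form page itself used.

```python
from email.message import Message
from urllib.parse import parse_qs

def decode_form_body(body: bytes, content_type: str,
                     fallback: str = "iso-8859-1") -> dict:
    """Decode an application/x-www-form-urlencoded body using the charset
    parameter of the Content-Type header, or `fallback` if it is absent."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset(fallback)
    # The body itself is ASCII; the charset applies to the %XX escapes.
    return parse_qs(body.decode("ascii"), encoding=charset,
                    keep_blank_values=True)

body = b"word=%C9%99"  # a schwa submitted as UTF-8 bytes
print(decode_form_body(body, "application/x-www-form-urlencoded; charset=UTF-8"))
print(decode_form_body(body, "application/x-www-form-urlencoded"))
```

The second call shows the situation described above: the same bytes, read with the page's ISO-8859-1 charset because nothing said otherwise, come out as two unrelated characters.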
----
For reference: http://www.w3.org/TR/html4/interact/forms.html

[quote]
17.3 The FORM element

<!ELEMENT FORM - - (%block;|SCRIPT)+ -(FORM) -- interactive form -->
<!ATTLIST FORM
  %attrs;                              -- %coreattrs, %i18n, %events --
  action      %URI;          #REQUIRED -- server-side form handler --
  method      (GET|POST)     GET       -- HTTP method used to submit the form--
  enctype     %ContentType;  "application/x-www-form-urlencoded"
  accept      %ContentTypes; #IMPLIED  -- list of MIME types for file upload --
  name        CDATA          #IMPLIED  -- name of form for scripting --
  onsubmit    %Script;       #IMPLIED  -- the form was submitted --
  onreset     %Script;       #IMPLIED  -- the form was reset --
  accept-charset %Charsets;  #IMPLIED  -- list of supported charsets --
  >

Start tag: required, End tag: required

Attribute definitions

action = uri [CT]
    This attribute specifies a form processing agent. User agent behavior
    for a value other than an HTTP URI is undefined.

method = get|post [CI]
    This attribute specifies which HTTP method will be used to submit the
    form data set. Possible (case-insensitive) values are "get" (the
    default) and "post". See the section on form submission for usage
    information.

enctype = content-type [CI]
    This attribute specifies the content type used to submit the form to
    the server (when the value of method is "post"). The default value for
    this attribute is "application/x-www-form-urlencoded". The value
    "multipart/form-data" should be used in combination with the INPUT
    element, type="file".

accept-charset = charset list [CI]
    This attribute specifies the list of character encodings for input
    data that is accepted by the server processing this form. The value is
    a space- and/or comma-delimited list of charset values. The client
    must interpret this list as an exclusive-or list, i.e., the server is
    able to accept any single character encoding per entity received.
    The default value for this attribute is the reserved string "UNKNOWN".
    User agents may interpret this value as the character encoding that
    was used to transmit the document containing this FORM element.

accept = content-type-list [CI]
    This attribute specifies a comma-separated list of content types that
    a server processing this form will handle correctly. User agents may
    use this information to filter out non-conforming files when prompting
    a user to select files to be sent to the server (cf. the INPUT element
    when type="file").

name = cdata [CI]
    This attribute names the element so that it may be referred to from
    style sheets or scripts. Note. This attribute has been included for
    backwards compatibility. Applications should use the id attribute to
    identify elements.
[/quote]

Clearly there's nothing in this normative reference that allows a browser to
send numeric character entities in form data submitted with
"application/x-www-form-urlencoded" (format specified in the HTTP reference
RFC 2616 for query strings in URLs) or "multipart/form-data" (format
specified in the MIME multipart specification). However, the last sentence in
the paragraph describing the accept-charset attribute contains the word
"may", which by normative definition allows a browser to use a charset other
than the one suggested in the accept-charset attribute of the <form>
element...