Re: German characters not correct in output webform

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jan 14 2004 - 09:59:08 EST

    From: Peter Kirk
    > Well, attached is a Yahoo groups form (saved by my browser) similar to
    > the one which caused me problems.

    The "Reply" form in Yahoo Groups is coded in "windows-1252".
    It uses the following form declaration:

    <form method="post" action="/group/...">

    If it had not indicated a value for the "method" attribute, the submission
    method would have been "GET" by default. Instead, here, the submission will
    be a POSTed entity.

    It does not indicate a value for the missing "enctype" attribute, so the
    form data is encoded using "Content-Type: application/x-www-form-urlencoded"
    (the browser should indicate this Content-Type: header explicitly for the
    POSTed entity).

    The form does not specify an "accept-charset" attribute, so the browser is
    expected to behave as if accept-charset="UNKNOWN" and, as specified in the
    HTML reference, it "may" use the charset of the HTML form page, i.e.
    "windows-1252". The browser is still not allowed to encode
    non-windows-1252 characters that are part of the form data as numeric
    character entities. This is not supported by the
    "application/x-www-form-urlencoded" content type, which just consists of
    a "&"-separated list of "name=value" pairs, in which the bytes that encode
    the characters of each "name" or "value" and that are not URL-safe are
    written as %XX triplets, one triplet per byte.
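
    As an illustration of the encoding described above, here is a minimal
    Python sketch. The field name "message" is taken from the textarea quoted
    later in this post; the German text is invented for the example, and
    Python's urllib stands in for whatever the browser does internally.

```python
from urllib.parse import urlencode

# Hypothetical form data (the text is made up for illustration).
form_data = {"message": "Grüße & Küsse"}

# Each character is first converted to windows-1252 bytes. Every byte that is
# not URL-safe is then written as a %XX triplet, and a space becomes "+".
body_1252 = urlencode(form_data, encoding="windows-1252")

print(body_1252)  # message=Gr%FC%DFe+%26+K%FCsse
```

    Note that the literal "&" inside the value is itself escaped as %26, so
    it cannot be confused with the "&" that separates the pairs.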

    The indicated submission format does not allow sending anything other than
    windows-1252 characters, so any character in the form data that does not
    exist in this charset should be detected and rejected by the browser,
    which should ask the user either to modify the form data or to accept that
    some characters will be replaced by '?' once converted to windows-1252.
    The other solution would be to use UTF-8 (the encoding recommended for
    URLs) instead of windows-1252 before performing the URL-encoding. This is
    what should be done, but the missing "accept-charset" attribute, which
    means UNKNOWN, leaves it unclear what browsers should do, notably because
    the GET method does not allow specifying explicitly the charset used to
    create the URL-encoded query string. But as we are using the POST method,
    there is an entity attached to the HTTP POST request.

    This entity created by the browser should then specify the charset actually
    used:
        Content-Type: application/x-www-form-urlencoded; charset=windows-1252
    (if it uses the suggestion given by the HTML4 reference of using the same
    charset as the HTML page), or:
        Content-Type: application/x-www-form-urlencoded; charset=UTF-8

    The entity body should then consist of the URL-encoding of a
    "&"-separated list of "name=value" pairs, encoded with the charset
    indicated by the Content-Type header above.
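
    The two alternatives can be compared side by side with a small Python
    sketch. The helper name build_post_entity and the sample text are
    assumptions made for this example; the headers simply follow what this
    post says the browser should send, not what any particular browser
    actually does.

```python
from urllib.parse import urlencode

def build_post_entity(fields, charset):
    """Return (headers, body) for a form-urlencoded POST in `charset`.

    Hypothetical helper: it labels the entity with the charset that was
    really used to produce the %XX triplets in the body.
    """
    body = urlencode(fields, encoding=charset)
    headers = {
        "Content-Type": f"application/x-www-form-urlencoded; charset={charset}",
        # The body is pure ASCII after URL-encoding, so len() is a byte count.
        "Content-Length": str(len(body)),
    }
    return headers, body

fields = {"message": "Grüße"}  # invented sample text
headers_1252, body_1252 = build_post_entity(fields, "windows-1252")
headers_utf8, body_utf8 = build_post_entity(fields, "UTF-8")

print(body_1252)  # message=Gr%FC%DFe
print(body_utf8)  # message=Gr%C3%BC%C3%9Fe
```

    The same text yields different bytes in each charset, which is why the
    receiving server needs the charset parameter to decode the body reliably.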

    If the browser chooses the first suggestion, then it won't be able to encode
    any non-windows-1252 character. But the browser can still use the second
    solution without needing any numeric character entities (which are only
    needed within XML/HTML/XHTML documents, and have no meaning in an
    "application/x-www-form-urlencoded" document).

    There is NOTHING in this form that allows a browser to use a numeric
    character entity "&#601;". This is true even if the form data present in the
    HTML form page was filled in using numeric character entities like "&#601;",
    each of which is supposed to encode a single character and not the
    6-character string "&#601;".

    Note that the Yahoo reply form uses this element to feed the reply text:
    <textarea name="message" rows=20 cols=70 wrap="hard">content of the
    message</textarea>
    where the "content of the message" will probably contain numeric or named
    character entities, like "&gt; " at the beginning of each quoted line in the
    initial reply text. If there's a "&#601;" there, it means a single character
    that is part of the displayed initial text, and that your browser should
    display correctly within the rendered form. However, if your browser
    submits the form using the "windows-1252" charset, it won't be able to send
    that character correctly, as the submission format is
    "application/x-www-form-urlencoded". So the browser should either ask the
    user to edit the message until this non-windows-1252 character is removed
    or replaced, or ask the user for permission to replace it with "?".
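
    The situation can be reproduced in a few lines of Python. This is only a
    sketch: the textarea content and the helper unencodable_chars are invented
    for the example, and Python's codec error handlers stand in for the
    browser's behaviour.

```python
import html

# Hypothetical initial textarea content, as written in the HTML source.
source_value = "&gt; Gr&#252;&#223;e, &#601;"

# What the user sees and edits: every entity is decoded to ONE character.
text = html.unescape(source_value)  # "> Grüße, ə"

def unencodable_chars(text, charset):
    """List the distinct characters of `text` that `charset` cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad

# The browser should detect this and ask the user what to do.
bad = unencodable_chars(text, "windows-1252")  # ["ə"]

# Option the user may accept: replacement by "?".
with_question_marks = text.encode("windows-1252", errors="replace")

# The behaviour criticised in this post: a numeric character reference is
# smuggled into data that is not an HTML/XML document.
buggy = text.encode("windows-1252", errors="xmlcharrefreplace")

print(bad, with_question_marks, buggy)
```

    With the last variant the server receives the six bytes "&#601;" and has
    no way to tell them apart from a user who really typed those six
    characters.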

    If your browser silently encodes it with a numeric character reference, this
    violates all standards. In this case, this is a bug in the browser, which
    would have done better to silently use the UTF-8 encoding if it does not
    want to bother the user by asking permission to replace characters with "?".
    The alert prompt should, however, be displayed by the browser before
    submitting the form data if the form had specified an "accept-charset"
    attribute naming the "windows-1252" charset explicitly and exclusively,
    without allowing "UTF-8". In that case the browser does not have the
    "UNKNOWN" default value of a missing "accept-charset" attribute, which
    is, in my opinion, the only case where the charset suggested by the HTML
    form page encoding may be silently replaced by another, preferably UTF-8 as
    it keeps all characters present in the form data.
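
    The policy argued for in this post can be summarised as a short Python
    sketch. The function choose_submission_charset and its parameters are
    hypothetical; this is one reading of the opinion expressed above, not
    behaviour mandated by the HTML specification.

```python
def choose_submission_charset(text, page_charset, accept_charset=None):
    """Return the charset to submit with, or None if the user must be asked.

    `accept_charset` is the list of charsets from the form's accept-charset
    attribute, or None when the attribute is missing (i.e. "UNKNOWN").
    """
    def can_encode(charset):
        try:
            text.encode(charset)
            return True
        except UnicodeEncodeError:
            return False

    if accept_charset:
        # Explicit list: only these charsets may be used. If none fits,
        # the browser should prompt the user before replacing characters.
        for charset in accept_charset:
            if can_encode(charset):
                return charset
        return None

    # Missing attribute: the page charset is only a suggestion, so the
    # browser may silently fall back to UTF-8, which loses no character.
    return page_charset if can_encode(page_charset) else "UTF-8"

print(choose_submission_charset("Grüße", "windows-1252"))
print(choose_submission_charset("ə", "windows-1252"))
print(choose_submission_charset("ə", "windows-1252", ["windows-1252"]))
```

    The three calls print windows-1252, UTF-8 and None respectively; the
    last result corresponds to the case where an alert prompt is required.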



    This archive was generated by hypermail 2.1.5 : Wed Jan 14 2004 - 10:40:08 EST