Re: FW: information request; using unicode in HTML form; urlencoded

From: addison@inter-locale.com
Date: Fri Oct 06 2000 - 13:42:57 EDT


Hi Hung Le,

Two quick points:

1. Unicode comes in several encodings. The 16-bit encoding you describe in
your message below (called UCS-2 or UTF-16) is generally considered
inappropriate for use on the Web. There is an 8-bit multibyte encoding
called UTF-8 that is generally more appropriate to the Web. See the FAQ on
the http://www.unicode.org website for more information on these.

2. The web browser percent-encodes the byte stream, not the characters: it
first converts the text to bytes in the form's character encoding and then
writes each byte that needs escaping as a %xx value. So a single non-ASCII
character in UTF-8 becomes two, three, or four %xx groups in a row, in the
same way that you would encode, say, Big5 or Shift-JIS in a URL (in UTF-8,
ASCII is ASCII). Decoding these gives you the original byte (octet)
sequence, which then represents the data values. UTF-16 would be encoded
in the same way, although browsers generally don't support UTF-16 natively
and may be confused if you send them that encoding.
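
As a rough illustration of point 2, here is a minimal Python sketch of the
round trip: text to bytes, bytes to %xx groups, and back again. The helper
names encode_form_value and decode_form_value are made up for this example,
and UTF-8 as the form's character encoding is an assumption, not something
a given browser or server is guaranteed to use.

```python
from urllib.parse import quote_plus, unquote_plus


# Hypothetical helper names; "utf-8" as the form charset is an assumption.
def encode_form_value(text, charset="utf-8"):
    """Convert text to bytes in `charset`, then percent-encode those bytes."""
    return quote_plus(text, encoding=charset)


def decode_form_value(encoded, charset="utf-8"):
    """Undo the %xx escaping to recover the bytes, then decode as `charset`."""
    return unquote_plus(encoded, encoding=charset)


# ASCII stays ASCII, a space becomes "+", and each non-ASCII character
# turns into one %xx group per byte of its encoded form.
for sample in ["abc", "a b", "\u00e9", "\u4e2d", "\U00010348"]:
    encoded = encode_form_value(sample)
    assert decode_form_value(encoded) == sample  # nothing is lost
    print(ascii(sample), "->", encoded)

# The same mechanism applies to a legacy multibyte encoding such as Shift-JIS.
print(encode_form_value("\u65e5", "shift_jis"))
```

With UTF-8, U+00E9 comes out as %C3%A9 (two groups), U+4E2D as %E4%B8%AD
(three), and U+10348 as %F0%90%8D%88 (four). The server has to decode with
the same character encoding the browser used, or the bytes will be
misread.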

Best Regards,

Addison

===========================================================
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:addison@inter-locale.com

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

On Fri, 6 Oct 2000, Magda Danish (Unicode) wrote:

>
>
> -----Original Message-----
> From: Hung Le [mailto:hle@comergent.com]
> Sent: Thursday, October 05, 2000 3:21 PM
> To: 'info@unicode.org'
> Subject: information request; using unicode in HTML form; urlencoded
>
>
> Hi,
>
> Our company is exploring the idea of using Unicode in our web pages.
> We ran into a problem that, despite researching for the last two weeks,
> we have not been able to solve. The problem is related to passing text
> from an HTML form to the webserver.
>
> From the user's perspective:
> . we present the user with a web page containing a form
> . the user fills in the form
> . the user clicks "Submit"
> . the browser posts the entered data to the server
>
> From what I can gather so far, the data flow is as follows:
> . when the user clicks the submit button, the browser urlencodes the
> data using the following algorithm:
>
> The ASCII characters 'a' through 'z', 'A' through 'Z', and '0' through '9'
> remain the same.
> The space character ' ' is converted into a plus sign '+'.
> All other characters are converted into the 3-character string "%xy", where
> xy is the two-digit hexadecimal representation of the lower 8 bits of the
> character.
>
> The last rule will clip a Unicode character to an 8-bit representation,
> and thus the data entered into the HTML form will not make it back to the
> web server.
>
> Do you have experience in this area? How does one capture the data in
> an HTML form in Unicode and send it along when the user clicks the
> "Submit" button?
>
> Thanks for any help you can provide.
>


