RE: a little more help understanding diacritical encoding

From: Paul Deuter (PaulD@plumtree.com)
Date: Thu Sep 25 2003 - 16:31:20 EDT

Next message: Martin Duerst: "AddDefaultCharset considered harmful (was: Mojibake on my Web pages)"

Previous message: Michael Everson: "Re: About that alphabetician..."
Maybe in reply to: Steve Pruitt: "a little more help understanding diacritical encoding"

It would appear that your server side is in Java.
There is a well-known issue in older versions of the
Java servlet spec that causes the request class to
assume that %HH-encoded octets are 8859-1 octets.
It seems that this is your problem.
The workaround is to get the parameters from the
request object and turn them back into bytes and
then re-interpret them as UTF-8 (because that is
what they are).

The code to do that looks like this:

String strFoo = new String(request.getParameter("whatever").getBytes("ISO-8859-1"), "UTF-8");
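
To see why the round trip works, here is a minimal standalone sketch that runs without a servlet container. The class name Latin1Repair and the method reinterpretAsUtf8 are made up for illustration; the input string simulates what getParameter would return if the container decoded %C3%89 as 8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the servlet API.
class Latin1Repair {
    // Undo a wrong ISO-8859-1 decode: recover the original octets,
    // then decode those octets as UTF-8.
    static String reinterpretAsUtf8(String misdecoded) {
        byte[] octets = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(octets, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+0089 is what the octets C3 89 become when read as 8859-1.
        String fixed = reinterpretAsUtf8("\u00C3\u0089");
        System.out.println((int) fixed.charAt(0)); // 201, i.e. U+00C9
    }
}
```

This works only because 8859-1 maps every octet to exactly one character, so the wrong decode loses no information. Applying the same step to a parameter that was already decoded correctly would corrupt it.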

-Paul

-----Original Message-----
From: Steve Pruitt [mailto:SPruitt@exstream.com]
Sent: Thursday, September 25, 2003 9:03 AM
To: unicode@unicode.org
Subject: a little more help understanding diacritical encoding

Thanks for the excellent responses. I now understand how C3 and 89 are derived. I tried to set everything up the way I understood the list responses. The scenario is:
I have a page with some diacritical characters displayed, an input text box, and a submit button. I copy and paste one of the displayed characters into the input box and then submit. What is submitted gets echoed back. The pages use style sheets, so I cut and pasted the relevant tags, etc.

I thought I had found the problem. My response had a character encoding of null. I read that null defaults to 8859-1, which seemed consistent with my echoed page. So I explicitly set the response character encoding to UTF-8 via the setContentType method.

I used a TCP tunneler to see what my request and responses look like. My browser is set to utf-8 also.

From the tunneler, my request had the following posted data: v904=%C3%89. This is correct according to how the UTF-8 encoding algorithm was explained.
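
The posted value can be checked with a small sketch that takes the UTF-8 octets of É (U+00C9) and writes each one as %HH. The class Utf8Octets and the method percentEncodeAll are hypothetical names; unlike a real form encoder, this one escapes every octet, including plain ASCII, so it is only meant to show where C3 and 89 come from.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; escapes every octet, ASCII included.
class Utf8Octets {
    static String percentEncodeAll(String value) {
        StringBuilder out = new StringBuilder();
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so the octet prints as two uppercase hex digits.
            out.append(String.format("%%%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("v904=" + percentEncodeAll("\u00C9")); // v904=%C3%89
    }
}
```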

The http response had the following:

Content-Type: text/html; charset=UTF-8 this is correct.

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8"> is a child in the <head> tag

<span class="text29">É ê ë í î ï ð ñ ó ô õ ö</span> These are the characters listed on the previous page that I cut and pasted from; they are listed on this page just for reference (#201 = C9 is É).

<span class="text17">Accented Characters from  previous form:  Ã </span>
This is echoed back. #195 = C3 and #137 = 89. These, of course, are displayed as Ã?.
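
The two numbers above can be reproduced by decoding the same two octets under each charset. This is a sketch only; DecodeComparison and codePoints are illustrative names, and the octets are hard-coded rather than read from a real request.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration.
class DecodeComparison {
    // Decode the octets with the given charset and return the resulting code points.
    static int[] codePoints(byte[] octets, Charset charset) {
        return new String(octets, charset).codePoints().toArray();
    }

    public static void main(String[] args) {
        byte[] octets = {(byte) 0xC3, (byte) 0x89}; // the posted %C3%89
        // Read as 8859-1: two characters, #195 and #137.
        System.out.println(Arrays.toString(codePoints(octets, StandardCharsets.ISO_8859_1))); // [195, 137]
        // Read as UTF-8: one character, #201.
        System.out.println(Arrays.toString(codePoints(octets, StandardCharsets.UTF_8))); // [201]
    }
}
```

#137 (U+0089) is a control character with no visible glyph, which is probably why the browser shows it as "?".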

I checked the browser again to be sure, and its encoding is still set to utf-8. This is everything I know to check. What am I missing?

