URLEncode international characters

From: Raghu Kolluru (raghu.kolluru@dig.com)
Date: Mon Oct 02 2000 - 18:11:48 EDT


I have a simple servlet which reads the form fields and stores them in a SQL
Server db. Now I am trying to store and retrieve international characters
(charset EUC-JP).

The problem I am having is this:
The first time I send the characters, Java reads them as ASCII and returns
some junk to the browser (IE 5.5). Now here is the interesting thing: when I
append the same characters to the junk and submit it, the newly appended
text appears fine in the browser.

Question:
I think that the first time the browser encodes the text in ASCII, and that
later it encodes it properly. Is there any way that I can solve this? Any
help is greatly appreciated.

Raghs
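
A minimal sketch of the kind of mismatch described above, assuming the form
page is served as EUC-JP and the browser therefore submits EUC-JP bytes as
percent-escapes. It is not the servlet from this message; the class and
method names (FormCharsetSketch, decodeField, repair) are hypothetical. It
shows that the same escaped bytes decode correctly only when the decoder is
told the page's charset, and that a value already decoded as ISO-8859-1 can
be recovered because that decoding is lossless byte for byte.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormCharsetSketch {

    // Decode a raw percent-encoded form value with an explicit charset.
    public static String decodeField(String rawValue, String pageCharset)
            throws UnsupportedEncodingException {
        return URLDecoder.decode(rawValue, pageCharset);
    }

    // Recover a value that was already decoded as ISO-8859-1: turn the
    // characters back into the original bytes, then decode those bytes
    // with the charset the page was actually sent in.
    public static String repair(String misdecoded, String pageCharset)
            throws UnsupportedEncodingException {
        return new String(misdecoded.getBytes("ISO-8859-1"), pageCharset);
    }

    public static void main(String[] args) throws Exception {
        // Three Japanese characters, written as escapes so the source
        // file itself stays ASCII.
        String original = "\u65E5\u672C\u8A9E";

        // What a browser would submit for a page served as EUC-JP.
        String raw = URLEncoder.encode(original, "EUC-JP");

        // Decoding with the wrong charset yields junk characters.
        String wrong = decodeField(raw, "ISO-8859-1");

        System.out.println("escaped form value : " + raw);
        System.out.println("wrong == original  : " + wrong.equals(original));
        System.out.println("right == original  : "
                + decodeField(raw, "EUC-JP").equals(original));
        System.out.println("repaired == original: "
                + repair(wrong, "EUC-JP").equals(original));
    }
}
```

In a container that implements Servlet 2.3 or later, calling
request.setCharacterEncoding("EUC-JP") before reading any parameter should
have the same effect as the explicit decode here. Whether this is the cause
of the behaviour described above is an assumption.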

> -----Original Message-----
> From: Raghu Kolluru
> Sent: Monday, October 02, 2000 10:52 AM
> To: 'addison@inter-locale.com'
> Subject: RE: Major site in unicode?
>
>
> Great! Thanks.
>
> > -----Original Message-----
> > From: addison@inter-locale.com [mailto:addison@inter-locale.com]
> > Sent: Monday, October 02, 2000 10:24 AM
> > To: Unicode List
> > Cc: Unicode List
> > Subject: RE: Major site in unicode?
> >
> >
> > It knows because:
> >
> > 1. You sent the page in that character set, or;
> > 2. You embedded a token in the page to tell the CGI program what the
> > character set was, or;
> > 3. You used the (IE only) hack to get the browser to embed it in a
> > hidden field, or;
> > 4. You guessed it based on a heuristic (or from the user's session
> > information, maintained in the URL or in a cookie).
> >
> > This sounds complex, but it isn't all that bad. Very few users will be
> > foolish enough to change their display encoding to something that
> > displays the page incorrectly...
> >
> > Actually, all this talk of "setting browser to Unicode" and "setting the
> > browser to code page" is based on a poor assumption or set of
> > assumptions. What's getting set is the character encoding of the HTML
> > page itself. If done correctly, the browser will read it from the HTTP
> > header and(or) the META tag.
> >
> > The current best practice for creating multilingual capable web sites
> > (even if they happen to be mono-lingual at any one URL) is to use
> > Unicode (either UTF-8 or UTF-16, depending on your operating
> > environment) internally at the server. A decision can be made to deliver
> > either UTF-8 or a non-Unicode legacy encoding at page delivery time. At
> > this point in time, most pages are NOT delivered as UTF-8, even though
> > the server-side systems are entirely Unicode, because of the problems
> > cited earlier with older Netscape and IE browsers and their still
> > relatively large market share.
> >
> > Choosing this architecture allows you to construct single-source code
> > systems, access databases and data warehouses, and build applications
> > in a locale independent way. This vastly simplifies maintenance,
> > testing, and deployment compared to legacy charset systems.
> >
> > ... many programmers, of course, would like to eliminate the complexity
> > of the charset conversion at delivery time, and this day is coming. I
> > suggest that you parse UserAgent strings at the start of a session with
> > a user and determine if UTF-8 can be sent to the browser (it can in the
> > majority of cases and the vast majority of Western and Eastern European
> > cases: Asian locales are the big hangup here), and set the result into
> > the session (see #4 above).
> >
> > Hope this helps.
> >
> > Addison
> >
> > ===========================================================
> > Addison P. Phillips Principal Consultant
> > Inter-Locale LLC http://www.inter-locale.com
> > Los Gatos, CA, USA mailto:addison@inter-locale.com
> >
> > +1 408.210.3569 (mobile) +1 408.904.4762 (fax)
> > ===========================================================
> > Globalization Engineering & Consulting Services
> >
> > On Mon, 2 Oct 2000, Raghu Kolluru wrote:
> >
> > > > > >> I assume that "the ISO standard" refers to ISO/IEC 8859-1 and
> > > > > >> possibly 8859-2 as well. Unicode is an ISO standard too
> > > > > >> (ISO/IEC 10646-1).
> > > > > >
> > > > > > So if my browser is set to ISO 8859-1 or ISO 8859-2, but a
> > > > > > Central European or Western European site is only in Unicode,
> > > > > > then all will show up correctly?
> > > >
> > > > If your browser is old enough that it can only be "set to" a single
> > > > character set, and this setting cannot be overridden by a
> > > > "charset=X" tag in the HTML page, then no, it will not be displayed
> > > > correctly. But this sort of rigidity is not present in modern
> > > > browsers.
> > >
> > > How does the CGI program know that the data submitted is of
> > > "charset=EUC-JP"?
> > >
> > > Raghu Kolluru, Software Engg.
> > > GO.com | Walt Disney Internet Group
> > > 206-664-4267 | raghu.kolluru@dig.com
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Doug Ewell [mailto:dewell@compuserve.com]
> > > > Sent: Sunday, October 01, 2000 11:48 PM
> > > > To: Unicode List
> > > > Subject: Re: Major site in unicode?
> > > >
> > > >
> > > > > >> I assume that "the ISO standard" refers to ISO/IEC 8859-1 and
> > > > > >> possibly 8859-2 as well. Unicode is an ISO standard too
> > > > > >> (ISO/IEC 10646-1).
> > > > > >
> > > > > > So if my browser is set to ISO 8859-1 or ISO 8859-2, but a
> > > > > > Central European or Western European site is only in Unicode,
> > > > > > then all will show up correctly?
> > > >
> > > > If your browser is old enough that it can only be "set to" a single
> > > > character set, and this setting cannot be overridden by a
> > > > "charset=X" tag in the HTML page, then no, it will not be displayed
> > > > correctly. But this sort of rigidity is not present in modern
> > > > browsers.
> > > >
> > > > > >> The browser you are thinking of is Netscape Navigator
> > > > > >> (pre-4.7). Support for Unicode in all browsers is improving
> > > > > >> steadily, and as it does, your 'adamant' programmers will end
> > > > > >> up using Unicode-encoded sites without even realizing it.
> > > > >
> > > > > > When? 5 years from now? As for using Unicode without
> > > > > > realizing it, what do you mean? If a Russian's browser is set
> > > > > > to CP1251, what happens if the site is in Unicode? At present
> > > > > > he gets garbage. I've tried the setting that automatically
> > > > > > changes to the character set of the page. Doesn't work very
> > > > > > well. I think the character set indication gets left out in
> > > > > > many sites.
> > > >
> > > > Browsers are supposed to be able to switch automatically to the
> > > > character set used by the target page, but they cannot necessarily
> > > > do this blindly by auto-detecting the character set. It is supposed
> > > > to be indicated by the page using the "charset=X" tag. Sites that
> > > > do not do this are not giving browsers a fair chance to display the
> > > > page properly. This is not the fault of Unicode or the browser, but
> > > > of the HTML author.
> > > >
> > > > > > I don't disagree with this. It's just at present moment,
> > > > > > Netscape and Explorer don't seem ready. What would really be
> > > > > > needed is the browser automatically detects the site as being
> > > > > > in Unicode, and switches to that character set. Then sites
> > > > > > could switch over without worry. That is not the case at the
> > > > > > moment. So the user has to change the character set himself.
> > > >
> > > > Try using a recent version of your favorite browser (IE version 5.0
> > > > or above, or NN version 4.7 or above).
> > > >
> > > > I think the real problem here is that you, your team, and your users
> > > > in Russia are working with older versions of software that did not
> > > > properly handle Unicode, and are assuming that newer versions will
> > > > not support Unicode either. Thankfully, this is not the case.
> > > >
> > > > -Doug Ewell
> > > > Fullerton, California
> > > >
> > >
> >
>
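
The quoted replies say that the character set is supposed to be announced
through a "charset=X" label in the HTTP header or META tag. As a rough
sketch of reading that label on the receiving side, assuming a header value
of the usual "type/subtype; charset=name" shape, the following pulls the
charset name out of a Content-Type string and falls back to a caller-chosen
default when none is present. The class and method names
(ContentTypeCharset, charsetOf) are hypothetical, and the parsing is
deliberately simplified.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class ContentTypeCharset {

    // Return the charset named in a Content-Type value, or the fallback
    // when the label is missing, malformed, or unknown to this JVM.
    public static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (!p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                continue;
            }
            String name = p.substring("charset=".length()).trim();
            // Strip optional surrounding double quotes.
            if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                name = name.substring(1, name.length() - 1);
            }
            try {
                if (!name.isEmpty() && Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name: ignore it and use the fallback.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=EUC-JP", "ISO-8859-1"));
        System.out.println(charsetOf("text/html", "ISO-8859-1"));
    }
}
```

The fallback parameter corresponds to the "guess" case in the quoted list:
when the page carries no label, the receiver has to pick something.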



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:14 EDT