RE: Unicode in a URL

From: Paul Deuter (Paul.Deuter@plumtree.com)
Date: Thu Apr 26 2001 - 12:16:42 EDT


Thanks Addison.

I appreciate that the UTF-8 solution is the "right" one.
However, we must acknowledge that this "right" solution does not
appear to be implemented anywhere. And I have come to the
conclusion that it will not be, either.

The reason is the one that you mentioned: the %XX format is
already being used by browsers as a generic way to encode character
data in any character set. A server cannot assume UTF-8 when
it knows that there is already a common practice of using this
format for other character sets.
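
To make the ambiguity concrete, here is a minimal Python sketch (not part of the original exchange; it uses only the standard library's urllib.parse). It shows that the same character yields different %XX sequences depending on the character set, and that a receiver guessing the wrong character set either gets garbage or an error.

```python
from urllib.parse import quote, unquote

# The character U+00E9 (e with acute accent), percent-encoded under two charsets.
as_utf8 = quote("\u00e9", encoding="utf-8")         # '%C3%A9'
as_latin1 = quote("\u00e9", encoding="iso-8859-1")  # '%E9'

# A receiver that assumes ISO-8859-1 for UTF-8 data silently gets two
# wrong characters (U+00C3 U+00A9) instead of one.
wrong = unquote(as_utf8, encoding="iso-8859-1")

# A receiver that assumes UTF-8 for ISO-8859-1 data fails in strict mode,
# because the lone byte 0xE9 is not a valid UTF-8 sequence.
try:
    unquote(as_latin1, encoding="utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Nothing in the %XX bytes themselves says which character set produced them, which is the problem described above.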

I stumbled onto the %uXXXX format by accident and was happy to
find that IIS 5.0 decoded it correctly. It is not too surprising
though - because it seems that the need for an "unequivocal" way
of encoding a character would be tremendous. After all, that is
the whole reason that Unicode exists in the first place.

I am wondering if there isn't a need for the Unicode Spec to also
dictate a way of encoding Unicode in an ASCII stream. Perhaps
the %uXXXX is already that and I am just ignorant. Another
alternative would be to use the U+XXXX format that already has
been in wide use in the literature.
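
As an illustration of what the %uXXXX form looks like, here is a rough Python sketch. The function names encode_percent_u and decode_percent_u are hypothetical, and the sketch assumes that each XXXX is one UTF-16 code unit (so characters outside the BMP become two %u escapes). It leaves ASCII untouched, including a literal "%", so it is not a complete escaping scheme.

```python
import re

def encode_percent_u(text: str) -> str:
    """Hypothetical helper: write each non-ASCII character as %uXXXX escapes."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)  # ASCII is passed through unchanged (assumption)
            continue
        units = ch.encode("utf-16-be")  # 2 bytes, or 4 for a surrogate pair
        for i in range(0, len(units), 2):
            out.append("%u{:04X}".format(int.from_bytes(units[i:i + 2], "big")))
    return "".join(out)

def decode_percent_u(text: str) -> str:
    """Hypothetical helper: reverse encode_percent_u."""
    # Turn every %uXXXX into the code unit it names.
    units = re.sub(r"%u([0-9A-Fa-f]{4})",
                   lambda m: chr(int(m.group(1), 16)), text)
    # Round-trip through UTF-16 so that surrogate pairs are rejoined.
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
```

For example, encode_percent_u("\u65e5\u672c") gives "%u65E5%u672C", and decode_percent_u turns that back into the original two characters regardless of any page character set.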

-Paul

Paul Deuter
Internationalization Manager
Plumtree Software
paul.deuter@plumtree.com
 

-----Original Message-----
From: addison@inter-locale.com [mailto:addison@inter-locale.com]
Sent: Wednesday, April 25, 2001 9:20 PM
To: Paul Deuter
Cc: Unicode List (E-mail)
Subject: Re: Unicode in a URL

Actually, your first solution (the W3C recommendation) is the "right" one,
generally speaking. At least you can use the server's native URL parser
in almost all cases to get the bits back. The problem that you're
encountering is not unusual, though.

In particular, IE will by default encode all of the URL *except* the part
after the "?" as UTF-8 for you. The data put there will be encoded as the
encoding of the page itself (as experienced by the browser). NN 4.x and
other browsers generally give you the encoding of the page for the whole
URL.
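
A small Python sketch of the split behavior described above, for illustration only. The helper name ie_style_url and the query parameter name "q" are made up, and the function simply mimics the description (path in UTF-8, query in the page's encoding) rather than any browser's actual code.

```python
from urllib.parse import quote

def ie_style_url(path: str, query_value: str, page_encoding: str) -> str:
    """Hypothetical helper: path as UTF-8, query value in the page encoding."""
    encoded_path = quote(path, encoding="utf-8")  # "/" is left as-is by default
    encoded_query = quote(query_value, encoding=page_encoding)
    return encoded_path + "?q=" + encoded_query

# The same word ends up as different bytes on either side of the "?"
# when the page is served as ISO-8859-1.
url = ie_style_url("/caf\u00e9", "caf\u00e9", "iso-8859-1")
# url == "/caf%C3%A9?q=caf%E9"
```

A browser that uses the page encoding for the whole URL would, under the same assumptions, send "/caf%E9?q=caf%E9" instead.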

What you receive on the server side, therefore, is page dependent. How you
interpret it (or, rather, how the server interprets it) generally depends
on the page being retrieved and/or the configuration of the server
itself. The best way to handle it (from a reliability point of view) is to
use UTF-8 for everything and to reinterpret the URL using code. The idea
most servers (like Apache, for example) use is that the page being
retrieved is somehow similar to the request for the page, e.g. the page is
in the same encoding. This is not unreasonable, but again it implies that
everything is UTF-8.
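
One possible reading of "reinterpret the URL using code" is sketched below in Python. This is an assumption about the approach, not something taken from any particular server: decode the raw %XX bytes as strict UTF-8 first and fall back to a configured legacy encoding only if that fails. The function name and the default fallback are hypothetical.

```python
from urllib.parse import unquote_to_bytes

def decode_query_value(raw: str, fallback_encoding: str = "iso-8859-1") -> str:
    """Hypothetical helper: prefer UTF-8, fall back to a legacy page encoding."""
    data = unquote_to_bytes(raw)  # undo %XX only; "+" is not turned into a space
    try:
        return data.decode("utf-8")  # strict: invalid sequences raise
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the sender used the page encoding.
        return data.decode(fallback_encoding, errors="replace")
```

With this sketch, both "caf%C3%A9" and "caf%E9" come back as the same four-character string. It is only a heuristic: some legacy-encoded byte sequences happen to be valid UTF-8 and would be misread, which is why serving every page as UTF-8 remains the more reliable option.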

Of course, you get to encounter the famous 4.x browser font problem in
Asian locales, so it's a tradeoff there.

Hope that helps.

Addison

===============================================================
Addison P. Phillips Globalization Architect
webMethods, Inc http://www.webmethods.com
Sunnyvale, CA, USA mailto:aphillips@webmethods.com

+1 408.210.3569 (mobile) +1 408.962.5487 (ofc)
===============================================================
"Internationalization is not a feature. It is an architecture."

On Wed, 25 Apr 2001, Paul Deuter wrote:

> I am struggling to figure out the correct method for encoding Unicode
> characters in the query string portion of a URL.
>
> There is a W3C spec that says the Unicode character should be converted to
> UTF-8 and then each byte should be encoded as %XX. From my experience
> however, browsers will encode all character sets this way and IIS at least
> will interpret such hex bytes according to the character set that is set on
> the receiving page. That is to say, the target page will read the query
> string and these hex bytes may be interpreted as ISO-8859-1 or Big5 or
> Shift-JIS depending on the target page.
>
> With IIS 5.0, I have stumbled onto the solution of using %uXXXX where XXXX
> is the hexadecimal value of the Unicode character. When I pass Unicode data
> formatted this way on Windows 2000/IIS5 - the data always seems to be
> decoded properly. (Apparently this format came from ECMAScript.)
>
> I don't particularly like the %uXXXX format (primarily because it does NOT
> work on NT 4.0 - IIS 4.0) and I doubt that it would work at all well on
> other web servers. Does anyone know of an encoding method that will
> actually be properly decoded by a variety of web servers?
>
> Thanks in advance
> -Paul
>
>
>


