RE: How-To handle i18n when you don't know charset?

From: Chris Wendt (christw@MICROSOFT.com)
Date: Thu Jul 06 2000 - 15:13:08 EDT


You can get the charset much more easily:

IE5 and later versions of IE fill a field named "_charset_" with the charset
used for form submission, regardless of the initial value of this field.
Other browsers return data in the charset of the <FORM> page, so if you
can set the charset of the <FORM> page you can also set this field yourself
to tell the CGI which charset was used.
This works the same for the GET and POST methods.
IE4 and IE5 will submit characters that do not fit into the charset used for
form submission as HTML numeric character references (&#12345;).
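
For the receiving end, here is a minimal sketch of turning those numeric
character references back into characters once the value has been decoded
to a string. The function name decodeNumericReferences is made up for this
example, and it handles only the decimal form shown above. Note that it
cannot tell a browser-generated reference from one the user typed literally.

```javascript
// Replace decimal numeric character references such as &#233; with the
// characters they denote. (Hypothetical helper, not part of any library.)
function decodeNumericReferences(text) {
  return text.replace(/&#([0-9]+);/g, (match, digits) => {
    const codePoint = parseInt(digits, 10);
    // Leave values beyond the Unicode range untouched instead of throwing.
    if (codePoint > 0x10ffff) return match;
    return String.fromCodePoint(codePoint);
  });
}
```

Text without any references passes through unchanged, so it should be safe
to apply to every submitted value.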

The simplest approach is to use UTF-8 throughout and label your <FORM> page
with it; you just need to block browsers below version 4 or code specially
for them.

-----Original Message-----
From: Mike Brown [mailto:mbrown@corp.webb.net]
Sent: Thursday, July 06, 2000 11:19 AM
To: Unicode List
Subject: RE: How-To handle i18n when you don't know charset?

> What is the best way to handle i18n when you are passed a string and
> you don't know the charset? I assume iso-8859-1 when I don't know the
> charset BUT on some Spanish environments my data is coming out
> garbage. It seems some of the characters are coming from iso-8859-2
> (at least that's my first look).
>
> My component that handles processing of the string data is separate
> from the GUI where the user enters the data. The UI doesn't pass me
> any charset information.

Is the GUI collecting data through an HTML form? Browsers are intentionally
disregarding the recommendations and sending form data without charset
information "to keep old scripts from breaking". That's the argument I
heard, anyway. I'm fighting this battle, myself. What is the receiving end
to do? I can tell you what I came up with.

If the browser is IE4 or IE5, there is an undocumented MS DHTML property,
document.charset, which will tell you what charset the browser used to
interpret the bytes of the HTML document, and in IE4/IE5's case, this will
also be the charset used in the form submission. Here's an HTML snippet I
use to report what the browser is assuming the HTML document's charset is:

<script type="text/javascript"><!--
// document.charset and document.defaultCharset are IE-specific properties.
if (navigator.userAgent.toLowerCase().indexOf("msie") != -1 &&
    parseInt(navigator.appVersion, 10) >= 4) {
  document.write("<p><tt>Since you are using IE 4.0 or higher and do not " +
    "have scripting disabled, I can tell that this generated HTML document " +
    "is being interpreted by the browser as <u>" + document.charset +
    "</u> and that the browser's default encoding happens to be <u>" +
    document.defaultCharset + "</u>.</tt></p>");
}
//--></script>

In theory you could pass this as a hidden parameter in the form dataset and
then the receiving application can know to look for it. However this will
require being able to re-scan the bytes in the form data part of the HTTP
message so that they can be properly interpreted, so a typical one-pass HTTP
servlet will not suffice. I'm not sure how it works in IE3 although I read
that the charset for form data submissions will be determined by the OS's
locale in that browser. Netscape Navigator 4.x is no better. Haven't tested
Mozilla.
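
To make the "re-scan" concrete, here is a rough two-pass sketch. It assumes
the raw query string is still available, that the hidden parameter is named
"_charset_", and that ISO-8859-1 is an acceptable fallback; all three are
assumptions, as are the function names. TextDecoder is the standard
Encoding API decoder, and which labels it accepts depends on the runtime.

```javascript
// Turn one application/x-www-form-urlencoded component into raw bytes.
function percentDecodeToBytes(component) {
  const bytes = [];
  for (let i = 0; i < component.length; i++) {
    const hex = component.slice(i + 1, i + 3);
    if (component[i] === "+") {
      bytes.push(0x20);
    } else if (component[i] === "%" && /^[0-9a-fA-F]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2;
    } else {
      bytes.push(component.charCodeAt(i) & 0xff);
    }
  }
  return Uint8Array.from(bytes);
}

function parseForm(query, charsetField = "_charset_", fallback = "iso-8859-1") {
  // Pass 1: split the pairs but keep names and values as undecoded bytes.
  const raw = query.split("&").filter(Boolean).map((pair) => {
    const eq = pair.indexOf("=");
    const name = eq === -1 ? pair : pair.slice(0, eq);
    const value = eq === -1 ? "" : pair.slice(eq + 1);
    return [percentDecodeToBytes(name), percentDecodeToBytes(value)];
  });
  // The charset label is plain ASCII, so it is readable before we know
  // how to interpret anything else.
  const ascii = (bytes) => String.fromCharCode(...bytes);
  const entry = raw.find(([name]) => ascii(name) === charsetField);
  let decoder;
  try {
    decoder = new TextDecoder(entry ? ascii(entry[1]) : fallback);
  } catch (e) {
    decoder = new TextDecoder(fallback); // unknown or empty label
  }
  // Pass 2: interpret every name and value with the announced charset.
  const result = {};
  for (const [name, value] of raw) {
    result[decoder.decode(name)] = decoder.decode(value);
  }
  return result;
}
```

With this, parseForm("name=caf%E9&_charset_=iso-8859-1") and
parseForm("name=caf%C3%A9&_charset_=utf-8") should both yield the same
"name" value.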

Regardless of the browser, you could also examine the Accept-Language HTTP
header, the highest priority value in which you can take and map to a
*likely* charset by relying on your environment's Locale resource bundles
(Java Servlet Programming, pages 380-394) and a table of fallback mappings.
However, this approach makes some really bad assumptions and is at best a
stab in the dark.
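
As a sketch of that guess: pick the highest-weighted language tag from the
header and look up its primary language in a fallback table. The table
below is a hypothetical, deliberately tiny example, not an authoritative
mapping, and the function name and the ISO-8859-1 default are assumptions
too.

```javascript
// Hypothetical fallback table: primary language tag -> a *likely* charset.
const LIKELY_CHARSET = new Map([
  ["en", "iso-8859-1"],
  ["es", "iso-8859-1"],
  ["pl", "iso-8859-2"],
  ["cs", "iso-8859-2"],
  ["ru", "koi8-r"],
  ["ja", "shift_jis"],
]);

function likelyCharset(acceptLanguage, fallback = "iso-8859-1") {
  const ranked = acceptLanguage
    .split(",")
    .map((item) => {
      const [tag, ...params] = item.trim().split(";");
      const q = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      const weight = q ? parseFloat(q.slice(2)) : 1; // no q means q=1
      return {
        tag: tag.trim().toLowerCase(),
        weight: Number.isNaN(weight) ? 0 : weight,
      };
    })
    .filter((entry) => entry.tag && entry.tag !== "*")
    .sort((a, b) => b.weight - a.weight);
  if (ranked.length === 0) return fallback;
  // "es-MX" and "es" are looked up under the same primary language "es".
  const primary = ranked[0].tag.split("-")[0];
  return LIKELY_CHARSET.get(primary) || fallback;
}
```

The result is only a starting guess; the Spanish case in this thread is
exactly the kind of input where it can be wrong.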

Some applications just outright put a select box in the form and rely on the
user to pick the language they're using. This still makes some assumptions,
though, because as you pointed out with Spanish, there's not always a single
charset for each language.

> Since I'm in a Java environment, isn't there a way to go
> to UTF-8 and from UTF-8 determine the corresponding ISO
> (and other) charset?

No, there's nothing special about UTF-8 in this instance. You're dealing
with a mystery sequence of bytes. You know they represent characters, but
you don't know how the mappings work. Is it a one-to-one mapping of bytes to
characters, or are some bytes taken 2, 3 or 4 at a time? You don't even know
that much. Which bytes or byte sequences map to which characters? UTF-8 is a
charset that maps 1 to 6 bytes to a character; ISO-8859-x is a charset that
maps 1 byte to a character. (Before someone corrects me, I'm using the
definition of charset as per UTR #17, and yes, I realize that charsets have
bytes that map to non-characters.)
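
A small illustration of the same bytes read two ways, using the standard
TextEncoder and TextDecoder objects. The variable names are only for this
example, and it assumes the runtime supports the "iso-8859-1" label.

```javascript
// U+00E9 (e with acute accent) encoded as UTF-8 takes two bytes: C3 A9.
const utf8Bytes = new TextEncoder().encode("\u00e9");

// Read as a one-byte-per-character charset, the two bytes become two
// unrelated characters (U+00C3 and U+00A9).
const misread = new TextDecoder("iso-8859-1").decode(utf8Bytes);

// Read as UTF-8, the same two bytes give back the single original character.
const reread = new TextDecoder("utf-8").decode(utf8Bytes);
```

Neither decoding reports an error; only the reader can tell which result
is the intended text.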

Once you assume a charset, the only way you're going to know whether it was
the right choice, aside from recognizing invalid byte sequences for certain
charsets like UTF-8 and UTF-16[BE/LE], is when you look at the characters
you got and say "hey that's not what I was expecting". So the only solution
seems to me to be to know precisely what you are expecting to receive (known
character sequences), and what those sequences look like as byte sequences
in different encodings.
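
The invalid-byte-sequence check is the one part that is easy to automate.
A minimal sketch for UTF-8, assuming a runtime whose TextDecoder supports
the "fatal" option (the function name isValidUtf8 is made up):

```javascript
// Returns true if the bytes form well-formed UTF-8, false otherwise.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws on malformed input instead of
    // substituting U+FFFD.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}
```

Pure ASCII input passes this test too, so a "true" result does not prove
the sender used UTF-8; a "false" result does rule it out.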

I think the only way to do it right is to come up with some fixed strings in
various language scripts that you can pass as hidden parameters, examine the
bytes that come through, and look them up in a custom mapping table that
will deduce the charset based on the byte sequences received. I have not yet
taken the time to figure out what strings I can send that will unambiguously
identify each charset; it's been difficult enough just finding all the
references to what the charsets actually look like. If anyone has done this
already I'd like to hear about it. One problem with this approach is that if
the browser misinterprets the HTML document's charset, the bytes for your
magic strings may have gone through 2 layers of corruption by the time they
get to your application.
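
A toy version of such a mapping table, with a single probe character
(U+20AC, the euro sign) and only three charsets. The table contents, the
choice of probe, and the function name are illustrative assumptions; a real
table would need several probe strings, since for instance windows-1250
also encodes the euro sign as 0x80.

```javascript
// Byte sequences the probe character U+20AC arrives as in a few charsets.
const EURO_PROBE = [
  ["utf-8", [0xe2, 0x82, 0xac]],
  ["windows-1252", [0x80]],
  ["iso-8859-15", [0xa4]],
];

// Compare the bytes received in the hidden probe field against the table.
// Returns the matching charset name, or null if nothing matches.
function charsetFromProbe(received) {
  const match = EURO_PROBE.find(([, expected]) =>
    expected.length === received.length &&
    expected.every((byte, i) => byte === received[i]));
  return match ? match[0] : null;
}
```

A null result covers both charsets missing from the table and probes that
were corrupted on the way, such as a "?" substituted by the browser.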

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
My XML/XSL resources: http://www.skew.org/xml/


