Re: How-To handle i18n when you don't know charset?

From: Jim Saunders (wjs@netscape.com)
Date: Thu Jul 06 2000 - 15:24:42 EDT


Here are some general guidelines you might like to consider :-

o Have the UI layer pass a tag identifying the character encoding unless the
UI layer maps the data to one of the Unicode representations (UTF-8, UTF-16)
before passing the data on.

o Have the UI layer pass a tag identifying the locale (language+region). This
is independent of the encoding issue; you'll need it if your back end does any
locale-sensitive operations such as sorting.

o Have all the pages generated include a META-CHARSET tag in the HTML
header. This should ensure that the browser(s) submit form post data in the
same encoding as the original HTML page. A missing tag may be the source of
your original problem.
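
As a rough illustration of the first two points, here is a minimal Java sketch
of a back end that receives raw bytes plus an encoding tag and a locale tag.
The class and method names (TaggedInput, decode, sortForLocale) are made up for
this example, and it assumes the UI layer sends a charset name Java recognizes
and a language tag such as "es-ES".

```java
import java.nio.charset.Charset;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical back-end helper: the UI hands over raw bytes plus two tags.
class TaggedInput {
    // Decode raw bytes using the encoding name supplied by the UI layer.
    // Charset.forName throws if the name is unknown to this JVM.
    static String decode(byte[] raw, String charsetTag) {
        return new String(raw, Charset.forName(charsetTag));
    }

    // Sort a copy of the items with the collation rules of the given locale.
    static String[] sortForLocale(String[] items, String languageTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        String[] copy = items.clone();
        Arrays.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) {
        // The same byte means different things under different encoding tags.
        byte[] raw = {(byte) 0xB1};
        System.out.println((int) decode(raw, "ISO-8859-1").charAt(0));
        System.out.println((int) decode(raw, "ISO-8859-2").charAt(0));
        System.out.println(Arrays.toString(
            sortForLocale(new String[] {"zeta", "nube", "alba"}, "es-ES")));
    }
}
```

The encoding tag only affects decode; the locale tag only affects sorting,
which is why the two have to be passed separately.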

Jim

Mike Brown wrote:

> > What is the best way to handle i18n when you are passed a string and
> > you don't know the charset? I assume iso-8859-1 when I don't know the
> > charset BUT on some Spanish environments my data is coming out
> > garbage. It seems some of the characters are coming from iso-8859-2
> > (at least that's my first look).
> >
> > My component that handles processing of the string data is separate
> > from the GUI where the user enters the data. The UI doesn't pass me
> > any charset information.
>
> Is the GUI collecting data through an HTML form? Browsers are intentionally
> disregarding the recommendations and sending form data without charset
> information "to keep old scripts from breaking". That's the argument I
> heard, anyway. I'm fighting this battle, myself. What is the receiving end
> to do? I can tell you what I came up with.
>
> If the browser is IE4 or IE5, there is an undocumented MS DHTML property,
> document.charset, which will tell you what charset the browser used to
> interpret the bytes of the HTML document, and in IE4/IE5's case, this will
> also be the charset used in the form submission. Here's an HTML snippet I
> use to report what the browser is assuming the HTML document's charset is:
>
> <script type="text/javascript"><!--
> // document.charset and document.defaultCharset are IE-specific properties.
> if ( navigator.userAgent.toLowerCase().indexOf("msie") != -1 &&
>      parseInt(navigator.appVersion, 10) >= 4 ) {
>   document.write( "<p><tt>Since you are using IE 4.0 or higher and do not" +
>     " have scripting disabled, I can tell that this generated HTML" +
>     " document is being interpreted by the browser as <u>" +
>     document.charset + "</u> and that the browser's default encoding" +
>     " happens to be <u>" + document.defaultCharset + "</u>.</tt></p>" );
> }
> //--></script>
>
> In theory you could pass this as a hidden parameter in the form dataset and
> then the receiving application can know to look for it. However this will
> require being able to re-scan the bytes in the form data part of the HTTP
> message so that they can be properly interpreted, so a typical one-pass HTTP
> servlet will not suffice. I'm not sure how it works in IE3 although I read
> that the charset for form data submissions will be determined by the OS's
> locale in that browser. Netscape Navigator 4.x is no better. Haven't tested
> Mozilla.
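
The two-pass idea described above might look roughly like the following Java
sketch. It is only an illustration under assumptions: the hidden field name
("cs" in the usage below), the class name, and the fallback charset are all
invented here, field names are assumed to be plain ASCII, and malformed %XX
escapes are not handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical decoder for an application/x-www-form-urlencoded body.
class TwoPassFormDecoder {
    // Undo %XX and '+' escaping but keep the result as bytes, not characters.
    static byte[] unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '+') {
                out.write(' ');
            } else if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c); // assumed to be an ASCII character
            }
        }
        return out.toByteArray();
    }

    static Map<String, String> parse(String body, String charsetField, String fallback) {
        // Pass 1: split into fields and keep every value as raw bytes.
        Map<String, byte[]> rawValues = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) continue;
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            rawValues.put(new String(unescape(name), StandardCharsets.US_ASCII),
                          unescape(value));
        }
        // Pass 2: read the charset tag (ASCII), then decode all values with it.
        byte[] tag = rawValues.get(charsetField);
        String charsetName = tag != null
            ? new String(tag, StandardCharsets.US_ASCII) : fallback;
        Charset charset = Charset.forName(charsetName);
        Map<String, String> decoded = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : rawValues.entrySet()) {
            decoded.put(e.getKey(), new String(e.getValue(), charset));
        }
        return decoded;
    }

    public static void main(String[] args) {
        Map<String, String> form =
            parse("city=M%E1laga&cs=ISO-8859-1", "cs", "UTF-8");
        System.out.println(form.get("city").length()); // 6 characters
    }
}
```

The point is that nothing is turned into characters until the tag has been
read, which is what a one-pass parser cannot do.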
>
> Regardless of the browser, you could also examine the Accept-Language HTTP
> header, take the highest-priority value in it, and map that to a *likely*
> charset by relying on your environment's Locale resource bundles (Java
> Servlet Programming, pages 380-394) and a table of fallback mappings.
> However, this approach makes some really bad assumptions and is at best a
> stab in the dark.
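
For what it's worth, the Accept-Language guess could be sketched in Java along
these lines. The mapping table is an illustrative assumption, not an
authoritative list (Russian alone has several charsets in common use), and the
class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical "stab in the dark": Accept-Language -> likely charset.
class LikelyCharset {
    // Illustrative fallback table only; real deployments would need their own.
    static final Map<String, String> FALLBACK = new HashMap<>();
    static {
        FALLBACK.put("es", "ISO-8859-1");
        FALLBACK.put("pl", "ISO-8859-2");
        FALLBACK.put("ru", "ISO-8859-5");
        FALLBACK.put("ja", "Shift_JIS");
    }

    // Return the language range with the highest q value, or null if none.
    static String preferredLanguage(String header) {
        if (header == null) return null;
        String best = null;
        double bestQ = -1.0;
        for (String part : header.split(",")) {
            String[] pieces = part.trim().split(";");
            String range = pieces[0].trim();
            if (range.isEmpty()) continue;
            double q = 1.0; // a missing q parameter means q=1
            for (int i = 1; i < pieces.length; i++) {
                String p = pieces[i].trim();
                if (p.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(p.substring(2));
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat an unreadable weight as lowest priority
                    }
                }
            }
            if (q > bestQ) {
                bestQ = q;
                best = range;
            }
        }
        return best;
    }

    // Map the preferred language's primary subtag to a guessed charset.
    static String guess(String header, String defaultCharset) {
        String lang = preferredLanguage(header);
        if (lang == null) return defaultCharset;
        String primary = lang.toLowerCase(Locale.ROOT).split("-")[0];
        return FALLBACK.getOrDefault(primary, defaultCharset);
    }

    public static void main(String[] args) {
        System.out.println(guess("en;q=0.5, pl", "ISO-8859-1"));
    }
}
```

Note that the header says what language the user prefers to read, not what
encoding the browser used to send the form, so this stays a guess.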
>
> Some applications just outright put a select box in the form and rely on the
> user to pick the language they're using. This still makes some assumptions,
> though, because as you pointed out with Spanish, there's not always a single
> charset for each language.
>
> > Since I'm in a Java environment, isn't there a way to go
> > to UTF-8 and from UTF-8 determine the corresponding ISO
> > (and other) charset?
>
> No, there's nothing special about UTF-8 in this instance. You're dealing
> with a mystery sequence of bytes. You know they represent characters, but
> you don't know how the mappings work. Is it a one-to-one mapping of bytes to
> characters, or are some bytes taken 2, 3 or 4 at a time? You don't even know
> that much. Which bytes or byte sequences map to which characters? UTF-8 is a
> charset that maps 1 to 6 bytes to a character; ISO-8859-x is a charset that
> maps 1 byte to a character. (Before someone corrects me, I'm using the
> definition of charset as per UTR #17, and yes, I realize that charsets have
> bytes that map to non-characters.)
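
To make the "mystery sequence of bytes" point concrete, here is a small Java
sketch that decodes the same two bytes under three charsets. The class and
helper names are invented for the example; the byte values were picked only
because they are legal in all three encodings.

```java
import java.nio.charset.Charset;

// Hypothetical demo: one byte sequence, three different readings.
class MysteryBytes {
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Render a string as a list of code units such as "U+00C5 U+00B1".
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] mystery = {(byte) 0xC5, (byte) 0xB1};
        // Two characters in each single-byte charset, but different ones.
        System.out.println(describe(decode(mystery, "ISO-8859-1"))); // U+00C5 U+00B1
        System.out.println(describe(decode(mystery, "ISO-8859-2"))); // U+0139 U+0105
        // One character when the bytes are taken two at a time as UTF-8.
        System.out.println(describe(decode(mystery, "UTF-8")));      // U+0171
    }
}
```

All three decodes succeed without complaint, so nothing in the bytes
themselves says which reading was intended.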
>
> Once you assume a charset, the only way you're going to know whether it was
> the right choice, aside from recognizing invalid byte sequences for certain
> charsets like UTF-8 and UTF-16[BE/LE], is when you look at the characters
> you got and say "hey that's not what I was expecting". So the only solution
> seems to me to be to know precisely what you are expecting to receive (known
> character sequences), and what those sequences look like as byte sequences
> in different encodings.
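
The one mechanical check mentioned above, recognizing invalid byte sequences,
can be sketched in Java with a decoder configured to report errors instead of
silently substituting a replacement character. The class and method names are
assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical check: can these bytes be decoded at all in a given charset?
class StrictDecode {
    static boolean isValid(byte[] bytes, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // the bytes are not legal in this charset
        }
    }

    public static void main(String[] args) {
        byte[] lone = {(byte) 0xF1}; // a lead byte with nothing following it
        System.out.println(isValid(lone, "UTF-8"));      // false
        System.out.println(isValid(lone, "ISO-8859-1")); // true
        System.out.println(isValid(lone, "ISO-8859-2")); // true
    }
}
```

This can rule UTF-8 out, but it cannot choose between ISO-8859-1 and
ISO-8859-2, since every byte is acceptable input to a single-byte decoder of
that kind.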
>
> I think the only way to do it right is to come up with some fixed strings in
> various language scripts that you can pass as hidden parameters, examine the
> bytes that come through, and look them up in a custom mapping table that
> will deduce the charset based on the byte sequences received. I have not yet
> taken the time to figure out what strings I can send that will unambiguously
> identify each charset; it's been difficult enough just finding all the
> references to what the charsets actually look like. If anyone has done this
> already I'd like to hear about it. One problem with this approach is that if
> the browser misinterprets the HTML document's charset, the bytes for your
> magic strings may have gone through 2 layers of corruption by the time they
> get to your application.
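
A minimal Java sketch of the fixed-string lookup might look like this. It
assumes the probe string and the list of candidate charsets are chosen by the
application; the class name, the probe, and the candidates below are all
hypothetical, and it says nothing about what a browser does with characters
its charset cannot represent.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical lookup: which candidate charsets turn the probe into these bytes?
class ProbeLookup {
    static List<String> matches(String probe, byte[] received, String... candidates) {
        List<String> hits = new ArrayList<>();
        for (String name : candidates) {
            Charset cs = Charset.forName(name);
            // Skip charsets that cannot represent the probe at all.
            if (!cs.newEncoder().canEncode(probe)) continue;
            if (Arrays.equals(probe.getBytes(cs), received)) hits.add(name);
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] candidates = {"ISO-8859-1", "ISO-8859-2", "UTF-8"};

        // a-acute alone is byte 0xE1 in both Latin-1 and Latin-2: ambiguous.
        System.out.println(matches("\u00E1", new byte[] {(byte) 0xE1}, candidates));

        // Adding l-with-stroke, which Latin-1 lacks, narrows it to one answer.
        System.out.println(matches("\u00E1\u0142",
            new byte[] {(byte) 0xE1, (byte) 0xB3}, candidates));
    }
}
```

The first call returns two candidates and the second only one, which shows why
the probe strings have to be picked so that no two charsets of interest encode
them identically.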
>
> - Mike
> ____________________________________________________________________
> Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
> My XML/XSL resources: http://www.skew.org/xml/


