RE: How-To handle i18n when you don't know charset?

From: Leon Spencer (Leon.Spencer@brightware.com)
Date: Thu Jul 06 2000 - 15:30:43 EDT


If I'm dealing with e-mail (POP3 and SMTP), do I necessarily
want to respond to the user in the same charset as their original
message to me? So far, I've convinced myself that the answer is no.

Leon

> -----Original Message-----
> From: wjs@netscape.com [mailto:wjs@netscape.com]
> Sent: Thursday, July 06, 2000 12:12 PM
> To: Unicode List
> Cc: Unicode List
> Subject: Re: How-To handle i18n when you don't know charset?
>
>
> Here are some general guidelines you might like to consider:
>
> o Have the UI layer pass a tag identifying the character encoding,
>   unless the UI layer maps the data to one of the Unicode
>   representations (UTF-8, UTF-16) before passing the data on.
>
> o Have the UI layer pass a tag identifying the locale
>   (language+region). You'll need this if your back end does any
>   locale-sensitive operations such as sorting; it is independent of
>   the encoding issue.
>
> o Have all generated pages include a META-CHARSET tag in the HTML
>   header. This will ensure that the browser(s) submit form post data
>   in the same encoding as the original HTML page. A missing tag may
>   be the source of your original problem.
>
> Jim
>
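A minimal sketch of the first guideline, assuming the UI layer hands over the raw bytes together with a charset tag. The class and method names (TaggedText, resolve, decode) are hypothetical, and falling back to ISO-8859-1 is only an illustration of "no tag was passed", not a recommendation.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper: the UI layer passes raw bytes plus a charset tag.
final class TaggedText {
    private TaggedText() {}

    /** Resolves the tag to a Charset; uses the fallback if the tag is missing or unknown. */
    static Charset resolve(String charsetTag, Charset fallback) {
        if (charsetTag == null || charsetTag.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(charsetTag.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    /** Decodes the bytes into a Java String using the tagged charset. */
    static String decode(byte[] raw, String charsetTag) {
        // Assumption for this sketch: untagged data is treated as ISO-8859-1.
        return new String(raw, resolve(charsetTag, StandardCharsets.ISO_8859_1));
    }
}
```

Once the data is a Java String, the back end no longer needs the tag; it only matters at the boundary where bytes become characters.
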
> Mike Brown wrote:
>
> > > What is the best way to handle i18n when you are passed a string
> > > and you don't know the charset? I assume iso-8859-1 when I don't
> > > know the charset BUT on some Spanish environments my data is coming
> > > out garbage. It seems some of the characters are coming from
> > > iso-8859-2 (at least that's my first look).
> > >
> > > My component that handles processing of the string data is
> > > separate from the GUI where the user enters the data. The UI
> > > doesn't pass me any charset information.
> >
> > Is the GUI collecting data through an HTML form? Browsers are
> > intentionally disregarding the recommendations and sending form data
> > without charset information "to keep old scripts from breaking".
> > That's the argument I heard, anyway. I'm fighting this battle,
> > myself. What is the receiving end to do? I can tell you what I came
> > up with.
> >
> > If the browser is IE4 or IE5, there is an undocumented MS DHTML
> > property, document.charset, which will tell you what charset the
> > browser used to interpret the bytes of the HTML document, and in
> > IE4/IE5's case, this will also be the charset used in the form
> > submission. Here's an HTML snippet I use to report what the browser
> > is assuming the HTML document's charset is:
> >
> > <script type="text/javascript"><!--
> > // IE 4+ only: document.charset and document.defaultCharset are
> > // Microsoft DHTML properties, not part of any standard.
> > if (navigator.userAgent.toLowerCase().indexOf("msie") != -1 &&
> >     parseInt(navigator.appVersion) >= 4) {
> >   document.write(
> >     "<p><tt>Since you are using IE 4.0 or higher and do not have " +
> >     "scripting disabled, I can tell that this generated HTML " +
> >     "document is being interpreted by the browser as <u>" +
> >     document.charset +
> >     "</u> and that the browser's default encoding happens to be <u>" +
> >     document.defaultCharset + "</u>.</tt></p>");
> > }
> > //--></script>
> >
> > In theory you could pass this as a hidden parameter in the form
> > dataset and then the receiving application can know to look for it.
> > However, this will require being able to re-scan the bytes in the
> > form data part of the HTTP message so that they can be properly
> > interpreted, so a typical one-pass HTTP servlet will not suffice. I'm
> > not sure how it works in IE3, although I read that the charset for
> > form data submissions will be determined by the OS's locale in that
> > browser. Netscape Navigator 4.x is no better. Haven't tested Mozilla.
> >
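The re-scan described above could look roughly like the sketch below. It assumes the raw application/x-www-form-urlencoded body is available as an ASCII string, and that the page copied document.charset into a hidden field; the field name "cs" and the class TwoPassFormDecoder are hypothetical. Pass 1 only looks for the charset field, pass 2 percent-decodes every field with that charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class TwoPassFormDecoder {
    private TwoPassFormDecoder() {}

    // Hypothetical name of the hidden field carrying the browser's charset.
    static final String CHARSET_FIELD = "cs";

    static Map<String, String> decode(String rawBody, String fallbackCharset) {
        String[] pairs = rawBody.split("&");

        // Pass 1: find the charset field. Charset names are plain ASCII,
        // so they can be read before the real charset is known.
        String charset = fallbackCharset;
        for (String pair : pairs) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(CHARSET_FIELD)) {
                String declared = percentDecode(pair.substring(eq + 1), "US-ASCII");
                if (!declared.isEmpty()) {
                    charset = declared;
                }
            }
        }

        // Pass 2: decode every name and value with the charset found above.
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : pairs) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            fields.put(percentDecode(name, charset), percentDecode(value, charset));
        }
        return fields;
    }

    private static String percentDecode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }
}
```

The hidden field is only as trustworthy as the browser that filled it in, so an unknown charset name is reported here rather than silently replaced.
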
> > Regardless of the browser, you could also examine the Accept-Language
> > HTTP header, take its highest-priority value, and map that to a
> > *likely* charset by relying on your environment's Locale resource
> > bundles (Java Servlet Programming, pages 380-394) and a table of
> > fallback mappings. However, this approach makes some really bad
> > assumptions and is at best a stab in the dark.
> >
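For illustration only, the Accept-Language guess might be sketched as follows. The class CharsetGuess and its four-entry table are hypothetical; the table is a toy and, as noted above, any such mapping is a guess because one language can be written in several charsets. Parsing uses java.util.Locale.LanguageRange, which assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class CharsetGuess {
    private CharsetGuess() {}

    // Illustrative fallback table (primary language subtag -> likely charset).
    private static final Map<String, String> LIKELY = new HashMap<>();
    static {
        LIKELY.put("es", "ISO-8859-1");
        LIKELY.put("pl", "ISO-8859-2");
        LIKELY.put("ru", "KOI8-R");
        LIKELY.put("ja", "Shift_JIS");
    }

    /** Returns a likely charset name for the header, or the fallback if none is known. */
    static String fromAcceptLanguage(String header, String fallback) {
        if (header == null || header.trim().isEmpty()) {
            return fallback;
        }
        List<Locale.LanguageRange> ranges;
        try {
            ranges = new ArrayList<>(Locale.LanguageRange.parse(header));
        } catch (IllegalArgumentException e) {
            return fallback; // malformed header
        }
        // Highest q-value first; the sort is stable, so ties keep header order.
        ranges.sort((a, b) -> Double.compare(b.getWeight(), a.getWeight()));
        for (Locale.LanguageRange range : ranges) {
            if (range.getWeight() <= 0) {
                continue; // q=0 means "not acceptable"
            }
            String primary = range.getRange().split("-")[0];
            String charset = LIKELY.get(primary);
            if (charset != null) {
                return charset;
            }
        }
        return fallback;
    }
}
```

The result is best used as a default to try first, not as a substitute for a real charset tag.
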
> > Some applications just outright put a select box in the form and rely
> > on the user to pick the language they're using. This still makes some
> > assumptions, though, because as you pointed out with Spanish, there's
> > not always a single charset for each language.
> >
> > > Since I'm in a Java environment, isn't there a way to go
> > > to UTF-8 and from UTF-8 determine the corresponding ISO
> > > (and other) charset?
> >
> > No, there's nothing special about UTF-8 in this instance. You're
> > dealing with a mystery sequence of bytes. You know they represent
> > characters, but you don't know how the mappings work. Is it a
> > one-to-one mapping of bytes to characters, or are some bytes taken 2,
> > 3 or 4 at a time? You don't even know that much. Which bytes or byte
> > sequences map to which characters? UTF-8 is a charset that maps 1 to
> > 6 bytes to a character; ISO-8859-x is a charset that maps 1 byte to a
> > character. (Before someone corrects me, I'm using the definition of
> > charset as per UTR #17, and yes, I realize that charsets have bytes
> > that map to non-characters.)
> >
> > Once you assume a charset, the only way you're going to know whether
> > it was the right choice, aside from recognizing invalid byte
> > sequences for certain charsets like UTF-8 and UTF-16[BE/LE], is when
> > you look at the characters you got and say "hey that's not what I
> > was expecting". So the only solution seems to me to be to know
> > precisely what you are expecting to receive (known character
> > sequences), and what those sequences look like as byte sequences in
> > different encodings.
> >
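Both halves of that point can be seen in a small sketch using a strict java.nio decoder. The helper name StrictDecode.tryDecode is hypothetical, and the sketch assumes the JDK in use ships ISO-8859-2 (common, but not one of the charsets every Java platform must support). A strict decoder rejects byte sequences that are invalid in UTF-8, while every single-byte ISO-8859-x charset accepts the same byte and simply yields a different character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecode {
    private StrictDecode() {}

    /** Returns the decoded text, or null if the bytes are not valid in that charset. */
    static String tryDecode(byte[] raw, String charsetName) {
        try {
            return Charset.forName(charsetName)
                    .newDecoder()
                    // Report errors instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }
}
```

For the single byte 0xF1, tryDecode returns null for UTF-8 (a lead byte with no continuation), U+00F1 (n with tilde) for ISO-8859-1, and U+0144 (n with acute) for ISO-8859-2. Only the UTF-8 case can be ruled out mechanically; choosing between the other two requires knowing what text was expected.
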
> > I think the only way to do it right is to come up with some fixed
> > strings in various language scripts that you can pass as hidden
> > parameters, examine the bytes that come through, and look them up in
> > a custom mapping table that will deduce the charset based on the byte
> > sequences received. I have not yet taken the time to figure out what
> > strings I can send that will unambiguously identify each charset;
> > it's been difficult enough just finding all the references to what
> > the charsets actually look like. If anyone has done this already I'd
> > like to hear about it. One problem with this approach is that if the
> > browser misinterprets the HTML document's charset, the bytes for your
> > magic strings may have gone through 2 layers of corruption by the
> > time they get to your application.
> >
> > - Mike
> > ____________________________________________________________________
> > Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
> > My XML/XSL resources: http://www.skew.org/xml/
>
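A rough sketch of the "fixed strings as hidden parameters" idea quoted above, under two assumptions that are not in the original message: the probe is written into the page as numeric character references, so the browser (not the server) chooses the bytes on submission, and the candidate charsets are ones the JDK can encode. The class CharsetProbe and its method names are hypothetical. Instead of a hand-built table, the expected bytes are computed by strictly encoding the probe in each candidate.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CharsetProbe {
    private CharsetProbe() {}

    /** Strictly encodes the text; returns null if the charset cannot represent it. */
    static byte[] encodeOrNull(String text, String charsetName) {
        try {
            ByteBuffer buf = Charset.forName(charsetName)
                    .newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Lists every candidate charset that would have produced exactly these bytes. */
    static List<String> candidatesFor(String probe, byte[] received, String... charsetNames) {
        List<String> hits = new ArrayList<>();
        for (String name : charsetNames) {
            byte[] expected = encodeOrNull(probe, name);
            if (expected != null && Arrays.equals(expected, received)) {
                hits.add(name);
            }
        }
        return hits;
    }
}
```

More than one name in the result means the probe is ambiguous for those charsets, which is the "unambiguously identify each charset" problem in practice; an empty result suggests corruption somewhere along the way or a charset missing from the candidate list.
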



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:05 EDT