Re: How-To handle i18n when you don't know charset?

From: Jim Saunders (wjs@netscape.com)
Date: Thu Jul 06 2000 - 15:40:12 EDT


Leon Spencer wrote:

> If I'm dealing with e-mail (POP3 and SMTP), do I necessarily
> want to respond to the user in the same charset as their original
> message to me? So far, I've convinced myself of No.

That is more of a user-behavior issue, and I don't have a strong opinion
on it. My comments below were focused on "knowing" what the encoding is.
Some protocols explicitly state the encoding, which makes it a non-issue,
and there are other options, such as allowing the user to set the
encoding, which is also viable in some applications.

>
>
> Leon
>
> > -----Original Message-----
> > From: wjs@netscape.com [mailto:wjs@netscape.com]
> > Sent: Thursday, July 06, 2000 12:12 PM
> > To: Unicode List
> > Cc: Unicode List
> > Subject: Re: How-To handle i18n when you don't know charset?
> >
> >
> > Here are some general guidelines you might like to consider :-
> >
> > o Have the UI layer pass a tag identifying the character encoding,
> > unless the UI layer maps the data to one of the Unicode
> > representations (UTF-8, UTF-16) before passing the data on.
> >
> > o Have the UI layer pass a tag identifying the locale
> > (language+region). You'll need this if your back end does any
> > locale-sensitive operations such as sorting; it is independent of
> > the encoding issue.
> >
> > o Have all the generated pages include a META charset tag in the
> > HTML header. This will ensure that the browser(s) submit form post
> > data in the same encoding as the original HTML page. This may be the
> > source of your original problem.
> >
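
A minimal sketch of the first guideline in Java, assuming a modern JDK
(java.nio.charset), which postdates this thread. The class name
TaggedText, the method decode, and the ISO-8859-1 fallback are
illustrative choices of mine, not an existing API:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

final class TaggedText {
    // Decode raw bytes using the charset label supplied by the UI layer.
    // ISO-8859-1 is used only when no usable label was supplied
    // (an assumed default, mirroring the original poster's fallback).
    static String decode(byte[] raw, String charsetLabel) {
        Charset cs = StandardCharsets.ISO_8859_1;
        if (charsetLabel != null) {
            try {
                cs = Charset.forName(charsetLabel.trim());
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Malformed or unknown label: keep the fallback.
            }
        }
        return new String(raw, cs);
    }
}
```

With the tag present, the bytes C3 B1 decode to a single n-tilde under
"UTF-8"; without it, the fallback turns the same bytes into two
characters, which is the kind of garbage described in the original
question.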
> > Jim
> >
> > Mike Brown wrote:
> >
> > > > What is the best way to handle i18n when you are passed a string
> > > > and you don't know the charset? I assume iso-8859-1 when I don't
> > > > know the charset BUT on some Spanish environments my data is coming
> > > > out garbage. It seems some of the characters are coming from
> > > > iso-8859-2 (at least that's my first look).
> > > >
> > > > My component that handles processing of the string data is separate
> > > > from the GUI where the user enters the data. The UI doesn't pass me
> > > > any charset information.
> > >
> > > Is the GUI collecting data through an HTML form? Browsers are
> > > intentionally disregarding the recommendations and sending form data
> > > without charset information "to keep old scripts from breaking".
> > > That's the argument I heard, anyway. I'm fighting this battle myself.
> > > What is the receiving end to do? I can tell you what I came up with.
> > >
> > > If the browser is IE4 or IE5, there is an undocumented MS DHTML
> > > property, document.charset, which will tell you what charset the
> > > browser used to interpret the bytes of the HTML document, and in
> > > IE4/IE5's case, this will also be the charset used in the form
> > > submission. Here's an HTML snippet I use to report what the browser
> > > is assuming the HTML document's charset is:
> > >
> > > <script type="text/javascript"><!--
> > > // document.charset and document.defaultCharset are IE-specific.
> > > if (navigator.userAgent.toLowerCase().indexOf("msie") != -1 &&
> > >     parseInt(navigator.appVersion, 10) >= 4) {
> > >   document.write(
> > >     "<p><tt>Since you are using IE 4.0 or higher and do not have " +
> > >     "scripting disabled, I can tell that this generated HTML " +
> > >     "document is being interpreted by the browser as <u>" +
> > >     document.charset + "</u> and that the browser's default " +
> > >     "encoding happens to be <u>" + document.defaultCharset +
> > >     "</u>.</tt></p>");
> > > }
> > > //--></script>
> > >
> > > In theory you could pass this as a hidden parameter in the form
> > > dataset, and the receiving application could then look for it.
> > > However, this requires being able to re-scan the bytes in the form
> > > data part of the HTTP message so that they can be properly
> > > interpreted, so a typical one-pass HTTP servlet will not suffice. I'm
> > > not sure how it works in IE3, although I read that the charset for
> > > form data submissions is determined by the OS's locale in that
> > > browser. Netscape Navigator 4.x is no better. I haven't tested
> > > Mozilla.
> > >
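
The two-pass idea above could look roughly like this in Java. This is
only a sketch under assumptions: the hidden field name "charset" is
hypothetical, the input is the still percent-encoded form body (which
you would have to read yourself rather than through the servlet's
parameter parsing), and URLDecoder.decode(String, Charset) needs Java 10
or later. Malformed percent escapes are not handled here.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class TwoPassFormDecoder {
    // Hypothetical name of the hidden field carrying document.charset.
    static final String CHARSET_FIELD = "charset";

    static Map<String, String> decode(String rawBody) {
        // Pass 1: charset labels are ASCII, so ISO-8859-1 reads them safely.
        String declared =
            parse(rawBody, StandardCharsets.ISO_8859_1).get(CHARSET_FIELD);
        // Pass 2: re-read every field with the declared charset, if usable.
        return parse(rawBody, resolve(declared));
    }

    private static Charset resolve(String label) {
        try {
            if (label != null && Charset.isSupported(label)) {
                return Charset.forName(label);
            }
        } catch (IllegalCharsetNameException e) {
            // Malformed label: fall through to the default.
        }
        return StandardCharsets.ISO_8859_1;
    }

    private static Map<String, String> parse(String rawBody, Charset cs) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : rawBody.split("&")) {
            if (pair.isEmpty()) continue;
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            fields.put(URLDecoder.decode(name, cs), URLDecoder.decode(value, cs));
        }
        return fields;
    }
}
```

The result is only as trustworthy as the hidden field itself: if the
browser does not fill it in, you are back to the default guess.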
> > > Regardless of the browser, you could also examine the Accept-Language
> > > HTTP header, take its highest-priority value, and map that to a
> > > *likely* charset by relying on your environment's Locale resource
> > > bundles (Java Servlet Programming, pages 380-394) and a table of
> > > fallback mappings. However, this approach makes some really bad
> > > assumptions and is at best a stab in the dark.
> > >
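
For illustration, the Accept-Language guess might be sketched as below.
The mapping table is entirely my own assumption (a real one would be
larger and just as unreliable), and the class and method names are
hypothetical. Locale.LanguageRange.parse is a standard JDK call (Java 8
and later) that returns the ranges sorted by descending weight; Map.of
needs Java 9 or later.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class LikelyCharset {
    // Illustrative fallback table: each entry is a guess about what
    // clients using that language tend to send, not a guarantee.
    private static final Map<String, String> BY_LANGUAGE = Map.of(
        "es", "ISO-8859-1",
        "pl", "ISO-8859-2",
        "el", "ISO-8859-7",
        "ru", "KOI8-R");

    // Map the highest-priority language in the header to a likely charset.
    static String guess(String acceptLanguage, String fallback) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return fallback;
        }
        List<Locale.LanguageRange> ranges;
        try {
            ranges = Locale.LanguageRange.parse(acceptLanguage);
        } catch (IllegalArgumentException e) {
            return fallback; // header did not parse
        }
        if (ranges.isEmpty()) {
            return fallback;
        }
        // Ranges come back lower-cased, e.g. "es-mx"; keep the language part.
        String language = ranges.get(0).getRange().split("-")[0];
        return BY_LANGUAGE.getOrDefault(language, fallback);
    }
}
```

The "ru" row shows why this is a stab in the dark: Russian text of that
era could just as plausibly arrive as windows-1251 or ISO-8859-5.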
> > > Some applications just outright put a select box in the form and
> > > rely on the user to pick the language they're using. This still
> > > makes some assumptions, though, because as you pointed out with
> > > Spanish, there's not always a single charset for each language.
> > >
> > > > Since I'm in a Java environment, isn't there a way to go
> > > > to UTF-8 and from UTF-8 determine the corresponding ISO
> > > > (and other) charset?
> > >
> > > No, there's nothing special about UTF-8 in this instance. You're
> > > dealing with a mystery sequence of bytes. You know they represent
> > > characters, but you don't know how the mappings work. Is it a
> > > one-to-one mapping of bytes to characters, or are some bytes taken 2,
> > > 3 or 4 at a time? You don't even know that much. Which bytes or byte
> > > sequences map to which characters? UTF-8 is a charset that maps 1 to
> > > 6 bytes to a character; ISO-8859-x is a charset that maps 1 byte to a
> > > character. (Before someone corrects me, I'm using the definition of
> > > charset as per UTR #17, and yes, I realize that charsets have bytes
> > > that map to non-characters.)
> > >
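
To make the "mystery sequence of bytes" point concrete, here is a small
Java sketch that decodes one byte array under several assumed charsets.
The class and method names are hypothetical; the charset names are
standard ones I expect a current JDK to support.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class SameBytes {
    // Decode the same bytes once per candidate charset and collect the
    // results, keyed by charset name in the order given.
    static Map<String, String> decodeAll(byte[] raw, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            results.put(name, new String(raw, Charset.forName(name)));
        }
        return results;
    }
}
```

Given the two bytes C3 B1, "UTF-8" yields one character (U+00F1),
"ISO-8859-1" yields two (U+00C3 U+00B1), and "ISO-8859-2" yields a
different two (U+0102 U+0105). Nothing in the bytes says which reading
was intended.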
> > > Once you assume a charset, the only way you're going to know whether
> > > it was the right choice, aside from recognizing invalid byte
> > > sequences for certain charsets like UTF-8 and UTF-16[BE/LE], is when
> > > you look at the characters you got and say "hey, that's not what I
> > > was expecting". So the only solution seems to me to be to know
> > > precisely what you are expecting to receive (known character
> > > sequences), and what those sequences look like as byte sequences in
> > > different encodings.
> > >
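
The "recognizing invalid byte sequences" part can be checked
mechanically. A hedged Java sketch, using the standard CharsetDecoder
with error reporting turned on (the class and method names are
hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecode {
    // True if the bytes form a well-formed sequence in the named charset.
    // This can only rule a charset out; it cannot prove the guess right.
    static boolean isValid(byte[] raw, String charsetName) {
        try {
            Charset.forName(charsetName)
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A lone F1 byte is rejected as UTF-8 but accepted as ISO-8859-1, which
fits the point above: single-byte charsets accept nearly anything, so
passing this check says little about whether the guess was correct.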
> > > I think the only way to do it right is to come up with some fixed
> > > strings in various language scripts that you can pass as hidden
> > > parameters, examine the bytes that come through, and look them up in
> > > a custom mapping table that will deduce the charset based on the byte
> > > sequences received. I have not yet taken the time to figure out what
> > > strings I can send that will unambiguously identify each charset;
> > > it's been difficult enough just finding all the references to what
> > > the charsets actually look like. If anyone has done this already I'd
> > > like to hear about it. One problem with this approach is that if the
> > > browser misinterprets the HTML document's charset, the bytes for your
> > > magic strings may have gone through 2 layers of corruption by the
> > > time they get to your application.
> > >
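
The "magic string" lookup could be sketched as follows in Java. Rather
than a hand-built table, this version computes the expected bytes for a
known probe string under each candidate charset and compares them with
what arrived. The probe text, the candidate list, and all names here are
assumptions for illustration; it does nothing about the double
corruption problem mentioned above.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ProbeMatcher {
    // Return every candidate charset under which the known probe string
    // encodes to exactly the bytes that were received.
    static List<String> candidates(String probe, byte[] received,
                                   String... charsetNames) {
        List<String> matches = new ArrayList<>();
        for (String name : charsetNames) {
            Charset cs = Charset.forName(name);
            // Skip charsets that cannot represent the probe at all;
            // getBytes would silently substitute '?' for such characters.
            if (!cs.canEncode() || !cs.newEncoder().canEncode(probe)) {
                continue;
            }
            if (Arrays.equals(probe.getBytes(cs), received)) {
                matches.add(name);
            }
        }
        return matches;
    }
}
```

More than one name can come back: a probe of just U+00F1 arriving as the
byte F1 matches both ISO-8859-1 and ISO-8859-15, so a useful probe needs
characters on which the candidate charsets actually differ.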
> > > - Mike
> > > ____________________________________________________________________
> > > Mike J. Brown, software engineer at My XML/XSL resources:
> > > webb.net in Denver, Colorado, USA http://www.skew.org/xml/
> >



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:05 EDT