Re: It was IE (re: Unicode SGML entities in application/x-www-form-urlencoded)

From: Adrian Havill (havill@threeweb.ad.jp)
Date: Mon Apr 20 1998 - 02:29:25 EDT


I wrote:
>On a U.S. version, both the "Nihongo" and the macroned "A" will get
>SGML entitified (into three entities), because neither is in the "native"
>character set.

Chris Wendt replied:
> This part is not correct analysis. The encoding of the FORM data does not
> depend on the language of the browser or the language of the OS but on the
> charset that IE recognizes the page to be encoded in.

My error (embarrassing typo: "iso-8859-1", not "iso-8869-1"). Apologies for the
inaccurate post.

> If the submission is in Shift-JIS, some of the Latin1 8-bit
> characters get remapped to the "nearest" us-ascii, unaccented character.
> Honestly I don't know whether it would be better to always and consistently
> retain the true Latin1 character in &#nnnnn; notation. I personally tend in
> your direction, better retain the original values.
> I am willing to hear other opinions on this.

Parsers (like ours) that use accented characters to autoconvert Latin text to
kana would then be unable to use the Latin-1 accented "aeiou" to specify long
vowels in Japanese. (The romanization is actually ambiguous, but phonetic kana
is easier to read than romaji for Japanese.) Such a parser could instead use the
macroned vowels from Latin Extended-A, which are used in Hepburn-style romanized
Japanese, but many more systems support Latin-1 than Latin Extended-A.
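
To make the long-vowel point concrete, here is a minimal Python sketch of how a
parser might accept either the Latin-1 circumflex vowels or the macroned vowels
as a "long vowel" marker. The function name split_long_vowels is made up for
illustration; this is not our parser's actual code, and it does not attempt the
kana conversion itself.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # combining circumflex, e.g. the accent in Latin-1 "ô"
MACRON = "\u0304"      # combining macron, e.g. the accent in "ō"


def split_long_vowels(romaji):
    """Return (base_letter, is_long) pairs for a romanized string.

    Illustrative helper: a vowel counts as long when it carries either a
    circumflex or a macron.
    """
    pairs = []
    # Compose first so each accented vowel is a single character.
    for ch in unicodedata.normalize("NFC", romaji):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        is_long = base.lower() in "aeiou" and (
            CIRCUMFLEX in marks or MACRON in marks
        )
        pairs.append((base, is_long))
    return pairs
```

If the browser flattens the Latin-1 accented vowels to plain ASCII before
submission, both spellings of a word collapse to the same unmarked letters and
the length information is lost before a function like this ever sees it.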

> >is the sending of Unicode that's not in the ASCII subset as
> >SGML numeric generic entities in application/x-www-form-urlencoded
> >(which is not designed to handle anything other than ASCII) to be
> >expected from here on?

> HTML4 does not give a recommendation what to do with characters that still
> don't fit in _any_ of the FORM's accept-charset listed charsets or what to
> do if there are none listed.

Actually, if none are listed, I believe it defaults to "UNKNOWN", and the
user-agent is supposed to use the "...character set used to transmit the
document", which IE seems to do.
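
As I read it, the fallback amounts to something like the following Python
sketch. The function name candidate_charsets and its arguments are invented for
illustration, and treating a missing attribute the same as "UNKNOWN" is my
assumption about how a user agent would behave, not something I have checked
against any browser's source.

```python
import re


def candidate_charsets(accept_charset, document_charset):
    """Return the charsets a user agent might try for a form submission.

    accept_charset: the FORM's accept-charset attribute value, or None.
    document_charset: the charset the page itself was transmitted in.
    """
    value = (accept_charset or "UNKNOWN").strip()
    if value.upper() == "UNKNOWN":
        # Nothing listed: fall back to the document's own charset.
        return [document_charset]
    # The attribute is a space- and/or comma-separated list.
    return [name for name in re.split(r"[,\s]+", value) if name]
```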

> Safe, but inconvenient, would be if user agent
> prevents any input that would not fit. I am not guaranteeing that future
> versions of Internet Explorer will indeed prevent non-fitting input,

Please don't remove that ability outright, since that kind of restriction can be
applied by either client-side or server-side scripts.

=====

> so for the time being I do recommend [specifying UTF-8 as the FORM page's
> document charset]

I hope in the future I can use nothing but UTF-8, but there are still many 3rd
generation browsers out there that don't do Unicode, so this isn't an option for
stuff outside of the firewall yet.

> coding for the &#nnnnn; method or

This is a clever hack on IE's part to get non-ASCII Unicode into the
non-I18N-ready application/x-www-form-urlencoded and still remain somewhat
compatible with roughly 95% of CGI scripts.
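
For scripts that do want the escaped characters back, the server-side change
is small. Below is a minimal Python sketch, assuming the body arrives as a
percent-encoded string and that decimal &#nnnnn; references are the only
escapes in use. The names decode_form and NCR are mine, for illustration only.

```python
import re
from urllib.parse import parse_qs

# Decimal numeric character reference such as "&#26085;".
NCR = re.compile(r"&#([0-9]{1,7});")


def _expand(match):
    code = int(match.group(1))
    # Leave out-of-range values and surrogates untouched.
    if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        return match.group(0)
    return chr(code)


def decode_form(body, charset="iso-8859-1"):
    """Parse a form-urlencoded body, then expand &#nnnnn; references."""
    fields = parse_qs(
        body, keep_blank_values=True, encoding=charset, errors="replace"
    )
    return {
        name: [NCR.sub(_expand, value) for value in values]
        for name, values in fields.items()
    }
```

For example, decode_form("name=%26%2326085%3B%26%2326412%3B%26%2335486%3B")
yields the three characters of "Nihongo" under the key "name". One caveat: if
the browser sends a user-typed "&" the same way as the "&" of an escape, a user
who literally types "&#26085;" cannot be told apart from one who typed the
character, so the expansion is a best guess.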

However, it is a major change in behavior compared to third generation browsers,
and it _does_ break many CGI scripts designed for Japanese that relied on the
old behavior (receiving Shift-JIS/EUC/ISO-2022-JP no matter what the original
document was in). In particular, it breaks the "Japanese module" (jcode.pl) for
Perl, which is used in about 90% of all Perl-based Japanese CGI scripts. Anyone
attempting to enter Japanese into an ISO-8859-1 form will run into this
"feature".

=====

A lot of character-encoding code inside CGIs needs to be updated (a minor
update, but...), and I was wondering whether this feature will be appearing in
other browsers. In other words, will Unicode escaped in SGML entity form
(&#xxxxx;) become a standard way to extend application/x-www-form-urlencoded to
handle characters that don't "fit"? (That type really shouldn't be used once
multipart/form-data is universal, but it is in use now and won't go away until
all browsers can handle multipart/form-data.)


