Re: UTF-8 code in HTML

From: addison@globalsight.com
Date: Wed Apr 12 2000 - 11:38:16 EDT


A few points on this discussion.

First, UTF-16 and UTF-32 are not good choices for the delivery of HTML pages. Due to the byte-ordering problem (and the fact that they are NOT supersets of ASCII), they are unlikely to receive wide support or be widely used on the 'Net. UTF-8 is the recognized standard for delivering (Unicode) HTML (and just about everything else on the Internet), and I doubt that this will change in the short term.
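
A quick Python sketch, purely for illustration (not anything a browser or server actually runs), encoding the same HTML fragment three ways; UTF-8 leaves the ASCII markup bytes untouched, while the UTF-16 forms double every byte and differ between the two byte orders:

# The same fragment in three encodings.
text = "<p>café</p>"

print(text.encode("utf-8"))      # b'<p>caf\xc3\xa9</p>'    ASCII bytes intact
print(text.encode("utf-16-be"))  # b'\x00<\x00p\x00>...'    two bytes per character
print(text.encode("utf-16-le"))  # b'<\x00p\x00>\x00...'    same data, other byte order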

Second, Markus is "right" about HTML being "self-describing", but this overlooks a few things. If your page contains only characters from the Latin-1 repertoire, then using UTF-8 will work correctly with about 97% of browsers as they are configured at install time. So UTF-8 is a fine choice for serving pages that could just as well be encoded in ISO 8859-1. But if you want to use multiple scripts in a single page (say, Polish and Japanese), then things begin to break down.

The fact that the browser can correctly decode the UTF-8 is not at issue. The problem is that IE4 and NN4 allow only one font to be associated with the UTF-8 encoding (in the user interface)... and the default is a Latin-1 font. My hypothetical Polish-Japanese page shows a few "black squares" in the Polish text and nothing but black squares in the Japanese. A Russian-Japanese page would be entirely black squares.

Now, I know that I need to change fonts, but most users either don't know what to do or are annoyed by it. Commercial sites therefore do not typically serve UTF-8, even though many of them use Unicode encodings on the back end for data storage.

It is partially possible to overcome the aforementioned problems by using FONT tags and CSS, but CSS implementations and support differ (and complex or bidi scripts have further problems). So, at this point, I am intrigued by the increasing market share of IE5 and the possibility of a commercial release of Mozilla as Netscape 5.0. The growing market share of these products could mean that the majority of the Web audience will be able to view Unicode-encoded pages outside the Latin-1 range (or multiscript pages) effectively sometime next year.
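
For what it's worth, here is a rough sketch of the CSS side of that workaround, written out from Python just for illustration; the font names are placeholders, not real font recommendations, and readers would still need suitable fonts installed:

# Mark each run of text with a lang attribute and suggest a font per
# language via the :lang() selector. Font names below are invented.
html = """<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style>
  :lang(pl) { font-family: "SomeLatin2Font", serif; }
  :lang(ja) { font-family: "SomeJapaneseFont", serif; }
</style></head><body>
<p><span lang="pl">Zażółć gęślą jaźń</span> <span lang="ja">日本語</span></p>
</body></html>"""
with open("multiscript.html", "w", encoding="utf-8") as f:
    f.write(html)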

It makes sense to use UTF-8 to assemble Web pages on the back end (from your content management system, database, and static pages) and to translate the page to the target character set for the page's locale *if* the browser is older (pre-5.0) or unrecognized. In the past I would have said: always translate the page.
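
As a sketch of that downgrade step (the downgrade() helper name is just for illustration), characters that do not exist in the target charset can be emitted as numeric character references:

# Convert an assembled UTF-8 page to a legacy charset for old browsers.
def downgrade(page_utf8_bytes, target_charset):
    text = page_utf8_bytes.decode("utf-8")
    return text.encode(target_charset, errors="xmlcharrefreplace")

page = "<p>Zażółć gęślą jaźń</p>".encode("utf-8")
print(downgrade(page, "iso-8859-2"))  # the Polish letters all fit Latin-2
print(downgrade(page, "iso-8859-1"))  # the ones Latin-1 lacks become &#...; references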

Third, automagic detection has limitations, and relying on it is poor design. It is possible to provide the server with per-directory and other configuration information to help it determine the charset of the pages it is serving (or to use UTF-8 for everything and provide an API module to translate the pages late in the cycle). It is a performance hog to make the poor server actually read all of the pages to find a (possibly non-existent) META tag.
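
Something like the following toy sketch is all the per-directory configuration amounts to (directory names and charsets here are invented): the charset comes from the server's own tables, so no file ever has to be opened and scanned for a META tag:

# Look the charset up from per-directory configuration, not from the file.
import posixpath

CHARSET_BY_DIR = {
    "/docs/pl/": "iso-8859-2",
    "/docs/ja/": "shift_jis",
}
DEFAULT_CHARSET = "utf-8"

def content_type_for(url_path):
    directory = posixpath.dirname(url_path) + "/"
    charset = CHARSET_BY_DIR.get(directory, DEFAULT_CHARSET)
    return "text/html; charset=" + charset

print(content_type_for("/docs/pl/index.html"))  # text/html; charset=iso-8859-2
print(content_type_for("/news/today.html"))     # text/html; charset=utf-8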

Thanks,

Addison


Sent by: Yves Arrouye <yves@realnames.com>
04/11/2000 09:05 PM

To: "Unicode List" <unicode@unicode.org>
cc:
bcc:
Subject: Re: UTF-8 code in HTML


>> If I have 3 HTML files side-by-side in a directory, one in UTF-8,
>> another in, say, big-endian Unicode, and a third in Shift-JIS, there
>> is no way they can be self-describing, because in order to parse the
>> HTML, you have to understand the encoding already.

Most commonly used encodings are supersets of ASCII, so a parser can
safely read as far as the meta tag that gives the content type; the meta
tag itself is pure ASCII. So for these encodings there is no extra work
for the parser, and the author can use the appropriate meta tag to make
the document self-describing.
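
As a rough illustration of that point (a deliberately naive sketch, not any particular browser's parser), such a sniffer only needs the ASCII-compatible bytes up to the meta tag:

# Search the raw bytes as ASCII before the real encoding is known.
import re

META_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_charset(head_bytes):
    match = META_RE.search(head_bytes)
    return match.group(1).decode("ascii") if match else None

head = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-2"></head>')
print(sniff_charset(head))  # iso-8859-2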

Today, very few people publish HTML documents encoded in UCS-2, UCS-4,
UTF-16 or UTF-32. When they do, the readers of these documents need to be
able to recognize these encodings since they are not supersets of ASCII.
Recognizing these is trivial, and some browsers do it. If you want to avoid
being at the mercy of the browser's recognition of encodings, UTF-8 is an
appropriate encoding that is a superset of ASCII.
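
A minimal sketch of that recognition (just an illustration; real browsers use further heuristics as well): UTF-16/UTF-32 files normally begin with a byte order mark that reveals both the code unit width and the byte order.

# The wider UTF-32 marks must be tested first, since the UTF-16 LE mark
# is a prefix of the UTF-32 LE one.
import codecs

def guess_unicode_encoding(first_bytes):
    if first_bytes.startswith(codecs.BOM_UTF32_LE):
        return "utf-32-le"
    if first_bytes.startswith(codecs.BOM_UTF32_BE):
        return "utf-32-be"
    if first_bytes.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if first_bytes.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if first_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8"
    return None

print(guess_unicode_encoding("<html>".encode("utf-16")))  # utf-16-le on little-endian machines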

Modern browsers implement some sort of automatic encoding detection anyway
(amazing, the number of Japanese Web pages without any charset information;
some even play tricks to force the recognition of a given charset: for
example, Yahoo! includes a comment with a byte sequence that only exists in
EUC-JP in order to "help" the browser recognize the encoding as such very
early). While such detection usually can't help you if you have files in
iso-8859-1, -2 and -15, it is still very useful. And if you indicate the
encoding of your files in the files themselves, you'll be very safe (still,
with the caveat above for 16- or 32-bit encodings today).
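
A crude sketch of what such detection amounts to: try candidate encodings in order and keep the first one that decodes cleanly. The candidate list here is made up, and the approach cannot tell iso-8859-1, -2 and -15 apart, since nearly any byte sequence is legal in all three, which is exactly the limitation above.

# Trial decoding against a fixed list of candidate encodings.
def guess_encoding(data, candidates=("ascii", "utf-8", "euc_jp", "shift_jis", "iso-8859-1")):
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding("日本語".encode("euc_jp")))  # euc_jp; ascii and utf-8 both fail on these bytes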

YA.


