The Reality of Web Encoding Identification or Lack Thereof
Teruhiko "Kuro" Kurosaka - IONA Technologies
Encoding detection is key to building well-behaved, robust web applications. The HTTP and HTML standards specify several methods for identifying the character encoding and language of a web page: the charset parameter of the Content-Type HTTP header, the same information embedded in a META element with the HTTP-EQUIV attribute inside the HTML file, and the LANG attribute of the HTML element.
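As an illustration of the first two declaration methods, here is a minimal Python sketch (the function names and regular expressions are ours, not any standard API) that extracts the declared charset from a Content-Type header value and from a META HTTP-EQUIV element:

```python
import re

def charset_from_content_type(header_value):
    """Pull the charset parameter out of a Content-Type header value,
    e.g. 'text/html; charset=Shift_JIS' -> 'Shift_JIS'."""
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)', header_value, re.IGNORECASE)
    return m.group(1) if m else None

def charset_from_meta(html):
    """Pull the charset out of a META HTTP-EQUIV declaration such as
    <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP">.
    A crude regex for illustration, not a real HTML parser."""
    pattern = (r'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*'
               r'content\s*=\s*["\'][^"\']*charset\s*=\s*([\w.:-]+)')
    m = re.search(pattern, html, re.IGNORECASE)
    return m.group(1) if m else None
```

A well-behaved application would compare the two declarations when both are present and flag a mismatch as an inconsistency.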
Our research on existing web sites shows, however, that many web sites do not use these methods properly. We will present our analysis of what web sites are really doing: Which method is misused most often? Which is most reliable? We will also offer a few tips for detecting inconsistencies.
We will discuss how statistics-based encoding detection can be used to cope with this uncertainty on real-life web sites. Finally, we will discuss how to build and deploy a web site that uses the encoding identification methods properly.
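The idea behind statistics-based detection can be sketched as follows. This is a toy stand-in for a real detector (such as Mozilla's universal charset detector), assuming only that implausible encodings either fail to decode strictly or yield few characters in common Japanese ranges:

```python
def guess_japanese_encoding(data,
                            candidates=("utf-8", "shift_jis",
                                        "euc-jp", "iso-2022-jp")):
    """Score each candidate encoding by how plausibly the bytes decode.
    Real detectors use character-frequency statistics; as a crude proxy
    we use strict decodability plus the share of decoded characters that
    fall in ASCII, kana, or common kanji ranges."""
    best, best_score = None, -1.0
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are impossible in this encoding
        if not text:
            continue
        plausible = sum(
            1 for ch in text
            if ch < '\x80'                   # ASCII
            or '\u3040' <= ch <= '\u30ff'    # hiragana, katakana
            or '\u4e00' <= ch <= '\u9fff'    # common CJK ideographs
        ) / len(text)
        if plausible > best_score:
            best, best_score = enc, plausible
    return best
```

For example, the Shift_JIS bytes of a hiragana string decode as valid kana only under Shift_JIS, while decoding them as UTF-8 or EUC-JP fails outright, so the scorer picks shift_jis.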
The research focus is on Japanese web sites, but sample data from other locales will also be presented, and the conclusions can be applied to all languages and encodings.
International Unicode Conferences are organized by Global Meeting Services, Inc. (GMS).
5 July 2002, Webmaster