The Reality of Web Encoding Identification or Lack Thereof
Teruhiko "Kuro" Kurosaka - IONA Technologies
Encoding detection is key to building well-behaved, robust web applications. The HTTP and HTML standards specify several methods for identifying the character encoding and language of a web page: the charset parameter of the Content-Type HTTP header, the same information embedded in a META element with the HTTP-EQUIV attribute inside the HTML file, and the LANG attribute of the HTML element.
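As an illustration of the first two declaration methods, here is a minimal Python sketch (the function names and regular expressions are ours, not any standard API) that extracts the declared charset from a Content-Type header value and from a META HTTP-EQUIV element:

```python
import re

def charset_from_content_type(header_value):
    """Pull the charset parameter out of a Content-Type header value,
    e.g. 'text/html; charset=Shift_JIS' -> 'Shift_JIS'."""
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)', header_value, re.IGNORECASE)
    return m.group(1) if m else None

def charset_from_meta(html):
    """Pull the charset out of a META HTTP-EQUIV declaration such as
    <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP">.
    A crude regex for illustration, not a real HTML parser."""
    pattern = (r'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*'
               r'content\s*=\s*["\'][^"\']*charset\s*=\s*([\w.:-]+)')
    m = re.search(pattern, html, re.IGNORECASE)
    return m.group(1) if m else None
```

A well-behaved application would compare the two declarations when both are present and flag a mismatch as an inconsistency.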
Our research on existing web sites shows, however, that many web sites do not use these methods properly. We will present our analysis of what web sites are really doing: Which method is misused most often? Which is most reliable? We will also offer a few tips for detecting inconsistencies.
We will discuss how statistics-based encoding detection can be used to cope with this uncertainty on real-life web sites. Finally, we will discuss how to build and deploy a web site that uses the encoding identification methods properly.
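The idea behind statistics-based detection can be sketched as follows. This is a toy stand-in for a real detector (such as Mozilla's universal charset detector), assuming only that implausible encodings either fail to decode strictly or yield few characters in common Japanese ranges:

```python
def guess_japanese_encoding(data,
                            candidates=("utf-8", "shift_jis",
                                        "euc-jp", "iso-2022-jp")):
    """Score each candidate encoding by how plausibly the bytes decode.
    Real detectors use character-frequency statistics; as a crude proxy
    we use strict decodability plus the share of decoded characters that
    fall in ASCII, kana, or common kanji ranges."""
    best, best_score = None, -1.0
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are impossible in this encoding
        if not text:
            continue
        plausible = sum(
            1 for ch in text
            if ch < '\x80'                   # ASCII
            or '\u3040' <= ch <= '\u30ff'    # hiragana, katakana
            or '\u4e00' <= ch <= '\u9fff'    # common CJK ideographs
        ) / len(text)
        if plausible > best_score:
            best, best_score = enc, plausible
    return best
```

For example, the Shift_JIS bytes of a hiragana string decode as valid kana only under Shift_JIS, while decoding them as UTF-8 or EUC-JP fails outright, so the scorer picks shift_jis.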
The research focus is on Japanese web sites, but sample data from other locales will also be presented, and the conclusions can be applied to all languages and encodings.
International Unicode Conferences are organized by Global Meeting Services, Inc. (GMS).
5 July 2002, Webmaster