RE: Unicode on a website

From: addison@inter-locale.com
Date: Fri Sep 22 2000 - 13:20:32 EDT


There is a furious debate about whether it is time to send UTF-8 all the
way to the browser. What Tim has outlined is exactly the problem: Netscape
on Windows and older IE browsers do not use the correct font for
non-Western European languages in their default configuration. Mozilla and
IE5 work if the user has an appropriate array of fonts installed (there is
no need for a "Unicode" font, but it helps a lot).

In terms of performance, James Kass was partially right: generally
speaking, Unicode requires more storage than non-Unicode legacy character
sets. But the expansion is not as drastic as he makes out.

Your putative database is probably only about 10-20% text. If you are
using MS SQL Server as your database, the storage requirement for that
text doubles for Western European languages and increases by about 8% (of
the 10 to 20%) for Asian languages. If you are using a database like
Oracle that uses UTF-8 as its internal encoding, the increase is much
smaller for European languages, a little more for non-Latin alphabets, and
about 50% for CJK languages.
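
To make those ratios concrete, here is a minimal sketch that measures them
for a few short strings. The sample sentences and the choice of legacy code
pages are my own assumptions for illustration, and "utf-16-le" stands in for
a two-bytes-per-character store of the kind SQL Server uses for Unicode
columns; real data will give different numbers.

```python
# Sample strings and legacy code pages are illustrative assumptions,
# not data from this thread.
SAMPLES = {
    "French": ("Où est la bibliothèque ?", "cp1252"),
    "Russian": ("Где находится библиотека?", "cp1251"),
    "Japanese": ("図書館はどこですか", "shift_jis"),
}


def expansion(text, legacy, target):
    """Return the size in `target` divided by the size in `legacy`."""
    return len(text.encode(target)) / len(text.encode(legacy))


for language, (text, legacy) in SAMPLES.items():
    utf16 = expansion(text, legacy, "utf-16-le")  # two bytes per BMP character
    utf8 = expansion(text, legacy, "utf-8")       # one to three bytes here
    print(f"{language:9s} {legacy:10s} UTF-16 x{utf16:.2f}  UTF-8 x{utf8:.2f}")
```

With these particular samples the Japanese string comes out at exactly 1.5
times its Shift_JIS size in UTF-8, which is the "about 50%" figure for CJK.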

In real terms, though, your website is:

1. About 40% HTML tags, which do not expand in UTF-8 (or in a page
template), regardless of language.
2. Made up of pages which have a finite size, so the performance hit is
measurable, but probably undetectable (from a user point of view) on any
specific page.
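
The arithmetic behind point 1 can be sketched as follows. The function name
is hypothetical, and the sketch assumes that the markup is pure ASCII (so it
does not grow at all) and that nothing is compressed on the wire.

```python
def page_growth(markup_fraction, text_expansion):
    """Fractional growth of a whole page when only its text expands.

    markup_fraction: share of the page that is ASCII markup (0.40 = 40%).
    text_expansion:  size ratio of the text after conversion (1.5 = +50%).
    """
    text_fraction = 1.0 - markup_fraction
    return text_fraction * (text_expansion - 1.0)


# A page that is 40% tags, with CJK text growing by 50% in UTF-8,
# grows by roughly 30% overall rather than 50%.
print(f"{page_growth(0.40, 1.5):.0%}")
```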

Even if you have to move enormous amounts of data, you will benefit
greatly from using Unicode, because it allows you to reliably store and
retrieve data, regardless of what character set and language you are
actually delivering/receiving on the wire. You can write code that assumes
that one-and-only-one character set is in use internally and you won't
have to manage a whole bunch of (memory and processor-cycle hungry) code
set converters.
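
As a rough sketch of that "one character set internally" pattern: decode
once where bytes enter, work on Unicode text everywhere inside, and encode
once where bytes leave. The function names and the KOI8-R example are
assumptions for illustration only, not anything from a particular product.

```python
def receive(raw_bytes, declared_charset):
    """Decode incoming bytes once, at the edge of the application."""
    return raw_bytes.decode(declared_charset)


def word_count(text):
    """Internal logic sees only Unicode text, never a legacy charset."""
    return len(text.split())


def send(text, client_charset="utf-8"):
    """Encode once on the way out; unencodable characters become '?'."""
    return text.encode(client_charset, errors="replace")


# Hypothetical request body arriving in a legacy Russian encoding.
incoming = "Привет мир".encode("koi8_r")
text = receive(incoming, "koi8_r")
print(word_count(text))
print(send(text))
```

Only receive() and send() need to know about legacy encodings; everything in
between can assume a single representation.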

In my experience, you will gain more performance from known scalability
tactics (such as caching and load balancing) than you will lose to the
data storage issues associated with Unicode, and your site will work in
different languages and countries without your having to modify your data
storage scheme or convert all of your databases.

I hope this helps.

Best Regards,

Addison



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:13 EDT