Unicode and the Web
Q: My web page is in Latin-1 (ISO-8859-1).
So I don't need a charset declaration, right?
A: Wrong. You always need a charset declaration, even when
you are using Latin-1. To quote from the HTML specification:
The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as
a default character encoding when the "charset" parameter is absent
from the "Content-Type" header field. In practice, this recommendation
has proved useless because some servers don't allow a "charset"
parameter to be sent, and others may not be configured to send the
parameter. Therefore, user agents must not assume any default value
for the "charset" parameter. —
HTML 4.01
Thus you should always include a charset declaration
of the following form in the <head> element:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Your HTML editor will usually give you an easy way to do
this. For example, in Microsoft FrontPage you select File >
Properties > Language > HTML Encoding—Save the document as, and
pick the encoding you want.
Q: How should I encode international
characters in URLs?
A: See
http://www.w3.org/TR/charmod/#URIs
Q: The appearance of some of the pages on
the Unicode site is flawed by the inclusion of illegal characters. John
Walker has written an amusing article and an excellent program to purge
documents of these problems; see
http://www.fourmilab.ch/webtools/demoroniser/
A: The demoronizer seems to have a bug in it. The page is
written in UTF-8, and it contains the character U+2014 (EM DASH), which
is a perfectly reasonable character. In UTF-8 it is encoded as the byte
sequence E2 80 94. It appears that the demoronizer ignores the charset
parameter, interprets the page as ISO-8859 or some other charset, sees
the 80 byte, and marks it as an error. We generally try to run our pages
through the W3C validator, which has been upgraded to recognize UTF-8.
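The byte sequence mentioned above is easy to check. The following is a minimal Python sketch, not the demoronizer's actual logic: it assumes, for illustration only, that a tool ignores the declared charset and reads the bytes as ISO-8859-1.

```python
# U+2014 EM DASH encoded as UTF-8 gives the three bytes E2 80 94.
em_dash = "\u2014"
utf8_bytes = em_dash.encode("utf-8")
print(utf8_bytes.hex(" ").upper())  # E2 80 94

# Decoding with the declared charset recovers the single character.
assert utf8_bytes.decode("utf-8") == em_dash

# Assumption for illustration: a tool that ignores the declaration and
# reads the same bytes as ISO-8859-1 sees three separate characters,
# two of which (U+0080 and U+0094) are C1 control codes.
latin1_view = utf8_bytes.decode("iso-8859-1")
print([f"U+{ord(ch):04X}" for ch in latin1_view])
```

A tool that treats the 80 byte as suspicious in a Latin-1 page would therefore flag a perfectly valid UTF-8 page.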
Q: We are setting up a database for use
with our web server. I understand that if I want to store data into a
database, I need to use a consistent character encoding scheme. Does
Unicode cover all the character sets we need for a web server?
A: Yes, Unicode works perfectly on the backend for keeping
all of your data in a consistent format.
Q: Now comes the problem of delivery of
pages. Since we will have text from different languages and scripts on
our pages, what are our options?
A: In HTML (or XML) you can either use NCRs, or you can
choose a charset that will contain all of the characters on the page.
Q: What are NCRs and CERs?
A: NCRs are numeric character references and CERs are character
entity references. Instead of simply including a character such as an “a”
in a file, you can write it using its character code, as
“&#x61;” (the hex value) or “&#97;” (the decimal value). You can find
these values in the code charts at
http://www.macchiato.com/unicode/charts.html.
Few people use this for ASCII, of course, but it does allow
you to put the occasional character such as a trademark sign (™, written
&#x2122;) or alpha (α, written &#x3B1;) in your text. CERs are similar,
except that they use abbreviations, such as “&eacute;” for é, instead of
numbers.
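To make the hex and decimal forms concrete, here is a small Python sketch. The helper names to_hex_ncr and to_decimal_ncr are invented for this example; html.unescape is the standard-library function that resolves both NCRs and CERs.

```python
import html

def to_hex_ncr(ch: str) -> str:
    """Hypothetical helper: hexadecimal NCR for a single character."""
    return f"&#x{ord(ch):X};"

def to_decimal_ncr(ch: str) -> str:
    """Hypothetical helper: decimal NCR for a single character."""
    return f"&#{ord(ch)};"

print(to_hex_ncr("a"), to_decimal_ncr("a"))          # &#x61; &#97;
print(to_hex_ncr("\u2122"), to_hex_ncr("\u03b1"))    # &#x2122; &#x3B1;

# Going the other way: NCRs and CERs resolve to the same characters.
print(html.unescape("&#x61; &#97; &eacute;"))         # a a é
```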
Q: What are the pros of using NCRs (and
CERs)?
A: NCRs can be useful when:
a) You know what the Unicode value is (or the
abbreviation), but don’t have a way to enter the character directly in
the output character set.
b) Your tools don’t let you edit Unicode text directly.
c) You cannot tell which of the similar-looking characters
your editor is using, and want to get the precise value.
Q: What are the cons of using NCRs?
A: NCRs are:
a) hard to maintain (do you read code points and/or
abbreviations as well as text?)
b) hard to format
c) not well handled by many search engines
d) most importantly: not compatible with as many browsers
as UTF-8
Q: How can I ensure that my document uses
an encoding that will not require the use of NCRs?
A: If you need a multilingual document that spans charsets
or you do not want to have to keep track of such things, then UTF-8 is
the best alternative. Using UTF-8 directly is much more maintainable
than using NCRs, since it is far easier for people to work with the text
than with the codepoints. To set the charset to be UTF-8, use the
following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Q: Will my HTML editor automatically fix
NCRs for me?
A: Yes, when you reset the charset on a page, if the right
option is set a good editor will add NCRs when necessary, and convert
unnecessary ones into regular characters. For example,
<p>Σ and Я</p> // charset=utf-8
<p>&#931; and Я</p> // charset=iso-8859-5
<p>&#931; and &#1071;</p> // charset=iso-8859-1
<p>Σ and &#1071;</p> // charset=iso-8859-7
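If your editor lacks such an option, the "add NCRs when necessary" half can be approximated with Python's standard xmlcharrefreplace error handler. This is only a sketch: the function name with_ncrs is made up here, and it does not cover converting unnecessary NCRs back into regular characters.

```python
# U+03A3 GREEK CAPITAL LETTER SIGMA and U+042F CYRILLIC CAPITAL LETTER YA
text = "\u03a3 and \u042f"

def with_ncrs(s: str, charset: str) -> str:
    """Hypothetical helper: replace characters the charset cannot
    represent with decimal NCRs, then decode again for display."""
    return s.encode(charset, errors="xmlcharrefreplace").decode(charset)

for charset in ("utf-8", "iso-8859-5", "iso-8859-1", "iso-8859-7"):
    print(f"<p>{with_ncrs(text, charset)}</p> // charset={charset}")
```

ISO-8859-5 (Cyrillic) can hold Я but not Σ, ISO-8859-7 (Greek) can hold Σ but not Я, and ISO-8859-1 can hold neither, so each charset gets NCRs only where it needs them.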
Q: We are using forms in HTML. If we use
Unicode for all of our HTML pages, does that mean that once the forms
are submitted, the user input also comes back in Unicode (i.e. the
web server is able to map the local charset to the Unicode one)?
A: If you have a single CGI and a single HTML form, then
the browsers will return the data in the encoding of the original form,
so there is no ambiguity about the charset. If you have a single CGI and
multiple (localized) HTML forms which may use different charsets,
then it may not be so simple. While there is a protocol for revealing
the charset of a submitted form, it is not always used. Some people use
the following skanky trick to get around this: include a hidden field in
your form with known characters in it. Based upon the bytes that get
sent to you, you can determine the charset that the user typed in. Ugly,
but it seems to work.
Q: How does that work, exactly?
A: The hidden characters will be converted to the user’s
charset (like the rest of the form) when it is submitted to you. So by
putting in, say, a YE on a Russian page, you can look at the bytes that
you receive. Based on those bytes, you decide which of the Russian
character sets was used.
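As a rough illustration of the trick, the Python sketch below compares the bytes received for the hidden field against the bytes the probe character would have in each candidate charset. The probe character, the candidate list, and the name detect_charset are all assumptions made for this example.

```python
from urllib.parse import unquote_to_bytes

# Assumed probe: U+0435 CYRILLIC SMALL LETTER IE, placed in a hidden field.
PROBE = "\u0435"

# Assumed candidates: charsets commonly used for Russian text.
CANDIDATES = ("utf-8", "koi8-r", "windows-1251", "iso-8859-5")

# Precompute what the probe looks like in each candidate charset.
SIGNATURES = {PROBE.encode(cs): cs for cs in CANDIDATES}

def detect_charset(probe_bytes: bytes):
    """Return the candidate charset whose encoding of the probe matches
    the received bytes, or None if nothing matches."""
    return SIGNATURES.get(probe_bytes)

# Form data arrives percent-encoded, so decode it to raw bytes first.
print(detect_charset(unquote_to_bytes("%E5")))     # windows-1251
print(detect_charset(unquote_to_bytes("%D0%B5")))  # utf-8
```

The probe only works if its byte sequence differs in every candidate charset, which is true for the four listed here.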
Q: When we send email to people in each
country with their data, do we need to convert the Unicode data coming
from the database into each individual charset?
A: Although all modern browsers and email programs will
handle UTF-8, some people may be using emailers that do not.
Since, unlike HTTP, email has no handshake to determine which
charsets the receiving program will accept, for now you
probably do need to translate the text to a charset that is specific to
the user. If you retain the character set in which the user corresponded
with you, you can use that. Otherwise you can use one of the common
character sets used in email in the user’s country of origin, if you
know that.
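One way to apply this with Python's standard email package is sketched below. The function name build_message and the fall-back to UTF-8 when the preferred charset cannot represent the text are assumptions of this example, not something the answer above prescribes.

```python
from email.message import EmailMessage

def build_message(body: str, preferred_charset: str) -> EmailMessage:
    """Hypothetical helper: use the recipient's charset when it can
    represent the whole body; otherwise fall back to UTF-8 (assumption)."""
    try:
        body.encode(preferred_charset)
        charset = preferred_charset
    except (UnicodeEncodeError, LookupError):
        # Unencodable text or an unknown charset name.
        charset = "utf-8"
    msg = EmailMessage()
    msg.set_content(body, charset=charset)
    return msg

russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Privet" in Cyrillic
print(build_message(russian, "koi8-r").get_content_charset())
print(build_message(russian, "iso-8859-1").get_content_charset())
```

Here preferred_charset would be the charset the user last wrote to you in, or a common one for the user's country.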
Q: I'm worried about the extra size that
my web pages will take if they are encoded in UTF-8. Won't some
languages be at a disadvantage?
A: As far as size goes, it is worthwhile looking at some
real data samples. The following are from a page on the Unicode site
that is translated into different languages, so it has essentially the
same information on each page.
| Size  | Page             |
| 8882  | s-chinese.html   |
| 8946  | t-chinese.html   |
| 9347  | esperanto.html   |
| 9498  | maltese.html     |
| 9739  | icelandic.html   |
| 9833  | czech.html       |
| 9944  | welsh.html       |
| 10064 | danish.html      |
| 10109 | swedish.html     |
| 10127 | polish.html      |
| 10219 | interlingua.html |
| 10221 | italian.html     |
| 10297 | spanish.html     |
| 10308 | portuguese.html  |
| 10312 | lithuanian.html  |
| 10329 | german.html      |
| 10376 | romanian.html    |
| 10401 | korean.html      |
| 10506 | french.html      |
| 10726 | japanese.html    |
| 10953 | hebrew.html      |
| 11192 | arabic.html      |
| 13292 | greek.html       |
| 13870 | russian.html     |
| 13892 | persian.html     |
| 14549 | hindi.html       |
| 15337 | georgian.html    |
| 15853 | deseret.html     |
So the best case (8882 bytes) is about 56% of the worst case (15853
bytes). Some of this is due to the encoding, and some is due to different
languages just using different numbers of characters. However, when you
look at web pages in general use, the amount of text (in bytes) is really
swamped by graphics, JavaScript, HTML code, and so on. So fundamentally,
even the variations above are not that important in practice.
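To see the encoding part of the difference in isolation, the following Python sketch prints how many bytes UTF-8 needs for one character from several scripts. The sample characters are chosen for illustration only and are not taken from the pages measured above.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# Illustrative characters, one per script (assumed samples).
samples = {
    "LATIN SMALL LETTER A": "\u0061",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "GREEK SMALL LETTER ALPHA": "\u03b1",
    "CYRILLIC SMALL LETTER YA": "\u044f",
    "DEVANAGARI LETTER HA": "\u0939",
    "CJK UNIFIED IDEOGRAPH-4F60": "\u4f60",
    "DESERET CAPITAL LETTER LONG I": "\U00010400",
}

for name, ch in samples.items():
    print(f"{utf8_len(ch)} byte(s)  U+{ord(ch):04X}  {name}")
```

Chinese characters take three bytes each, yet the Chinese pages are the smallest in the table, which fits the point that the number of characters a language needs matters as much as the bytes per character.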
Q: Where can I find out more about using Unicode on the Web?
A: The W3C maintains FAQs and HTML authoring guidelines
for international users under the auspices of the Internationalization Working Group's GEO Task Force,
which you can find at http://www.w3.org/International/geo [AP]
Q: Are there any more resources about Unicode on the web?
A: There are also mailing lists that specialize in answering questions about Web technology,
for example www-international@w3.org. Information on subscribing to that list is at
http://www.w3.org/International/core [AP]