[Unicode] Frequently Asked Questions

Unicode and the Web

Q: My web page is in Latin-1 (ISO-8859-1). So I don't need a charset declaration, right?

A: Wrong. You always need a charset declaration, even when you are using Latin-1. To quote from the HTML specification:

The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter. — HTML 4.01

Thus you should always include a charset declaration of the following form in the <head> element:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
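
If you control the web server, the same information can also be sent in the HTTP Content-Type header itself. The following is a minimal sketch (a Python CGI script, not part of the original FAQ) showing both the header and the matching declaration in the page:

# A minimal CGI sketch: send the charset in the HTTP Content-Type
# header as well as in the page's own <meta> declaration.
print("Content-Type: text/html; charset=iso-8859-1")
print()                       # blank line ends the HTTP headers
print('<html><head>'
      '<meta http-equiv="Content-Type" '
      'content="text/html; charset=iso-8859-1">'
      '</head><body>Caf&eacute;</body></html>')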

Your HTML editor will usually give you an easy way to do this. For example, in Microsoft FrontPage you select File > Properties > Language > HTML Encoding > "Save the document as", and pick the encoding you want.

Q: How should I encode international characters in URLs?

A: See http://www.w3.org/TR/charmod/#URIs

Q: The appearance of some of the pages on the Unicode site is flawed by the inclusion of illegal characters. John Walker has written an amusing article and an excellent program to purge documents of these problems; see http://www.fourmilab.ch/webtools/demoroniser/

A: The demoroniser seems to have a bug in it. The page is written in UTF-8, and it contains the character U+2014 (EM DASH), which is a perfectly reasonable character. In UTF-8 it is encoded as the byte sequence E2 80 94. It appears that the demoroniser ignores the charset parameter, interprets the page as iso-8859-1 or some other charset, sees the 80 byte, and marks it as an error. We generally try to run our pages through the W3C validator, which has been upgraded to recognize UTF-8.

Q: We are setting up a database for use with our web server. I understand that if I want to store data in a database, I need to use a consistent character encoding scheme. Does Unicode cover all the character sets we need for a web server?

A: Yes, Unicode works perfectly on the backend for keeping all of your data in a consistent format.

Q: Now comes the problem of delivery of pages. Since we will have text from different languages and scripts on our pages, what are our options?

A: In HTML (or XML) you can either use NCRs, or you can choose a charset that will contain all of the characters on the page.

Q: What are NCRs and CERs?

A: NCR stands for numeric character reference, and CER for character entity reference. Instead of simply including a character such as "a" in a file, you can instead write it using its character code, either as "&#x61;" (the hex value) or as "&#97;" (the decimal value). You can find these values in the code charts at http://www.macchiato.com/unicode/charts.html.

Few people use this for ASCII, of course, but it does allow you to put the occasional character such as a trademark sign (™) or alpha (α) into your text. CERs are similar, except that they use abbreviations, such as "&eacute;", instead of numbers.
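
As an illustration (a sketch, not part of the original FAQ), both NCR forms can be produced mechanically from a character's code point; in Python:

# A minimal sketch: build the hex and decimal NCR forms of a character.
def ncr_hex(ch):
    return "&#x%X;" % ord(ch)

def ncr_dec(ch):
    return "&#%d;" % ord(ch)

print(ncr_hex("™"), ncr_dec("™"))   # prints: &#x2122; &#8482;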

Q: What are the pros of using NCRs (and CERs)?

A: NCRs can be useful when:

a) You know what the Unicode value is (or the abbreviation), but don't have a way to enter the character directly in the output character set.

b) Your tools don’t let you edit Unicode text directly.

c) You cannot tell which of several similar-looking characters your editor is using, and want to pin down the precise value.

Q: What are the cons of using NCRs?

A: NCRs are:

a) hard to maintain (can you read code points and abbreviations as easily as plain text?)

b) hard to format

c) not well handled by many search engines

d) most importantly: not compatible with as many browsers as UTF-8

Q: How can I ensure that my document uses an encoding that will not require the use of NCRs?

A: If you need a multilingual document that spans charsets or you do not want to have to keep track of such things, then UTF-8 is the best alternative. Using UTF-8 directly is much more maintainable than using NCRs, since it is far easier for people to work with the text than with the codepoints. To set the charset to be UTF-8, use the following meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
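
As a concrete illustration (a sketch with a hypothetical file name, not from the original FAQ), saving a multilingual page as UTF-8 in Python looks like this:

# A minimal sketch: write a multilingual page out as UTF-8, with the
# charset declared in the <head> element. "page.html" is hypothetical.
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"></head>'
        '<body><p>Σ and Я</p></body></html>')
with open("page.html", "w", encoding="utf-8") as f:
    f.write(page)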

Q: Will my HTML editor automatically fix NCRs for me?

A: Yes. When you reset the charset on a page, a good editor (if the right option is set) will add NCRs where necessary, and convert unnecessary ones back into regular characters. For example:

<p>Σ and Я</p>              // charset=utf-8
<p>&#931; and Я</p>         // charset=iso-8859-5
<p>&#931; and &#1071;</p>   // charset=iso-8859-1
<p>Σ and &#1071;</p>        // charset=iso-8859-7
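
The same conversion is easy to sketch in code. The following Python function (hypothetical, not any editor's actual implementation) replaces every character that the target charset cannot encode with its decimal NCR:

# A sketch: escape characters that the target charset cannot encode
# as decimal NCRs, leaving everything else alone.
def escape_for_charset(text, charset):
    out = []
    for ch in text:
        try:
            ch.encode(charset)
            out.append(ch)
        except UnicodeEncodeError:
            out.append("&#%d;" % ord(ch))
    return "".join(out)

print(escape_for_charset("Σ and Я", "iso-8859-5"))   # prints: &#931; and Я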

Q: We are using forms in HTML. If we use Unicode for all of our HTML pages, does that mean that once the forms are submitted, the user input also comes back in Unicode (i.e. the web server is able to map the local charset to Unicode)?

A: If you have a single CGI script and a single HTML form, then browsers will return the data in the encoding of the original form, so there is no ambiguity about the charset. If you have a single CGI script and multiple (localized) HTML forms which may use different charsets, then it may not be so simple. While there is a protocol for revealing the charset of a submitted form, it is not always used. Some people use the following skanky trick to get around this: include a hidden field in your form with known characters in it. Based upon the bytes that get sent to you, you can determine the charset that the user typed in. Ugly, but it seems to work.

Q: How does that work, exactly?

A: The hidden characters will be converted to the user's charset (like the rest of the form) when the form is submitted to you. So by putting, say, a Cyrillic Ye in a Russian page, you can look at the bytes that you receive; based on those bytes, you can decide which of the Russian character sets was used.
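
To make this concrete, here is a minimal Python sketch (the function and field names are hypothetical) using U+0415 (CYRILLIC CAPITAL LETTER IE) as the sentinel character; the byte values are that character's actual encodings in each charset:

# A sketch of the hidden-field trick: the raw bytes received for the
# sentinel character U+0415 identify the charset the browser used.
SENTINEL_BYTES = {
    b"\xd0\x95": "utf-8",
    b"\xc5": "windows-1251",
    b"\xe5": "koi8-r",
    b"\xb5": "iso-8859-5",
}

def detect_charset(raw_sentinel):
    # raw_sentinel: the undecoded bytes of the hidden field
    return SENTINEL_BYTES.get(raw_sentinel, "unknown")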

Q: When we send email to people in each country with their data, do we need to convert the Unicode data coming from the database into each individual charset?

A: Although all modern browsers and email programs will handle UTF-8, some people may be using mailers that do not. Since, unlike HTTP, there is no handshake to determine which charsets the email program will accept, at this point in time you probably do need to translate the text into a charset that is specific to the user. If you retain the character set in which the user corresponded with you, you can use that. Otherwise you can use one of the common character sets used in email in the user's country of origin, if you know that.
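
A transcoding step like the following sketch (the default charset and fallback policy are assumptions, not from the FAQ) could then sit between the UTF-8 database and the outgoing mail:

# A sketch: transcode message text from the UTF-8 database into the
# charset recorded for this recipient. Characters the target charset
# cannot represent are replaced rather than raising an error.
def encode_for_recipient(text, charset="koi8-r"):
    return text.encode(charset, errors="replace")

body = encode_for_recipient("Здравствуйте!")   # bytes ready for the mailer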

Q: I'm worried about the extra size that my web pages will take if they are encoded in UTF-8. Won't some languages be at a disadvantage?

A: As far as size goes, it is worthwhile looking at some real data samples. The following are from a page on the Unicode site that is translated into different languages, so it has essentially the same information on each page.

Size (bytes)    Page
 8882           s-chinese.html
 8946           t-chinese.html
 9347           esperanto.html
 9498           maltese.html
 9739           icelandic.html
 9833           czech.html
 9944           welsh.html
10064           danish.html
10109           swedish.html
10127           polish.html
10219           interlingua.html
10221           italian.html
10297           spanish.html
10308           portuguese.html
10312           lithuanian.html
10329           german.html
10376           romanian.html
10401           korean.html
10506           french.html
10726           japanese.html
10953           hebrew.html
11192           arabic.html
13292           greek.html
13870           russian.html
13892           persian.html
14549           hindi.html
15337           georgian.html
15853           deseret.html

So the best case is about 56% of the worst case (8882 versus 15853 bytes). Some of this is due to the encoding, and some is due to different languages simply using different numbers of characters. However, when you look at web pages in general use, the amount of text (in bytes) is really swamped by graphics, JavaScript, HTML markup, and so on. So fundamentally, even the variations above are not that important in practice.
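
If you want to measure this for your own content, a quick Python sketch (the sample strings are arbitrary) is:

# A quick sketch: report how many bytes a piece of text occupies in
# UTF-8, next to its length in characters.
samples = {"russian": "Пример текста", "greek": "Δείγμα κειμένου"}
for lang, text in samples.items():
    print(lang, len(text), "characters,",
          len(text.encode("utf-8")), "bytes in UTF-8")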

Q: Where can I find out more about using Unicode on the Web?

A: It turns out that the W3C does, in fact, maintain FAQs and HTML authoring guidelines for international users under the auspices of the Internationalization Working Group's GEO Task Force, which you can find at http://www.w3.org/International/geo [AP]

Q: Are there any more resources about Unicode on the web?

A: There are also mailing lists that specialize in answering questions about Web technology, for example www-international@w3.org. Information on subscribing to that list is at http://www.w3.org/International/core [AP]