Unicode and the Web

Q: What character encoding should I use for my Web pages?

You should always use UTF-8. See the W3C article Choosing & applying a character encoding.

Q: What charset declaration should I use?

Usually, your editor will include the declaration automatically when creating a blank page. When creating a new page manually, use:

<meta charset="utf-8">

with “utf-8” (or case-insensitive variations) as the only recommended value for new documents.
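As a minimal sketch of creating such a page by hand, the following Python snippet builds a page containing the declaration and encodes it as UTF-8, so that the declared encoding matches the actual bytes. The helper name `build_page` and the sample text are hypothetical, chosen only for illustration.

```python
from html import escape


def build_page(title: str, body: str) -> bytes:
    """Return a minimal HTML page as UTF-8 bytes (hypothetical helper)."""
    page = (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        f"<body><p>{escape(body)}</p></body>\n"
        "</html>\n"
    )
    # The declaration is only truthful if the bytes really are UTF-8.
    return page.encode("utf-8")


page_bytes = build_page("Greetings", "Grüße, 你好, مرحبا")
```

When writing to disk, the bytes should be written unchanged (for example with a file opened in binary mode) so that no other encoding is applied on the way out.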

Q: How should I encode international characters in URLs?

For details and basic concepts, see this Introduction to Multilingual Web Addresses. For other issues in IDNs, see the IDN FAQ.
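As a rough illustration of the usual approach (Punycode for the domain name, percent-encoded UTF-8 for the path), here is a small Python sketch. The function name `to_ascii_url` and the sample address are hypothetical, and Python's built-in `idna` codec follows the older IDNA 2003 rules, so treat this as a sketch rather than a complete IDN implementation.

```python
from urllib.parse import quote


def to_ascii_url(host: str, path: str) -> str:
    """Return an ASCII-only form of an address (hypothetical helper)."""
    # Domain name: each non-ASCII label becomes a Punycode "xn--" label.
    ascii_host = host.encode("idna").decode("ascii")
    # Path: non-ASCII characters become percent-encoded UTF-8 bytes.
    ascii_path = quote(path, safe="/")
    return f"https://{ascii_host}{ascii_path}"


print(to_ascii_url("bücher.example", "/café"))
# https://xn--bcher-kva.example/caf%C3%A9
```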

Q: We are setting up a database for use with our web server. Does Unicode cover all the character sets we need for a web server?

For a database, it is not only necessary to be able to encode all the data, but critically important to use a consistent character encoding scheme. Unicode greatly simplifies the storage of multilingual data. Non-Unicode database data is typically spread across multiple languages and code pages, and managing or extending such a mix is unnecessarily complex and difficult.

Unicode works well on the back end and covers all the characters you need. (Unicode explicitly contains many characters solely to guarantee complete conversion from common non-Unicode character sets.)
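To illustrate the point about complete conversion, the sketch below uses Shift_JIS as an assumed example of a legacy character set. Shift_JIS distinguishes a full-width “Ａ” from the ordinary “A”, and Unicode encodes both, so the data survives a trip through Unicode and back. The sample data is made up for illustration.

```python
import unicodedata

# Stand-in for text stored in a legacy code page (Shift_JIS here).
legacy_bytes = "ＡA".encode("shift_jis")

text = legacy_bytes.decode("shift_jis")   # legacy -> Unicode
round_trip = text.encode("shift_jis")     # Unicode -> legacy

# The two letters stay distinct in Unicode, so nothing is lost.
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(round_trip == legacy_bytes)
```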

Q: How should we handle text in different languages and scripts on our pages?

All modern browsers handle characters as Unicode internally, so for HTML (or XML) you should simply set UTF-8 as the charset. (And if your internal data aren't in Unicode yet, make sure to convert them as you build the page.)
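A minimal sketch of that conversion step, assuming each stored record is tagged with the legacy encoding it was saved in. The record list and the helper name `to_utf8_fragment` are hypothetical.

```python
from html import escape

# Hypothetical records: raw bytes plus the encoding they were stored in.
records = [
    (b"Gr\xfc\xdfe", "iso-8859-1"),
    ("Привет".encode("koi8_r"), "koi8_r"),
    ("こんにちは".encode("shift_jis"), "shift_jis"),
]


def to_utf8_fragment(rows) -> bytes:
    """Decode each record to Unicode, then emit one UTF-8 HTML fragment."""
    paragraphs = []
    for raw, encoding in rows:
        text = raw.decode(encoding)  # legacy bytes -> Unicode string
        paragraphs.append(f"<p>{escape(text)}</p>")
    return "\n".join(paragraphs).encode("utf-8")


fragment = to_utf8_fragment(records)
print(fragment.decode("utf-8"))
```

Decoding at the boundary means the rest of the page-building code only ever sees Unicode strings, whatever the original code page was.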

Q: What are Numerical and Named Character References?

Instead of simply including a character such as an “a” in a file, you can write it using its Unicode code point value as a Numerical Character Reference (NCR), such as &#x61; (using the hexadecimal code point value) or &#97; (using the decimal code point value). For help with calculating hexadecimal and decimal NCRs, see the Unicode code converter page.

Named character references are similar, except that they use abbreviations instead of numbers, such as &eacute; for “é”.

This can be useful when you don’t have a character on your keyboard, such as a trademark sign (™) or alpha (α). It can also be useful to make invisible or visually ambiguous characters clear in your source code, such as distinguishing a no-break space (&#xA0; or &nbsp;) from an ordinary space, or showing the presence of a RIGHT-TO-LEFT MARK (&#x200F; or &rlm;).
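For illustration, the following Python sketch computes hexadecimal and decimal NCRs from a character and resolves numeric and named references back to characters with the standard library's `html.unescape`. The helper names `hex_ncr` and `dec_ncr` are hypothetical.

```python
import html


def hex_ncr(ch: str) -> str:
    """Hexadecimal NCR for a single character (hypothetical helper)."""
    return f"&#x{ord(ch):X};"


def dec_ncr(ch: str) -> str:
    """Decimal NCR for a single character (hypothetical helper)."""
    return f"&#{ord(ch)};"


print(hex_ncr("a"), dec_ncr("a"))            # &#x61; &#97;
print(hex_ncr("\u00a0"), hex_ncr("\u200f"))  # &#xA0; &#x200F;

# Numeric and named references resolve to the same characters.
print(html.unescape("&#x61; &#97; &eacute; &trade; &alpha;"))
```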

You should avoid overusing NCRs, because they make source text harder to read when direct character input would suffice, and they take longer to create. A similar character escape mechanism can be used in CSS, but the format is slightly different. For more information about character escapes on the Web, see the W3C page Using character escapes in markup and CSS.
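To show how the CSS format differs, here is a small sketch. A CSS escape is a backslash followed by the hexadecimal code point, and a trailing space ends the escape so that a following hex digit is not absorbed into it. The helper name `css_escape` and the sample rule are hypothetical.

```python
def css_escape(ch: str) -> str:
    """CSS escape for a single character (hypothetical helper)."""
    # Backslash, hex code point, then a space that terminates the escape.
    return f"\\{ord(ch):X} "


escaped = css_escape("™")
rule = f'.note::before {{ content: "{escaped}"; }}'
print(rule)
```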

Use of NCRs interferes with automatic normalization applied by some editors. This can be desirable in documents that discuss normalization and need to show examples, but should otherwise be avoided. [RI]
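The effect can be sketched in Python, using NFC as an assumed example of the normalization an editor might apply. A literal “e” followed by a combining acute accent is composed into a single character, while the same sequence written with an NCR is plain ASCII text and passes through unchanged.

```python
import unicodedata

literal = "e\u0301"    # "e" + COMBINING ACUTE ACCENT as real characters
with_ncr = "e&#x301;"  # same sequence, accent written as an NCR

normalized_literal = unicodedata.normalize("NFC", literal)
normalized_ncr = unicodedata.normalize("NFC", with_ncr)

print(len(literal), len(normalized_literal))  # 2 1
print(normalized_ncr == with_ncr)             # True
```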

Q: Will my HTML editor automatically fix NCRs for me?

A good editor can be configured to convert NCRs to characters.
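What such a conversion involves can be sketched as follows. This is not how any particular editor works. It is a hypothetical `ncrs_to_chars` helper that replaces numeric references with the characters they stand for, while leaving references to markup-significant characters untouched so the document structure does not change. A real tool might also preserve NCRs for invisible characters.

```python
import re

# Matches hexadecimal (&#x...;) and decimal (&#...;) references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")
# Characters that must stay escaped to keep the markup intact.
_KEEP_ESCAPED = set("<>&\"'")


def ncrs_to_chars(source: str) -> str:
    """Replace numeric character references with literal characters."""
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave out-of-range values and surrogates as they are.
        if code_point == 0 or code_point > 0x10FFFF:
            return match.group(0)
        if 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        ch = chr(code_point)
        return match.group(0) if ch in _KEEP_ESCAPED else ch

    return _NCR.sub(replace, source)


print(ncrs_to_chars("caf&#xE9; &#8482; &#x3C;b&#62; &amp;"))
```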

Q: When we send email to people in each country with their data – do we need to convert the Unicode data coming from the database into each individual charset?

Although all modern browsers and email programs handle UTF-8, some people may be using email programs that do not. Since, unlike HTTP, email has no handshake to determine which charsets the receiving program will accept, at this point in time you probably do need to convert the text to a charset that suits the user. If you have retained the character set in which the user corresponded with you, you can use that. Otherwise, you can use one of the character sets commonly used for email in the user’s country of origin, if you know it.
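A minimal sketch of that conversion with Python's standard `email` package, assuming the preferred charset for each user is already known (for example, remembered from earlier correspondence). The helper name `build_message` and the subject line are hypothetical. The sketch falls back to UTF-8 when the text cannot be represented in the preferred charset.

```python
from email.message import EmailMessage


def build_message(text: str, preferred_charset: str) -> EmailMessage:
    """Build a message body in the user's charset, or UTF-8 as fallback."""
    try:
        # Check that every character survives the conversion.
        text.encode(preferred_charset)
        charset = preferred_charset
    except (UnicodeEncodeError, LookupError):
        # Unrepresentable characters or an unknown charset name.
        charset = "utf-8"

    message = EmailMessage()
    message["Subject"] = "Your account data"
    message.set_content(text, charset=charset)
    return message


print(build_message("café", "iso-8859-1")["Content-Type"])
print(build_message("日本語", "iso-8859-1")["Content-Type"])
```

Falling back rather than substituting characters keeps the user's data intact, at the cost of sending UTF-8 to a program that may not display it.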

Q: Where can I find out more about using Unicode on the Web?

The W3C maintains FAQs and HTML authoring guidelines under the auspices of the Internationalization Working Group: http://www.w3.org/International/core. You can also find information there about subscribing to lists that specialize in answering questions about Web technology and Unicode. For background reading on character encoding issues in the W3C context, see https://www.w3.org/International/techniques/authoring-html#charset. [AP]