Basic Questions
Q: What is Unicode?
A: Unicode is the universal character encoding, maintained by
the Unicode Consortium.
This encoding standard provides the basis for processing, storage and interchange of text data in any language in
all modern software and information technology protocols. See "What is Unicode?" for a short explanation of
what Unicode is all about. That page is translated into more than 50 languages,
to illustrate the use of the standard. See for yourself!
Q: What is the scope of Unicode?
A: Unicode covers all the characters for all the writing
systems of the world, modern and ancient. It also includes technical
symbols, punctuation, and many other characters used in writing text.
The Unicode Standard is intended to support the needs of all types of
users, whether in business or academia, using mainstream or minority
scripts.
Q: How many languages are covered by
Unicode?
A: It's hard to say, because Unicode encodes scripts for
languages, rather than languages per se. Many scripts (especially the
Latin script) are used to write a large number of languages. The easiest
answer is that Unicode covers all of the languages that can be written in
the following scripts:
Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari,
Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic,
Cherokee, Canadian Aboriginal Syllabics, Khmer, Mongolian,
Han (Japanese, Chinese, Korean ideographs), Hiragana, Katakana, and Yi.
Unicode also includes many historic scripts used to write long-dead
languages, as well as lesser-used regional scripts that may be used as a
second (or even third) way to write a particular language. See
Supported Scripts
for the full list. See also the list of
Languages and Scripts.
Q: Does Unicode encode scripts or languages?
A: The Unicode Standard encodes characters on a per script
basis. So, for example, there is only one set of Latin characters defined,
despite the fact that the Latin script is used for the alphabets of
thousands of different languages. The same principle applies to any other
script (Cyrillic, Arabic, Ethiopic, Devanagari, ...) that is used to
write many different languages. However, the Unicode Standard does not
encode scripts per se. For a listing of script names, see
UAX #24,
Unicode Script Property. For the ISO standard for script codes, see
ISO/IEC 15924,
Code for the Representation of Names of Scripts. For the ISO standard for
language codes, see ISO 639, Code for the Representation of Names of
Languages.
Q: Why does Unicode unify Chinese, Japanese, and Korean ideographs, but not unify the Latin, Greek, and Cyrillic alphabets?
A: For a detailed answer, see
http://www.unicode.org/notes/tn26/.
Q: What's the connection between Unicode and the International Standard,
ISO/IEC 10646?
A: Both 10646 and Unicode specify the same character encoding: they
contain the same characters at the same locations. They remain fully
synchronized even as they are extended to cover additional characters. See the
Unicode and ISO
10646 FAQ and
Appendix
C of the Unicode Standard for a more extensive explanation of their
relationship.
Q: I think my company might want to get involved in Unicode. Is there any material that I can use to present the case to my management?
A: Yes, there is a white paper outlining the overall value proposition of a Unicode membership to an organization.
See Why Join
and How to Join.
Q: Where can I purchase the Unicode software or the Unicode font?
A: The Unicode Standard is not a software program, nor is
it a font. It is a character encoding system, like ASCII, designed
to help developers who want to create software applications that work in
any language in the world.
If all you need is to create a multilingual text or write a
document or send e-mail in another language, then a Unicode-compliant text
editor, mail program, or word processing package will do the job. Please see the following pages on our web site for further information
about the standard and where to look for help:
If you are a developer starting to learn about using Unicode,
you should read the
latest version of the
Unicode Standard to find
out more about Unicode. In addition to the pages listed above, please see:
Q: My computer cannot display some of the latest Unicode symbols I need. I tried downloading and extracting the latest Unicode data files from the Unicode web site, but it has no effect on the characters my computer can display or type. How can I display and type the latest Unicode characters?
A: The Unicode data files do not function like a software patch, and cannot automatically update existing fonts or applications, so downloading the files will not help in displaying and typing the Unicode characters needed.
The reason you don't see the characters as expected is most likely because you need to install a font that covers the set of Unicode characters you are trying to see. Other possible reasons might be that:
- your operating system needs to be updated (older operating systems such as Windows XP, which came out in 2001, don't provide expected support for some new characters)
- your application doesn't support Unicode properly (though most do)
If you need to install a font to resolve the problem, free fonts can be downloaded for many Unicode ranges. See Font Resources, or search in your browser for the name of the font you need. Fonts typically cover only one script, or sometimes a range of scripts. Often fonts haven't been updated to render the most recent additions to the Unicode character set.
See also Display Problems?
Q: I can't find my character in Unicode.
Where do I look?
A: Look at "Where
is my Character?"
Q: Where do I find information on the use of characters for a given writing system or script?
A: The block introductions found in Chapters 7 through 20 of the Unicode Standard are a good place to start. Another place to look is the comments contained in the names list, which accompanies the code charts, although the comments are not intended to be encyclopedic. The data files in the Unicode Character Database provide information, often in machine-readable form, on character properties, line breaking, word breaking, and so on.
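As a rough illustration of what "machine-readable" character properties look like in practice, the sketch below queries a few of them through Python's standard `unicodedata` module, which is built from a snapshot of the Unicode Character Database. The helper name `describe` is made up for this example, and the Unicode version covered depends on the Python build in use.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return a few Unicode Character Database properties for one character.

    `describe` is a hypothetical helper for illustration only.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "canonical_combining_class": unicodedata.combining(ch),
    }

# A Latin letter, a Hebrew letter, and a combining mark.
for ch in ("A", "\u05D0", "\u0301"):
    print(describe(ch))

# Version of the Unicode data this Python build was compiled with.
print(unicodedata.unidata_version)
```

Properties that the module does not expose, such as line breaking or word breaking classes, would have to be read from the data files themselves.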
Q: Are script descriptions in the block introductions complete?
A: No. They cover the information necessary to define the encoded characters, but issues such as usage conventions, layout behavior and glyph design are usually covered only as far as needed to help establish the identity of an encoded character.
Q: Where do I go to find more information about characters for a given script?
A: Consult the bibliography in the References section of the Unicode Standard (section R.3). Also check the original proposals to encode the scripts. Those are the documents in which the characters were proposed for encoding. While the proposals are not authoritative and do not have any formal status, they were used in the process of committee deliberation. They often contain useful information, including examples or lists of references.
Q: Where do I find script proposals for a specific script?
A: Most proposals are available in the UTC Document Registry. You can also search for specific topics on the Unicode website to find proposals. Many proposals are also available on the JTC 1/SC2/WG2 website. Individually maintained websites may also include links to particular script proposals.
Q: Where can I find resources to help me with Unicode?
A: Here's a short table that suggests links to information that can answer typical questions.
Q: What does Unicode conformance require?
A: Chapter 3, Conformance discusses this in detail. Here's a very informal
version:
- Unicode characters don't fit in 8 bits; deal with it.
- Byte order is only an issue in I/O.
- If you don't know, assume big-endian.
- Loose surrogates have no meaning.
- Neither do U+FFFE and U+FFFF.
- Leave the unassigned codepoints alone.
- It's OK to be ignorant about a character, but not plain wrong.
- Subsets are strictly up to you.
- Canonical equivalence matters.
- Don't garble what you don't understand.
- Process UTF-* by the book.
- Ignore illegal encodings.
- Right-to-left scripts have to go by bidi rules.
[JC]
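Two of the points above, canonical equivalence and byte order in I/O, can be sketched in a few lines of Python. This is an informal illustration, not a conformance test: the function names `canonically_equivalent` and `decode_utf16` are invented for the example, and only the standard `unicodedata` module and built-in codecs are assumed.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes read from a file or the network.

    If a byte order mark is present, let the codec use and strip it.
    Otherwise assume big-endian, as the informal rule above suggests.
    """
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return data.decode("utf-16")
    return data.decode("utf-16-be")

# Precomposed e-acute versus "e" followed by a combining acute accent.
print(canonically_equivalent("\u00e9", "e\u0301"))

# The same letter "A" with a little-endian BOM, and with no BOM at all.
print(decode_utf16(b"\xff\xfeA\x00"))
print(decode_utf16(b"\x00A"))
```

Byte order only shows up in `decode_utf16`, at the I/O boundary; once the text is a `str`, the rest of the program never sees it.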
Q: Can applications simply use unassigned
characters as they wish?
A: No! No conformant Unicode implementation can use the
un-encoded values outside of the private use area.
Only the values in the private use areas (U+E000..U+F8FF,
U+F0000..U+FFFFD, and U+100000..U+10FFFD) are legal for private assignment.
However, this is over 137,000 code points, which should be more than
ample for the vast majority of implementations.
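The three private use ranges and the resulting total can be checked with a short sketch. The names `PRIVATE_USE_RANGES` and `is_private_use` are chosen for this example only.

```python
# Inclusive private use ranges: one in the BMP, plus planes 15 and 16.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)

def is_private_use(cp: int) -> bool:
    """Return True if the code point may be privately assigned."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)

# Count the code points across all three ranges.
total = sum(hi - lo + 1 for lo, hi in PRIVATE_USE_RANGES)
print(total)  # 6400 + 65534 + 65534 = 137468

print(is_private_use(0xE000))   # first private use code point in the BMP
print(is_private_use(0xFFFFE))  # a noncharacter, not private use
```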
Q: Are surrogate characters the same as
supplementary characters?
A: This question shows a common confusion. It is very
important to distinguish surrogate code points (in the range
U+D800..U+DFFF) from supplementary code points (in the completely
different range, U+10000..U+10FFFF). Surrogate code points are reserved
for use, in pairs, in representing supplementary code points in UTF-16.
There are supplementary characters (i.e. encoded characters
represented with a single supplementary code point), but there are not and
will never be surrogate characters (i.e. encoded characters represented
with a single surrogate code point).
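The distinction can be made concrete with a small sketch that classifies code points and computes the UTF-16 surrogate pair for a supplementary code point. The function names are invented for this example; the arithmetic is the standard UTF-16 mapping.

```python
def is_surrogate(cp: int) -> bool:
    """Surrogate code points: reserved for UTF-16, never characters."""
    return 0xD800 <= cp <= 0xDFFF

def is_supplementary(cp: int) -> bool:
    """Supplementary code points: everything above the BMP."""
    return 0x10000 <= cp <= 0x10FFFF

def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Return the (high, low) surrogate code units that represent a
    supplementary code point in UTF-16."""
    if not is_supplementary(cp):
        raise ValueError(f"U+{cp:04X} is not a supplementary code point")
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

Here U+1F600 is a supplementary character, while D83D and DE00 are the two surrogate code units that carry it in UTF-16; neither is a character on its own.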
Q: What can I do if I think there is an error in the Unicode Standard or other specification?
A: Request a correction, clarification or change to the relevant specification by submitting feedback or a formal proposal to the
corresponding technical committee (UTC or
CLDR-TC). See
Public Review Issues for an explanation of how to do this. (The methods
are different for the two committees and the type of change requested.)