Where is my Character?
Very rough draft!
If you are trying to find a specific character in Unicode, the first place to go is to the code charts. These are in the published Unicode Standard 3.0 (Addison-Wesley Pub Co; ISBN: 0201616335). You can also find characters in the online charts. For each character you will find a code point: a hexadecimal number that is used to represent that character in computer data.
You may not find the character in what you think is the obvious spot. The characters in Unicode are grouped into blocks, but this is only a rough grouping because characters can be categorized many different ways. In particular, punctuation and symbols are applicable across a very wide range of usages and scripts. Even the notion of a script itself is not black and white; text in a given language may make use of characters from multiple scripts.
Thus you may need to look in several locations to find your character. You may find the Character Index in the Unicode Standard helpful for this.
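For readers who prefer to search programmatically, the lookup can be sketched in Python. This is only an illustration: it assumes Python's standard `unicodedata` module, whose data follows the Unicode version bundled with the interpreter rather than Unicode 3.0, and `describe` is a hypothetical helper name.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# Look a character up by its formal Unicode name ...
pe = unicodedata.lookup("CYRILLIC SMALL LETTER PE")
print(describe(pe))

# ... or go from a hexadecimal code point back to the character.
hah = chr(0x062D)
print(describe(hah))
```

The name lookup requires the exact formal name, so it complements the Character Index rather than replacing it.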
You may also overlook a character because the charts do not specify its exact shape; they provide only a representative shape for identification. (The very term character is rather vague, and may be interpreted broadly or narrowly. In this document, we'll use a very broad sense. For more details, see UTR #17: Character Encoding Model.)
For example, a lowercase Cyrillic p could appear as any of the following (the second is customary for italic in Russia, and the third is customary for italic in Serbia):
[Ed note: I am trying different ways of "boxing" the characters, and am using different fonts. This will be made uniform in the final version.]

Similarly, characters are typically written differently within the Japanese, Taiwanese, Korean, and Mainland Chinese typographic traditions. These differences in writing style all fall within the general range of allowable differences within each typographic tradition. For more details, see the Unicode FAQ.
If you have some question about whether a particular character is the one you are looking for, enter it into a document, select it, and select different fonts. If you are using the character in a document where it is important to have a precise shape, make sure the fonts that could be used to display it are correct.
Characters may also take on different shapes in different contexts. So, for example, the Arabic character hah may have four different basic shapes.
[Table: the Arabic character hah and its possible shapes in context. The glyph images are not reproduced here.]
The character you are looking for may be represented as a sequence of code points in Unicode. Here are examples of such characters, and the representation:
| Character | Code Points | Comments |
|---|---|---|
| ch | 0063 0068 | Slovak, traditional Spanish |
| tʰ | 0074 02B0 | Native American languages |
| x̣ | 0078 0323 | |
| ƛ̓ | 019B 0313 | |
| ą́ | 00E1 0328 | Lithuanian |
| i̇́ | 0069 0307 0301 | |
| ト゚ | 30C8 309A | Ainu in kana transcription |
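To see how such sequences look in data, the following Python sketch turns a space-separated list of hexadecimal code points into a string. It assumes Python's standard `unicodedata` module; `from_code_points` is a hypothetical helper name introduced only for this illustration.

```python
import unicodedata

def from_code_points(spec: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(cp, 16)) for cp in spec.split())

for spec in ["0063 0068", "0078 0323", "0069 0307 0301", "30C8 309A"]:
    text = from_code_points(spec)
    # One user-perceived character, but several code points in storage.
    names = [unicodedata.name(c) for c in text]
    print(spec, "->", len(text), "code points:", names)
```

A program that counts or deletes "characters" one code point at a time would split these sequences, which is why they need to be handled as units.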
In some rare instances, you will find apparently identical characters. In most cases, if not all, this is to maintain compatibility with the original source standards for Unicode: vendor, national, and international character standards in wide usage in 1990. There are also particular shapes of characters that are given separate code points in Unicode, such as the shapes of the Arabic character hah listed above. These were also added to Unicode because of pre-existing standards.
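The relationship between the separately encoded shapes and the basic character can be inspected with a short Python sketch. The code points used here (U+062D for hah and U+FEA1 through U+FEA4 for its presentation forms) are not given in the text above, and the use of compatibility normalization (NFKC, described in UTR #15) to fold the shapes back is one possible approach, not a recommendation from this document.

```python
import unicodedata

HAH = "\u062D"  # ARABIC LETTER HAH

# Isolated, final, initial, and medial presentation forms of hah.
presentation_forms = [chr(cp) for cp in range(0xFEA1, 0xFEA5)]

for form in presentation_forms:
    # Compatibility normalization maps each shape to the basic letter.
    folded = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(folded):04X}")
```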
In a few cases, Unicode separates particular characters on the basis of strong differences in properties. For example, the following characters are distinguished on this basis, even though the range of possible shapes is the same.
| Character | Description |
|---|---|
| ʹ (02B9) | Modifier letter prime. Treated as a letter. Used to transcribe the "soft" sign in Cyrillic. |
| ′ (2032) | Prime. Treated as a punctuation mark or symbol. Used in mathematics, and as a symbol for minutes (fractions of degrees). |
In those rare cases, consult the text of the Unicode Standard to decide which character to use. Programmers can also look at the Unicode Character Database.
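As a rough illustration of what a programmer might look for, the Python sketch below reads the General Category property for the modifier letter prime and the prime. It assumes Python's standard `unicodedata` module, which exposes part of the Unicode Character Database for the Unicode version bundled with the interpreter; the code points U+02B9 and U+2032 and the helper name `properties` are supplied for this example.

```python
import unicodedata

def properties(ch: str) -> tuple[str, str]:
    """Return (name, general category) for one character."""
    return unicodedata.name(ch), unicodedata.category(ch)

modifier_prime = "\u02B9"
prime = "\u2032"

# "Lm" is a modifier letter; "Po" is other punctuation.
print(properties(modifier_prime))
print(properties(prime))
```

A category starting with `L` means the character behaves as a letter, for example in word-boundary detection, while `P` and `S` categories mark punctuation and symbols.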
Finally, your character may not yet be encoded in Unicode. There is a well-defined submission process for new characters. This process verifies that the proposed character is in fact a candidate for encoding. In some cases, this process may not be straightforward: for example, Egyptian hieroglyphs have not yet been encoded because there is not yet general agreement on the exact repertoire of characters.
Because the Unicode standard and ISO 10646 are synchronized in character codes, both organizations need to agree to the encoding of new characters. While this generally ends up with better quality, it can require several years before a new character is accepted into the standard, and some time beyond that before it is fully supported in products.
For compatibility with pre-existing standards, there are characters that are equivalently represented either as a sequence of code points or as a single code point (called a composite character). For example, the i with two dots in naïve could be represented either as the sequence i + diaeresis (0069 0308) or as the composite character i-diaeresis (00EF).
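The two spellings of naïve can be compared in a few lines of Python. This sketch assumes the standard `unicodedata.normalize` function; the variable names are illustrative only.

```python
import unicodedata

sequence = "nai\u0308ve"   # i followed by combining diaeresis
composite = "na\u00EFve"   # single composite code point

# The raw strings differ, code point by code point ...
print(sequence == composite, len(sequence), len(composite))

# ... but they are canonically equivalent: composing one yields the
# other, and decomposing reverses it.
print(unicodedata.normalize("NFC", sequence) == composite)
print(unicodedata.normalize("NFD", composite) == sequence)
```

A plain string comparison therefore reports a mismatch unless both sides are first brought into the same normalization form.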
There are other cases where the order of two combining characters does not matter. For example, the pair of combining characters acute and dot-below can occur with either one first; both alternate orders are equivalent.
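The acute and dot-below case can be checked with the Python sketch below, again assuming the standard `unicodedata` module. The two marks can be reordered freely because they have different canonical combining classes (one attaches above the base, the other below); marks with the same class are not interchangeable in this way. The base letter `a` is an arbitrary choice for the example.

```python
import unicodedata

ACUTE = "\u0301"      # combining acute accent, attaches above
DOT_BELOW = "\u0323"  # combining dot below, attaches below

acute_first = "a" + ACUTE + DOT_BELOW
dot_first = "a" + DOT_BELOW + ACUTE

# Different combining classes mean the marks do not interact.
print(unicodedata.combining(ACUTE), unicodedata.combining(DOT_BELOW))

# Normalization puts both spellings into one canonical order.
same = (unicodedata.normalize("NFC", acute_first)
        == unicodedata.normalize("NFC", dot_first))
print(acute_first == dot_first, same)
```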
Due to the requirements for uniqueness, especially on the Internet, Unicode provides a unique format, called Form C. This format always picks one of the equivalent representations (a single code point or a sequence of code points) and not the other. It also picks a specific order where there are alternatives. For more information, see UTR #15: Unicode Normalization Forms.
However, programs that require uniqueness also require forward compatibility: programs all over the web must be able to depend on the unique format not changing over time. That means that characters that are currently representable as sequences will always stay representable as sequences. Even if a composite character were to be introduced later, it would not be used in Form C.
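A sequence with no composite counterpart is left alone by Form C, which a short Python sketch can confirm. It assumes the standard `unicodedata.normalize` function and uses x with dot below (0078 0323) as the example; the result reflects the Unicode version bundled with the interpreter.

```python
import unicodedata

x_dot_below = "x\u0323"  # x followed by combining dot below

form_c = unicodedata.normalize("NFC", x_dot_below)

# Form C keeps the sequence as a sequence: still two code points.
print(len(form_c), form_c == x_dot_below)
```

Under the forward-compatibility requirement described above, this output should stay the same in later versions of the standard.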
Note: Simply because a character may have a different sorting order does not qualify it to be given a separate code point in Unicode. For more information, see UTR #10: Unicode Collation Algorithm. This is true whether the character is represented by a single code point or a sequence.
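To illustrate why sorting does not need a separate code point, the Python sketch below sorts words so that the sequence ch is treated as one unit placed after h, as in Slovak. This is a toy ordering written for this example, not the Unicode Collation Algorithm: the `ALPHABET` list ignores diacritics, and `sort_key` is a hypothetical helper.

```python
# Simplified alphabet in which "ch" is a single sorting unit after "h".
ALPHABET = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k",
            "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w",
            "x", "y", "z"]
RANK = {unit: i for i, unit in enumerate(ALPHABET)}

def sort_key(word: str) -> list[int]:
    """Map a word to a list of ranks, reading "ch" as one unit."""
    w = word.lower()
    key, i = [], 0
    while i < len(w):
        if w.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            # Anything outside the toy alphabet sorts last.
            key.append(RANK.get(w[i], len(ALPHABET)))
            i += 1
    return key

words = ["chlieb", "cena", "hrad", "dom", "ihla"]
print(sorted(words))                 # plain code point order
print(sorted(words, key=sort_key))   # "ch" sorted after "h"
```

The stored text is the same two code points, 0063 0068, in both cases; only the comparison rule changes.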