Where is my character

If you are trying to find a specific character in Unicode, the first place to look is the code charts. These are in the published Unicode Standard 3.0 (Addison-Wesley Pub Co; ISBN: 0201616335). You can also find characters in the online charts. For each character you will find a code point: a hexadecimal number that is used to represent that character in computer data.
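As a small illustration of what a code point is, the following Python sketch uses the built-in `ord` function and the standard `unicodedata` module. The helper name `describe` is made up for this example, and the names returned depend on the version of the Unicode data bundled with your Python installation.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    code_point = ord(ch)                      # integer value of the character
    name = unicodedata.name(ch, "<unnamed>")  # character name, if one is assigned
    return f"U+{code_point:04X} {name}"

print(describe("A"))
print(describe("\u043F"))  # Cyrillic lowercase pe, the letter discussed under Variant Shapes
```

The hexadecimal number after `U+` is the same value you would look up in the code charts.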

Location

You may not find the character in what you think is the obvious spot. The characters in Unicode are grouped into blocks, but this is only a rough grouping because characters can be categorized many different ways. In particular, punctuation and symbols are applicable across a very wide range of usages and scripts. Even the notion of a script itself is not black and white; text in a given language may make use of characters from multiple scripts.

Thus you may need to look in several locations to find your character. You may find the Character Index in the Unicode Standard helpful for this.

Variant Shapes

You may not find a character simply because the charts do not specify the exact shape; they only provide a representative shape for identification. (The very term character is rather vague, and may be interpreted broadly or narrowly. In this document, we'll use a very broad sense. For more details, see UTR #17: Character Encoding Model.)

For example, a lowercase Cyrillic p could appear as any of the following (the second is customary for italic in Russia, and the third is customary for italic in Serbia):


Similarly, characters are typically written differently within the Japanese, Taiwanese, Korean, and Mainland Chinese typographic traditions. These differences of writing style used for the characters in the Unicode Standard are all within the general range of allowable differences within each typographic tradition. For more details, see the Unicode FAQ.

If you have some question about whether a particular character is the one you are looking for, enter it into a document, select it, and select different fonts. If you are using the character in a document where it is important to have a precise shape, make sure the fonts that could be used to display it are correct.

Characters may also take on different shapes in different contexts. So, for example, the Arabic character hah may have four different basic shapes.

Character	Possible shapes in context

Sequences

The character you are looking for may be represented as a sequence of code points in Unicode. Here are examples of such characters, and the representation:

Character	Code Points	Comments
ch	0063 0068	Slovak, traditional Spanish
tʰ	0074 02B0	Native American languages
x̣	0078 0323
ƛ̓	019B 0313
ą́	00E1 0328	Lithuanian
i̇́	0069 0307 0301	Lithuanian
ト゚	30C8 309A	Ainu in kana transcription
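To see what each sequence in the table is made of, a short Python sketch can turn the hexadecimal values into text and name every code point. The helper `from_code_points` is a hypothetical name introduced here, not part of any library.

```python
import unicodedata

def from_code_points(spec: str) -> str:
    """Turn a string such as '0069 0307 0301' into the text it encodes."""
    return "".join(chr(int(cp, 16)) for cp in spec.split())

# The code point sequences from the table above
rows = ["0063 0068", "0074 02B0", "0078 0323", "019B 0313",
        "00E1 0328", "0069 0307 0301", "30C8 309A"]

for spec in rows:
    text = from_code_points(spec)
    names = " + ".join(unicodedata.name(c) for c in text)
    # len() counts code points, not what a reader perceives as one character
    print(f"{spec:<16} {len(text)} code points: {names}")
```

Note that the length reported is the number of code points, which is why a single perceived character can have a length of two or three.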

Duplicates

In some rare instances, you will find apparently identical characters. In most cases, if not all, this is to maintain compatibility with the original source standards for Unicode: vendor, national, and international character standards in wide usage in 1990. There are also particular shapes of characters that are given separate code points in Unicode, such as the shapes of the Arabic character hah listed above. These were also added to Unicode because of pre-existing standards.
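One way to check whether a code point is such a compatibility shape is to look at its decomposition mapping. The sketch below uses Python's `unicodedata` module on the four presentation forms of hah; the choice of NFKC (the compatibility form described in UTR #15) as the folding step is an assumption of this example, not a recommendation from the text.

```python
import unicodedata

HAH = "\u062D"  # ARABIC LETTER HAH, the ordinary character
# Isolated, final, initial, and medial presentation forms of hah
forms = ["\uFEA1", "\uFEA2", "\uFEA3", "\uFEA4"]

for form in forms:
    # decomposition() returns a tagged mapping such as '<isolated> 062D'
    mapping = unicodedata.decomposition(form)
    folded = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)}: "
          f"{mapping}; folds to hah: {folded == HAH}")
```

In ordinary text you would normally store U+062D and let the rendering system choose the contextual shape.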

In a few cases, Unicode separates particular characters on the basis of strong differences in properties. For example, the following characters are distinguished on this basis, even though the range of possible shapes is the same.

In those rare cases where this occurs, to decide which character to use you should consult the text of the Unicode Standard. Programmers can also look at the Unicode Character Database.
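For programmers, a quick way to compare the properties of two similar-looking characters is Python's `unicodedata` module, which exposes part of the Unicode Character Database. The pair used below (the exclamation mark and the retroflex click letter) is chosen purely for illustration and is an assumption of this sketch; it is not necessarily a pair shown in the original list. The helper name `properties` is also hypothetical.

```python
import unicodedata

def properties(ch: str) -> tuple:
    """Return (name, general category, bidirectional class) for one character."""
    return (unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.bidirectional(ch))

# Illustrative pair only: similar shapes, different properties
pair = ["\u0021", "\u01C3"]

for ch in pair:
    name, category, bidi = properties(ch)
    print(f"U+{ord(ch):04X} {name}: category={category}, bidi={bidi}")
```

Here one character is punctuation and the other is a letter, which affects behavior such as word selection and identifier parsing.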

Submissions

Finally, your character may not yet be encoded in Unicode. There is a well-defined submission process for new characters. This process verifies that the proposed character is in fact a candidate for encoding. In some cases, this process may not be straightforward: for example, Egyptian hieroglyphs have not yet been encoded because there is not yet general agreement on the exact repertoire of characters.

Because the Unicode Standard and ISO 10646 are synchronized in character codes, both organizations need to agree to the encoding of new characters. While this generally results in better quality, it can take several years before a new character is accepted into the standard, and some time beyond that before it is fully supported in products.

Normalization

For compatibility with pre-existing standards, there are characters that are equivalently represented either as sequences of code points or as a single code point (called a composite character). For example, the i with two dots in naïve could be represented either as the sequence i + diaeresis (0069 0308) or as the composite character i-diaeresis (00EF).

There are other cases where the order of two combining characters does not matter. For example, the pair of combining characters acute and dot-below can occur with either one first; both alternate orders are equivalent.
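The equivalence of the two orders can be checked with Python's `unicodedata.normalize`. This is a minimal sketch; the base letter `a` is an arbitrary choice for the example.

```python
import unicodedata

ACUTE, DOT_BELOW = "\u0301", "\u0323"
first = "a" + ACUTE + DOT_BELOW    # acute first
second = "a" + DOT_BELOW + ACUTE   # dot-below first

# The raw code point sequences are different strings...
print(first == second)

# ...but the marks have different combining classes, so normalization
# sorts them into one canonical order.
print(unicodedata.combining(ACUTE), unicodedata.combining(DOT_BELOW))

same = unicodedata.normalize("NFD", first) == unicodedata.normalize("NFD", second)
print(same)
```

Comparing strings only after normalizing both sides is the usual way to make such equivalent sequences match.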

Due to the requirements for uniqueness, especially on the Internet, Unicode provides for a unique format, called Form C. This format always picks one of the equivalent code points (or sequences of code points) and not the other. It also picks a specific order where there are alternatives. For more information, see UTR #15: Unicode Normalization Forms.
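As a sketch of what Form C does in practice, Python exposes it as the `"NFC"` argument to `unicodedata.normalize`. The wrapper name `to_form_c` is hypothetical and exists only to keep the example readable.

```python
import unicodedata

def to_form_c(text: str) -> str:
    """Convert text to Normalization Form C (hypothetical convenience wrapper)."""
    return unicodedata.normalize("NFC", text)

decomposed = "nai\u0308ve"  # i followed by combining diaeresis (0069 0308)
composed = "na\u00EFve"     # single code point 00EF

print(decomposed == composed)                        # different code points
print(to_form_c(decomposed) == to_form_c(composed))  # same text in Form C
print(len(decomposed), len(to_form_c(decomposed)))   # 6 code points become 5

# A sequence with no single-code-point equivalent stays a sequence.
x_dot_below = "x\u0323"
print(to_form_c(x_dot_below) == x_dot_below)
```

Applying the same normalization to both stored and incoming text lets equivalent spellings compare as equal.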

However, programs that require uniqueness also require forward compatibility: programs all over the web must be able to depend on the unique format not changing over time. That means that characters that are currently representable as sequences will always stay representable as sequences. Even if a composite character were to be introduced later, it would not be used in Form C.