Middle Eastern Scripts and Languages
Q: How are Arabic letters represented in Unicode?
A: In normal writing, the Arabic script employs the consonantal base letters only and omits the vowels. When vowels are written,
combining marks that represent the vowels are applied to the base letter.
As the Arabic script has been adapted for writing other languages, diacritical marks known as ijam have often been added to the "skeletal" consonantal
letterforms to distinguish additional sounds (or letters) as needed. Creating new letters from a base letter plus diacritics is
an ongoing process in language communities today. As new combinations are attested, the new letterforms are
encoded as a unit in Unicode. The ijam diacritical marks are not encoded separately in the Unicode Standard. [LM]
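As a rough illustration of the distinction above, the following Python sketch inspects character properties with the standard unicodedata module. The helper name describe and the sample strings are chosen here for illustration only. It shows a vowel mark stored as a separate combining character after its base letter, while a letter distinguished by ijam (peh, U+067E) is a single code point.

```python
import unicodedata

# Beh (U+0628) followed by the vowel mark fatha (U+064E): two code points.
BEH_WITH_FATHA = "\u0628\u064E"

# Peh (U+067E): the beh skeleton with three dots below, encoded as one unit.
PEH = "\u067E"


def describe(text):
    """Return (code point, general category, combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))
        for ch in text
    ]


# The base letter is category Lo; the vowel mark is a nonspacing mark (Mn)
# with a nonzero combining class.
print(describe(BEH_WITH_FATHA))

# The ijam dots are part of the letter itself, so only one entry is listed.
print(describe(PEH))
```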
Q: Why aren't Arabic ijam diacritical marks separately encoded?
A: The reasons for encoding the new letterforms as a unit and not encoding combining diacritical marks separately are historical,
due to the evolution of the Unicode Standard. Although vowels, Koranic marks, and other pronunciation marks have been encoded as combining
marks, the consonantal base letters have consistently been encoded in Unicode as a unit. To change this practice would open the door to
multiple representations for the same letters.
The Unicode Standard provides a unique normalized representation for text, even when both precomposed and decomposed forms exist.
This model is used for Latin and other scripts. However, to provide stability for the wide range of products that use Unicode, the normalized
forms cannot change. For this reason, decomposed characters for Arabic cannot be added without creating duplicate representations, which would cause serious implementation
problems, including security issues. Thus, the decision was made to keep Arabic base letterforms encoded as indivisible units.
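The contrast with Latin can be checked with a small sketch, again assuming Python's standard unicodedata module. The helper name nfd_code_points is hypothetical. A precomposed Latin letter decomposes under NFD into a base letter plus a combining mark, whereas Arabic letters formed with ijam stay as single code points.

```python
import unicodedata


def nfd_code_points(text):
    """Return the code points of text after canonical decomposition (NFD)."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


# Latin e with acute splits into base letter plus combining acute accent.
print(nfd_code_points("\u00E9"))  # ['U+0065', 'U+0301']

# Peh (beh skeleton with three dots below) has no decomposition.
print(nfd_code_points("\u067E"))  # ['U+067E']
```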
Q: Why are Arabic presentation forms encoded?
A: Arabic presentation forms are encoded for compatibility only, and are not recommended for use in regular Arabic text. Nor are they intended as a guide to the development of appropriate Arabic fonts. Arabic font designers should do whatever is necessary to add the full range of glyphic support to the fonts they develop.
Q: Can one use the Arabic presentation forms in a data file?
A: It is not recommended because it does not guarantee data integrity and interoperability. Data files should include only the Arabic letters in the Arabic block (U+0600..U+06FF) or the Arabic Supplement block (U+0750..U+077F) or the Arabic Extended-A block (U+08A0..U+08FF). Also see Presentation Forms.
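One possible way to audit a data file for presentation forms is sketched below in Python. It assumes that the positional compatibility tags in a character's decomposition mapping (isolated, initial, medial, final) identify the presentation forms. The function names nominal_form, find_presentation_forms, and to_nominal are hypothetical, and this is not an official tool.

```python
import unicodedata

# Compatibility decomposition tags assumed here to mark positional forms.
POSITIONAL_TAGS = {"<isolated>", "<initial>", "<medial>", "<final>"}


def nominal_form(ch):
    """Return the nominal character(s) for a presentation form, else None."""
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] in POSITIONAL_TAGS:
        return "".join(chr(int(cp, 16)) for cp in parts[1:])
    return None


def find_presentation_forms(text):
    """List (index, code point) for every presentation form found in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if nominal_form(ch) is not None
    ]


def to_nominal(text):
    """Replace presentation forms with nominal letters; keep everything else."""
    return "".join(nominal_form(ch) or ch for ch in text)


# Beh initial form, alef final form, beh isolated form.
legacy = "\uFE91\uFE8E\uFE8F"
print(find_presentation_forms(legacy))
print([f"U+{ord(ch):04X}" for ch in to_nominal(legacy)])
```

Applying NFKC normalization with unicodedata.normalize would also replace these characters, but it changes unrelated compatibility characters as well, which is why this sketch touches only the positional forms.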
Q: Unicode includes presentation forms for Arabic, Urdu and Persian letters,
but not for letters added for Jawi (Malay written in the Arabic script). Will presentation forms be added for Jawi?
A: No, they won't. Arabic presentation forms for isolated, medial, initial, and final positional
variants were added to the standard primarily for compatibility with some older, legacy character sets that
encoded presentation forms directly. That style of text encoding is not encouraged by the Unicode Standard.
Instead, all Arabic text (including Jawi) should be represented using the Arabic letters in the Arabic
block (U+0600..U+06FF) or the Arabic Supplement block (U+0750..U+077F) or the Arabic Extended-A block (U+08A0..U+08FF).
Positional variants of Arabic letters are handled by analyzing context when rendering text.
Specific glyphs for each position (isolated, medial, initial, and final—or just isolated and final,
depending on the letter) need to be defined properly in the font, of course, but no separate
character code is required for that.
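The relationship between one letter and its positional variants can be illustrated with the following Python sketch, which scans the two Arabic presentation forms blocks using the standard unicodedata module. The function name positional_forms is hypothetical. It lists which legacy presentation-form code points map back to a given nominal letter; a renderer needs none of them, since it picks the glyph from context.

```python
import unicodedata

POSITIONS = ("isolated", "final", "initial", "medial")

# Arabic Presentation Forms-A and Forms-B (U+FEFF itself is not a letter).
PRESENTATION_RANGES = (range(0xFB50, 0xFE00), range(0xFE70, 0xFEFF))


def positional_forms(letter):
    """Map position name to the presentation-form code point for one letter."""
    target = f"{ord(letter):04X}"
    forms = {}
    for block in PRESENTATION_RANGES:
        for cp in block:
            parts = unicodedata.decomposition(chr(cp)).split()
            # Keep only single-letter forms such as "<final> 0628".
            if len(parts) == 2 and parts[1] == target:
                position = parts[0].strip("<>")
                if position in POSITIONS:
                    forms[position] = f"U+{cp:04X}"
    return forms


print(positional_forms("\u0628"))  # beh: four positional forms
print(positional_forms("\u0627"))  # alef: isolated and final only
print(positional_forms("\u06A0"))  # Jawi nga: no presentation forms encoded
```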
Q: I'm having trouble identifying the correct Unicode characters for some Jawi letters. Can you help?
A: Sure. Use U+06A0 for Jawi nga, U+06BD for Jawi nya, U+0762 for Jawi ga, and U+06CF for Jawi vi.
Note that U+0762 for ga takes the shaping of the Persian/Urdu gaf (= U+06AF), but with a dot above, instead of a line
above the letter skeleton. The letter U+06AC (a kaf with a dot above) is also sometimes used for the Jawi ga, but is not
the preferred representation.
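For convenience, the code points named above can be collected in a small lookup table. The Python sketch below is only an illustration; the names JAWI_LETTERS, NOT_PREFERRED_GA, and describe_jawi are hypothetical. It prints each letter's formal Unicode character name, which can help confirm that the intended character was chosen.

```python
import unicodedata

# Code points recommended above for the Jawi letters.
JAWI_LETTERS = {
    "nga": "\u06A0",
    "nya": "\u06BD",
    "ga": "\u0762",
    "vi": "\u06CF",
}

# Kaf with dot above: sometimes seen for Jawi ga, but not preferred.
NOT_PREFERRED_GA = "\u06AC"


def describe_jawi():
    """Return {letter name: (code point, Unicode character name)}."""
    return {
        name: (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for name, ch in JAWI_LETTERS.items()
    }


for name, (code_point, uname) in describe_jawi().items():
    print(f"{name}: {code_point} {uname}")

print(f"not preferred for ga: U+{ord(NOT_PREFERRED_GA):04X}",
      unicodedata.name(NOT_PREFERRED_GA))
```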