
Display Problems?

During an early period in the history of the Unicode® Standard, when software products were starting to support Unicode text, it was often the case that products supported some Unicode characters and scripts but not others. This created problems for users. For instance, people who wanted to create Web content in different languages using Unicode characters couldn’t be certain that the browser used to read the content would be able to display it legibly. As a result, there was a broad need for tips on how to diagnose and solve display problems.

Today, the situation is much better. Major operating systems and browsers have broad support for Unicode characters and scripts, and legible display of Unicode text is not the widespread problem that it was in the early days.

There are three kinds of text display problem that might still occur in modern software products:

  • Lack of font support
  • Incorrect shaping
  • Incorrect characters

Other special considerations apply to the display of Unicode emoji, but they are not covered here. For more information regarding emoji, see FAQ: Emoji and Pictographs.

Lack of Font Support

Most operating systems include fonts that provide extensive coverage of Unicode characters, and most applications know how to make use of the system fonts. There may be gaps, however.

When Unicode text is displayed but some characters in the text lack font support, the typical symptom is the appearance of special “character not supported” glyphs, often called “tofu” glyphs. (Font vendors often refer to such glyphs as “.notdef” glyphs.) Often, this will look like a white square box (like a piece of tofu), or a box containing a question mark or diagonals. Some applications generate a fallback glyph that shows the code point for the character.

[Figure: symbols indicating unsupported characters]

Other symbols might also be used. Sometimes, there might just be blank space.

When this occurs, the underlying issue is most likely to be one of the following:

  • The product might not yet have been updated to support characters added in the most recent versions of the Unicode Standard.
  • An operating system might have font support, but an application running on that OS might have its own font selection or fallback logic that is not up to date with what’s available in the latest version of the OS.
  • Due to limited storage (especially on mobile devices) or other such factors, a vendor might decide not to include font support for less-frequently-used characters.
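
When tracking down which characters are affected, it can help to list the code points in the text, much as some fallback glyphs do. The following is a minimal sketch using Python's standard `unicodedata` module; the function name `describe_characters` is hypothetical, chosen for illustration. Note that the names come from the Unicode database bundled with the Python build in use, which may itself lag behind the latest version of the Unicode Standard.

```python
import unicodedata


def describe_characters(text):
    """Return a (code point label, character name) pair for each character."""
    rows = []
    for ch in text:
        label = f"U+{ord(ch):04X}"
        # unicodedata.name() raises ValueError for characters that have no
        # name in this build's Unicode database, so supply a default instead.
        name = unicodedata.name(ch, "<no name in this Unicode database>")
        rows.append((label, name))
    return rows


if __name__ == "__main__":
    # The database version shows how recent this build's character data is.
    print("Unicode database version:", unicodedata.unidata_version)
    for label, name in describe_characters("A\u0906"):
        print(label, name)
```

With the code points in hand, you can check whether a given font or product claims to support them, or include them in a report to the vendor.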

If you encounter this issue and have access to a font that does support the characters in the text, you may be able to work around the issue if the application provides a way for you to indicate that the text should be displayed with that font. In apps that support text editing, there will usually be a way to select the font used to display the text. In some cases, the app might not accept the font you select; if that happens, contact the app vendor for help.

In apps that are not text editors, getting your custom font used might require tailoring the font fallback logic used by the app. That is not a commonly available feature. Contact the vendor to see if that is possible, or to report the gap in font support in their app.

If this issue occurs with Web content, it is likely that the content author has assumed that an appropriate font can be supplied by the browser or by the host OS the browser is running on. A better approach is for the content to use CSS Web fonts to control what fonts are used to display the content. Contact the content author to suggest that option.

Incorrect Shaping

In some situations, text might display with recognizable characters of some script, but not with the expected glyph forms, or without correct positioning of marks. For example, within Arabic-script text, you might see a character that isn’t connecting to another character as expected. Or in an Indic-script text, you might see a conjunct form, but not the expected conjunct form.

These symptoms can be due to one of three issues: incorrect encoding of text, a limitation or bug in software, or a limitation or bug in the font.

Incorrect Encoding of the Text

The content might not be using the appropriate Unicode characters for the text, or it might not be using appropriate character sequences to represent certain text elements. The text may look correct in some specific context (some specific software with a specific font), but is not represented in an interoperable way that would work as intended in other contexts.

For example, if Arabic-script text contains characters from the Arabic Presentation Forms-A or Arabic Presentation Forms-B blocks, those characters would not display with different connecting forms in different word contexts. The characters in those blocks are for legacy or special-use purposes only and should not normally be used in Arabic-script text.
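
One way to check for this situation is to scan the text for code points in those two blocks. The sketch below assumes the block ranges U+FB50..U+FDFF (Arabic Presentation Forms-A) and U+FE70..U+FEFF (Arabic Presentation Forms-B); the function name `find_presentation_forms` is hypothetical. It only reports where such characters occur and does not attempt to repair the text.

```python
# Block ranges for the two Arabic presentation-form blocks.
PRESENTATION_FORM_BLOCKS = {
    "Arabic Presentation Forms-A": (0xFB50, 0xFDFF),
    "Arabic Presentation Forms-B": (0xFE70, 0xFEFF),
}


def find_presentation_forms(text):
    """Return (index, code point label, block name) for each matching character."""
    hits = []
    for index, ch in enumerate(text):
        code_point = ord(ch)
        for block, (first, last) in PRESENTATION_FORM_BLOCKS.items():
            if first <= code_point <= last:
                hits.append((index, f"U+{code_point:04X}", block))
    return hits


if __name__ == "__main__":
    # U+0627 is an ordinary Arabic letter; U+FE8D is a presentation form.
    print(find_presentation_forms("\u0627\uFE8D"))
```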

Another common situation involves Indic scripts. Some characters, such as vowel letters, have an appearance that’s like a combination of other characters, but these are not considered equivalent in Unicode. For instance, U+0906 “आ” appears to be like a combination of U+0905 “अ” plus U+093E “ा”. However, that sequence is not equivalent and, in fact, is explicitly documented as not to be used. (See Table 12-1, Devanagari Vowel Letters.) Even so, some Devanagari-script content may incorrectly be using such sequences to represent the vowel letters. And some software or font implementations may intentionally be displaying the sequence with a different appearance from the vowel letter to avoid potential security issues.
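
The lack of equivalence can be confirmed with Unicode normalization: two canonically equivalent strings normalize to the same form, whereas the vowel letter and the look-alike sequence do not. The following is a rough sketch using Python's standard `unicodedata` module; the helper names are hypothetical, and only the single sequence mentioned above is checked.

```python
import unicodedata

VOWEL_LETTER_AA = "\u0906"            # the vowel letter
LOOKALIKE_SEQUENCE = "\u0905\u093E"   # letter A followed by vowel sign AA


def canonically_equivalent(a, b):
    """True if two strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def contains_lookalike(text):
    """True if the text contains the discouraged look-alike sequence."""
    return LOOKALIKE_SEQUENCE in text


if __name__ == "__main__":
    print(canonically_equivalent(VOWEL_LETTER_AA, LOOKALIKE_SEQUENCE))
    print(contains_lookalike("\u0905\u093E\u092E"))
```

Because normalization does not unify the two representations, a search or comparison for one will not match the other, which is part of why the sequence is a problem for interoperability.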

If you suspect the encoded representation used in the text is the problem, contact the content author.

Software Limitation or Bug

Many scripts have complex rendering behaviours that require specific support in a rendering or “shaping” engine. An app or operating system might be able to display a default form of each character in its logical order, but not have the special logic needed to correctly shape the text so that it appears as expected for that script. Symptoms can include the following:

  • Character sequences are not displayed in the expected direction for that script (for example, left-to-right rather than right-to-left).
  • Characters of a cursive-connecting script display with disconnected glyphs.
  • Within syllable clusters, characters appear in the wrong order.
  • Within syllable clusters, marks appear on the wrong base glyphs.
  • Certain character sequences that are expected to display with a special form instead display with a different form or with the default glyphs for each character.

These symptoms point to a lack of correct shaping support in software. There might also be font issues involved, as discussed below. If the same symptoms occur when using different fonts from different vendors, that even more strongly suggests a software issue.
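
For the direction symptom in particular, it can help to confirm what direction the characters are expected to have. The sketch below counts characters by their strong bidirectional class using Python's standard `unicodedata.bidirectional`; the function name `strong_direction_counts` is hypothetical. It only inspects character properties and does not implement the Unicode Bidirectional Algorithm, so treat it as a diagnostic aid, not as a rendering check.

```python
import unicodedata

# Strong right-to-left classes: "R" (right-to-left) and "AL" (Arabic letter).
RTL_CLASSES = {"R", "AL"}


def strong_direction_counts(text):
    """Count characters with strong left-to-right or right-to-left direction."""
    counts = {"ltr": 0, "rtl": 0}
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            counts["rtl"] += 1
        elif bidi_class == "L":
            counts["ltr"] += 1
        # Digits, spaces, and punctuation have weak or neutral classes
        # and are deliberately not counted.
    return counts


if __name__ == "__main__":
    print(strong_direction_counts("\u0633\u0644\u0627\u0645"))
```

If the text consists mostly of right-to-left characters but is displayed left-to-right, that supports the conclusion that the software lacks bidirectional support.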

This could be a known limitation in the software: that version might not yet have shaping support for characters added in the most recent versions of the Unicode Standard, or the vendor might not yet have implemented support for that script. On the other hand, the vendor might have added support for the script but with proprietary logic that doesn’t follow Unicode specifications. Or, the software might simply have a bug.

If you suspect a software limitation or bug is the cause, contact the software vendor.

For particularly complex scripts, it’s also possible that the Unicode specifications for that script are incomplete. That could lead to different software implementations displaying the same character sequences in different ways, because there isn’t a complete specification for how certain text elements should be encoded, or how the encoded sequences should be displayed. If that’s the case, the Unicode Technical Committee can consider proposals to extend the specifications for that script.

Font Limitation or Bug

For scripts that have complex rendering behaviours, fonts need to be correctly implemented with certain layout data that determines what glyphs will be displayed and how they will be positioned. (This is in addition to software needing to have appropriate “shaping engine” support.) Typical symptoms include the following:

  • Characters of a cursive-connecting script do not display with the correct connecting form.
  • Marks are not correctly positioned on the base glyph or they display over spaces.
  • Certain character sequences that are expected to display with a special form instead display with a different form or with the default glyphs for each character.

It’s possible the font has an incomplete implementation. For example, the font developer may have added default glyphs for the characters of a script, matching what they see in the Unicode code charts, but not added the additional glyphs required for correct display of the script.

If you suspect a font issue, contact the font vendor, or try using a different font.

Incorrect Characters

Occasionally, you may see garbled text with incorrect characters. In some cases, you might see several occurrences of “�” or another symbol such as “?”. Garbled text of this kind is sometimes referred to as “mojibake”. These symptoms suggest an encoding error: most likely, the text went through an incorrect encoding conversion.

The most likely cause is that the text was, at some earlier point, encoded in a legacy encoding (or character set) but was either not labelled or labelled with incorrect metadata, so that the exact encoding was not indicated.

For example, if a file containing the text “Русский” was encoded using the Windows-1251 encoding but was not labelled as such (with metadata contained inside the file or in the repository holding the file), then an app reading that file might assume a different encoding and interpret it as different characters. For instance, the software might assume Windows-1252 encoding and then interpret the text as “Ðóññêèé”. Or, if other heuristics suggested that the text was using Big5 (Traditional Chinese) encoding, the app would interpret the text as “唒嚭膱�”.
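
The Windows-1252 case can be reproduced with a few lines of Python, which may help when trying to guess what went wrong with a particular piece of garbled text. This is a minimal sketch: the helper names `misdecode` and `repair` are hypothetical, and `cp1251` and `cp1252` are Python's codec names for Windows-1251 and Windows-1252.

```python
ORIGINAL = "Русский"


def misdecode(text, actual_encoding, assumed_encoding):
    """Encode text with one encoding, then decode the bytes as another."""
    raw = text.encode(actual_encoding)
    # errors="replace" substitutes U+FFFD for bytes that are not valid
    # in the assumed encoding, instead of raising an exception.
    return raw.decode(assumed_encoding, errors="replace")


def repair(garbled, assumed_encoding, actual_encoding):
    """Reverse a wrong decode; this works only if no bytes were lost."""
    return garbled.encode(assumed_encoding).decode(actual_encoding)


garbled = misdecode(ORIGINAL, "cp1251", "cp1252")
print(garbled)                              # Ðóññêèé
print(repair(garbled, "cp1252", "cp1251"))  # Русский
```

The repair step succeeds here because every byte survived the wrong decode. Once bytes have been replaced with “�”, the original text can no longer be recovered from the garbled copy alone.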

Good text encoding practice has always required that the encoding used for content be explicitly declared in metadata. Today, best practice is that text be encoded using a Unicode encoding form such as UTF-8.
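
In code, this practice amounts to naming the encoding explicitly and decoding strictly, so that a mismatch is detected instead of silently producing garbled text. The following sketch assumes the content is declared as UTF-8; the function name `decode_declared` is hypothetical.

```python
def decode_declared(raw, declared_encoding="utf-8"):
    """Decode bytes strictly; return None if they do not match the declaration."""
    try:
        # errors="strict" raises on any byte sequence that is invalid
        # in the declared encoding, rather than guessing or substituting.
        return raw.decode(declared_encoding, errors="strict")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(decode_declared("Русский".encode("utf-8")))
    # Windows-1251 bytes are not valid UTF-8 here, so this prints None.
    print(decode_declared("Русский".encode("cp1251")))
```

A strict decode cannot catch every mislabelling, since some byte sequences are valid in more than one encoding, but it does catch legacy-encoded text that has been wrongly declared as UTF-8 in most cases.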

If you suspect an encoding or encoding conversion issue is the cause of the display problem, then contact the content author. Or, if the content is maintained in some repository, contact the agency maintaining that repository.