Re: "Missing character" glyph

From: Martin Kochanski (unicode@cardbox.net)
Date: Thu Aug 01 2002 - 06:11:08 EDT


The responses from this mailing list have made me re-think the problem and propose a possible solution.

The point about missing characters (more accurately, "unrendered characters") is that different fonts (more accurately, different combinations of font plus rendering system) display them in different ways. I have seen hollow squares and rectangles; filled rectangles; small diamond-shaped bullets; and question marks.

Unrendered characters will become more noticeable as Unicode becomes more widespread and computing increasingly transcends linguistic and script boundaries. On the whole, with existing 7-bit and 8-bit national standards, a user in any particular country will find that any character that can be encoded can also be displayed, so the distinction between encodable and displayable characters simply does not need to occur to an ordinary user. But someone using Unicode to view (for example) Web pages from another country may find that the fonts on his computer are missing some vital characters, which the computer then renders in an arbitrary way (as hollow squares, etc.), leading to puzzlement and confusion. Eventually, as "large" Unicode fonts become more widely installed, the problem will diminish; but it will never entirely go away unless the Unicode standard stops evolving.

There is a need to talk about what an unrendered character looks like when explaining the concept to a user and explaining that special actions may need to be taken (for instance, changing fonts or downloading a new version of a font).

Printed manuals can handle unrendered characters quite easily. The manual can use one arbitrarily chosen appearance (such as U+25AF or U+2337) for unrendered characters, with a note (on first occurrence) that the screen appearance of unrendered characters may vary - screenshots can be given as examples.

On-screen text, and Web pages in particular, does however present problems. The writer of the text has no control over the font that will be used to display it [in some cases he may be able to specify or request the *name* of the font to be used, but this is no guarantee that the font of that name will contain all the needed characters or that it will even be installed on the user's computer]. There is a need to be able to say in a web page: "If some of the text on this page looks like this: ????? then you should install font XXXX / download a new font from [link]" - where ????? looks *exactly* how an unrendered character would look in the font that the web page is being displayed with.

No presently defined Unicode character can be used to represent <?> in the above message. A hollow rectangle such as U+25AF or U+2337 will only resemble the screen appearance of unrendered characters if the font being used happens to use that particular sort of hollow rectangle to represent unrendered characters: in a font that uses small diamonds, representing <?> as a hollow square would be confusing and counter-productive.

For the same reason, a bitmap cannot be used: a bitmap's appearance will not vary automatically as the font used to display the message changes.

Rewriting the message to say "If a lot of the text on this page looks like hollow squares or small solid rectangles or little diamonds or anything else strange, then you should install font XXXX / download a new font from [link]" is not a practical solution because it adds complexity, obscurity, and verbosity; adds a level of abstraction that it is neither necessary nor easy for the user to follow; and uses up valuable screen space.

It follows that there is a need for a defined Unicode character that represents the appearance of an unrendered character in the font in which it is displayed.

I am wondering whether it would be worth submitting a proposal for such a character. For example:
        U+024F UNRENDERED CHARACTER

While the addition of characters to Unicode is something to be done only as a last resort, I believe that there is, in this case, no alternative.

Such a character proposal would have the advantage that every existing Unicode font *already* implements it correctly - by definition [but see the note below about section 5.3 of the Unicode standard]. Thus no changes will be needed to fonts or to rendering engines.

To look at it another way, virtually the only action that the Unicode Consortium needs to take to define UNRENDERED CHARACTER is to promise never to define a character at that code point.

UNRENDERED CHARACTER has to be part of the BMP for backward compatibility: it should be renderable as a single glyph, not as a pair of glyphs, even on old systems that do not understand surrogates. The proposed positioning is intended to persuade older systems that this character should be rendered conventionally, like a Latin letter.
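
To make the surrogate point concrete, here is a minimal Python sketch. The helper name utf16_code_units and the choice of U+10300 as the non-BMP example are mine, purely for illustration. It counts how many 16-bit code units a code point occupies in UTF-16: a BMP code point takes one, anything beyond the BMP takes a surrogate pair.

```python
def utf16_code_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode one code point in UTF-16."""
    # Encode without a BOM so that only the character itself is counted.
    return len(ch.encode("utf-16-be")) // 2


bmp_candidate = "\u024f"      # the code point proposed in this message
supplementary = "\U00010300"  # an arbitrary code point outside the BMP

print(utf16_code_units(bmp_candidate))
print(utf16_code_units(supplementary))
```

A system that does not understand surrogates would see the second case as two separate unknown items, which is why a code point outside the BMP would not serve.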

The nearest possible alternatives are:

U+FFFE - on at least some Windows systems, this is displayed correctly (i.e. identically to characters that are missing from the current font); but in the Unicode standard it has the explicit semantics of not being a character at all, and so ought not to be intentionally used as a character (a rendering engine would be within its rights to suppress it altogether; some application programs might report errors or even become confused about byte ordering).

U+FFFD - on at least some Windows systems, this is displayed correctly (i.e. identically to characters that are missing from the current font); but in the Unicode standard it has the explicit semantics of being a replacement for a character *unrepresentable in Unicode*. A character unrepresentable in Unicode is not the same as a Unicode character that happens not to have a representation in the current font. It is also possible that a particular font may have its own visual representations of U+FFFC and U+FFFD that differ from the way that it draws unrendered characters.

Otto Stolz suggested U+03A2, which would be equally valid. However, U+03A2 is quite obviously the code for GREEK CAPITAL LETTER FINAL SIGMA. For O.S., this is a reason for using the code (because there is, in fact, no such letter, so the code can be used); for me, this is a strong reason for *not* using the code, because if it **ever** became necessary to encode GREEK CAPITAL LETTER FINAL SIGMA then no character other than U+03A2 would be acceptable, whereas U+024F has no inherent semantics at all.
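
The status of these alternatives can be checked mechanically. The following is only a sketch, assuming Python's standard unicodedata and codecs modules; the names candidates, describe, data and misread are mine. The categories reported depend on the version of the Unicode database bundled with the interpreter, so treat the output as indicative. The second half illustrates the byte-ordering hazard of U+FFFE: in big-endian UTF-16 its two bytes are the same as the little-endian byte order mark.

```python
import codecs
import unicodedata

# The alternative code points discussed above.
candidates = {
    "U+03A2": "\u03a2",
    "U+FFFE": "\ufffe",
    "U+FFFD": "\ufffd",
    "U+FFFC": "\ufffc",
}


def describe(ch):
    """Return (general category, name) as known to this interpreter's database."""
    return unicodedata.category(ch), unicodedata.name(ch, "<no name>")


for label, ch in candidates.items():
    print(label, *describe(ch))

# Byte-order confusion: U+FFFE followed by "A", encoded big-endian with no BOM.
data = ("\ufffe" + "A").encode("utf-16-be")

# The first two bytes are indistinguishable from a little-endian BOM.
assert data[:2] == codecs.BOM_UTF16_LE

# A BOM-sniffing decoder therefore consumes them and byte-swaps the rest.
misread = data.decode("utf-16")
print([f"U+{ord(c):04X}" for c in misread])
```

"Cn" is the category shared by unassigned code points and noncharacters, while "So" marks an assigned symbol with semantics of its own.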

Section 5.3 of the Unicode standard makes a distinction between unassigned and unrenderable characters. Systems that make use of this distinction are an exception to the statement I made earlier that "every existing Unicode font already renders UNRENDERED CHARACTER correctly". Nevertheless, the rendering of UNRENDERED CHARACTER as "unassigned" rather than "unrenderable" is unlikely to cause much confusion.

One other exception would be a pathologically helpful font/engine that represents each unrendered character as a unique glyph (for example, a miniature of the character's hexadecimal value). This, again, would not be a problem: the user will instantly recognize "miniature 024F" as being different from ordinary characters and in the same class as the "miniature 021D" glyphs that disfigure the page.
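
The two behaviours can be compared with a toy model. This is a sketch only: render_with_fallback, the coverage set and the bracketed labels are invented for illustration and do not describe any real rendering engine. A "font" is modelled as a set of code points it has glyphs for; everything else is shown either as one shared marker or as a miniature of its hexadecimal value.

```python
def render_with_fallback(text, coverage, hex_boxes=False):
    """Return one display label per code point in `text`.

    coverage  -- hypothetical set of code points the font has glyphs for
    hex_boxes -- if True, mimic the "helpful" engine that labels each
                 unrendered character with its hexadecimal value
    """
    out = []
    for ch in text:
        if ord(ch) in coverage:
            out.append(ch)                  # the font has a glyph: show it
        elif hex_boxes:
            out.append(f"[{ord(ch):04X}]")  # per-character hex miniature
        else:
            out.append("[?]")               # one shared "missing" glyph
    return out


ascii_only = set(range(0x20, 0x7F))  # a toy font covering printable ASCII
sample = "a\u021d\u024f"

print(render_with_fallback(sample, ascii_only))
print(render_with_fallback(sample, ascii_only, hex_boxes=True))
```

In either mode the proposed code point ends up looking like the other characters the font cannot draw, which is all the explanatory message needs.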

Would it be worth submitting a proposal for UNRENDERED CHARACTER? As I said, it *is* adequately implemented already: the only purpose for wanting it defined in the standard is to prevent the implementation from being suddenly broken in the future.

- Martin Kochanski.


