L2/06-055
Source: Michel Suignard
Date: Thursday, February 02, 2006 6:40 PM
Subject: Confusable data as relative concept

First, it is important to state that confusability among characters cannot be an exact science. Many factors make confusability among characters a relative concept:

Shapes of characters vary greatly among the fonts used to represent them. The Unicode Standard shows them in its chart section with representative glyphs, but font designers are free to create their own glyphs. Because a font can easily be created that represents any Unicode code position with an arbitrary glyph, character confusability can never be avoided. For example, one could design a font where the ‘a’ looks like a ‘b’, the ‘c’ like a ‘d’, and so on.

Writing systems that use contextual shaping (such as Arabic and many South Asian systems) introduce even more variation in text rendering. Characters don’t really have an abstract shape in isolation and are only rendered as part of a cluster of characters making up words, expressions, and sentences. It is in fact fairly common to find the same visual text representation corresponding to very different logical words, which can be told apart only by context, if at all.

Font style variants may introduce a confusability that does not exist in another style (for example, normal versus italic). In the Cyrillic script, the small letter TE (U+0442) looks like a small-caps Latin ‘T’ in normal style, while it looks like a small Latin ‘m’ in italic style.

The tables were created using the following assumptions:

Confusability tables are based on the fonts used by the major operating systems for User Interface purposes. These choices tend to be dictated by the OS and are a known quantity by necessity (authors of User Interface content need reasonably reliable layout to express the User Interface paradigm). In addition, the representative glyphs used in the Unicode Standard were also considered.

Fonts used for User Interface are rarely changed by the OS and applications. Changes brought by system upgrades tend to be gradual to avoid usability disruption. Typically, the font style does not change within a User Interface atomic element.

Because User Interface elements need to be legible at low screen resolution (implying a small number of pixels per EM unit), fonts used in these contexts tend to be designed in a sans-serif style. Furthermore, strict bounding box requirements create even more constraints for scripts that use large ascenders and descenders relative to their EM size. This also limits the space allocated for accent or tone marks.

Of great importance are scenarios involving the use of text as identifiers, such as IRIs (Internationalized Resource Identifiers) and their sub-elements (e.g. domain names), used in the context of User Interface elements.

Although the priority is on analyzing confusability for sans-serif font styles, commonly used serif fonts also need to be investigated. There are many scripts for which serif is the only choice, and there are also locales where a serif style is much preferred for User Interface (for example, Chinese).

The focus for mixed-script confusability was on the Latin script versus other scripts, because this is perceived as the major threat. It is expected that the other mixes will be studied in more detail in the future.

In-script confusability is extremely user-dependent. For example, in the Latin script, characters with accents or appendices may look similar to the unadorned characters for some users, especially if they are not familiar with their meaning in a particular language. It is however expected that most users in a position to trust identifiers will have a minimum of understanding of the scripts in which these identifiers are written.

For bicameral scripts (such as Latin, Greek, and Cyrillic) it makes a lot of sense to separate confusables by case (small/capital, or upper/lower).
Confusable tables that mix cases are less useful because they create too many false positives. For example, mixing cases in Latin and Greek may make the Latin letter pairs {Y, U} and {N, V} confusable! (Greek capital upsilon Υ resembles Latin Y while small upsilon υ resembles Latin u; likewise capital nu Ν resembles Latin N while small nu ν resembles Latin v.)

Confusable data generation

The initial visual confusables data was derived in the following way. A number of prospective confusables were first gathered. These were then examined under a variety of fonts available on Windows, using the assumptions above, and were also compared against the representative glyphs in the Unicode Standard. Pairs of prospective confusables were removed if they were always visually distinct at common sizes, whether the pair was shown in the same font or in different fonts. This data was then closed under transitivity (so that if X≅Y and Y≅Z, then X≅Z), and processed to produce the in-script and cross-script tables.

The prospective confusables were gathered from a number of sources. Volunteers from within IBM and Microsoft, including native speakers of languages with different writing systems, gathered initial lists. The compatibility mappings were also used as a source, as were the mappings from the draft UTR #30 Character Foldings [http://unicode.org/reports/tr30/]. Eric van der Poel also contributed a list derived from running a program over a large number of OpenType fonts to catch characters that shared identical glyphs within a font.

The process of gathering visual confusables is ongoing: the Unicode Consortium welcomes submission of additional mappings. In particular, it would be useful to compare glyphs from common Macintosh, Linux, and Unix fonts as well. The complex scripts of South and Southeast Asia also need special attention.
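
The closure and table-splitting steps described under "Confusable data generation" can be illustrated with a minimal Python sketch. It is not the tooling actually used to build the tables. The function names (close_under_transitivity, rough_script, split_tables) and the sample pairs are hypothetical, and the script test is a rough assumption: it takes the first word of each character name, since Python's standard library does not expose the Unicode Script property.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations


def close_under_transitivity(pairs):
    """Merge pairwise confusable links into equivalence classes (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = defaultdict(set)
    for ch in list(parent):
        classes[find(ch)].add(ch)
    return [frozenset(members) for members in classes.values()]


def rough_script(ch):
    # ASSUMPTION: the first word of the character name ("LATIN", "GREEK",
    # "CYRILLIC", ...) stands in for the real Unicode Script property.
    return unicodedata.name(ch).split()[0]


def split_tables(classes):
    """Expand each class into pairs and sort them into two tables."""
    in_script, cross_script = set(), set()
    for members in classes:
        for a, b in combinations(sorted(members), 2):
            same = rough_script(a) == rough_script(b)
            (in_script if same else cross_script).add((a, b))
    return in_script, cross_script


# Hypothetical input: Latin o ~ Cyrillic o, Cyrillic o ~ Greek omicron,
# and Latin small l ~ Latin capital I.
sample_pairs = [("o", "\u043e"), ("\u043e", "\u03bf"), ("l", "I")]
sample_classes = close_under_transitivity(sample_pairs)
sample_in_script, sample_cross_script = split_tables(sample_classes)
```

In this sketch the link between Latin o and Greek omicron is never stated directly in the input; it appears in the cross-script table only because both are linked to Cyrillic o. This is also why mixing cases is costly: a single extra link joins two whole classes.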