Report to TC304 on OCR-B situation
This document briefly describes the situation for present OCR-B activities. It is primarily intended to provide background information for the TC304 June 2001 Plenary discussions and decisions in its agenda items "OCR-B testing – report" and "OCR-B – possible EN/ENV – with euro-sign and some more European letters".
The document therefore concentrates on the issues relevant for those discussions/decisions. A detailed report on the testing will be submitted after its completion.
In addition to the testing and the EN(V) development work, the document describes recent ISO/IEC JTC 1 activities relating to OCR-B.
The OCR-B testing of the Euro glyph has been completed as planned. The Testing Project interprets the results as justifying the normative addition of the glyph to the present standardized OCR-B repertoire. Work on the EN(V) has therefore been started.
Testing of the OCR-B European-language letter extensions has been started, but problems were encountered. It has therefore not yet been possible to complete this part of the work as planned. The Project however holds the opinion that it will eventually be possible to include the extensions in the EN(V), informatively.
Description of testing activities
Based on the decisions of the 1999 Tübingen TC304 meeting, its Chair and Secretary carried out discussions on OCR-B testing with LWP Consulting, Stockholm, Sweden and the University of Tübingen, Germany. These discussions resulted in a joint LWP/Tübingen proposal for the testing submitted in the autumn of 2000, and a contract for the proposed work signed with STRI in January 2001.
A detailed plan for the testing was developed by the Project and described in an interim report dated 2001-01-31 (distributed in TC304 as document N964).
The theoretical design methods used when the OCR-B glyphs were originally developed can no longer be reproduced, and they are also not completely relevant in today's OCR environments. It was therefore decided by the Project to base the testing on actual OCR application situations, using typical hardware and software, i.e. flatbed scanners and a professional recognition package.
After considering different alternatives, the Project selected the "RecoStar" software from "Océ Document Technology" for the testing. The recognition repertoire of the package was extended specially for the testing by that company to include the new OCR-B glyphs, i.e. the Euro sign and a number of special letters not at present covered in the package as marketed.
It should be noted that "RecoStar" is designed to interpret different fonts, based on "generic" glyph shape characteristics. It is consequently not optimized specifically for OCR‑B, and the conclusions that can be drawn from the testing should therefore be "on the safe side" from a reliability point of view.
It was decided that the material to be tested should consist of a large number of "triglyphs" (groups of three consecutive characters) generated in a systematic way. For printing the triglyphs, an OCR-B PostScript Type 3 font was implemented specifically for the tests.
Euro sign recognition tests
The most important situations for recognition of the Euro sign exist in typically financial OCR applications. The Project therefore decided on a base glyph repertoire for the tests consisting of the OCR-B standard's "Basic alphanumeric subset" (with the exception of the "vertical line", which was excluded for special reasons); i.e. the digits 0-9, the capital letters A-Z, and the special characters full stop ("period"), comma, plus sign, hyphen-minus, asterisk, solidus ("slash"), equals sign, less-than sign and greater-than sign.
To this repertoire were added the financial glyphs, i.e. the euro, dollar, pound, yen and international currency (¤) signs.
In earlier work on the OCR-B euro sign, two alternative shapes for the sign were developed, a "double-bar" and a "single-bar" shape; see TC304 N837. One important objective of the testing was to investigate if any difference in recognition properties exists between these two shapes. The main testing was therefore first performed using the "single-bar" euro glyph, and was then completely repeated using the "double-bar" glyph.
For the main testing, all possible combinations of the 50 glyphs of the repertoire were generated, i.e. a total of 125000 triglyphs, and printed (on a total of 167 pages). The material was scanned and then processed by "RecoStar".
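The triglyph count can be checked with a short sketch. It assumes that "all possible combinations" means ordered groups of three glyphs with repetition allowed, which is the reading that yields 50 x 50 x 50 = 125000. The Unicode characters below are stand-ins for the OCR-B glyphs listed above, and the names REPERTOIRE and generate_triglyphs are illustrative; this is not the Project's actual generation tool.

```python
from itertools import product

# Stand-in characters for the 50-glyph test repertoire described in the text.
DIGITS = "0123456789"                      # 10 digits
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"     # 26 capital letters
SPECIALS = ".,+-*/=<>"                     # 9 special characters
CURRENCY = "€$£¥¤"                         # 5 financial glyphs

REPERTOIRE = DIGITS + LETTERS + SPECIALS + CURRENCY


def generate_triglyphs(repertoire):
    """Yield every ordered group of three glyphs, repetition allowed (assumed reading)."""
    for triple in product(repertoire, repeat=3):
        yield "".join(triple)


triglyphs = list(generate_triglyphs(REPERTOIRE))
print(len(REPERTOIRE))   # 50
print(len(triglyphs))    # 125000
```

Spread over the 167 printed pages, this corresponds to roughly 750 triglyphs per page.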
The total recognition error rate was 3.4%, for both the initial "single-bar" and the repeated "double-bar" testing. Euro sign recognition, however, was completely error-free for both glyph shapes.
An analysis of the errors shows that most are attributable to the fact that the software is not optimized for OCR-B. The bulk of the errors consist of mistaking letter O for digit 0 and letter I for digit 1 or vice versa. In OCR-B these particular glyphs have been carefully designed to have differing distinguishing features. Those differences are however not utilized by the software used in the testing.
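An error analysis of this kind can be sketched as follows. The function classify_errors and the sample pairs are hypothetical and serve only to illustrate the method; they are not the Project's tooling or its measured data. The sketch compares each printed triglyph with the recognized string, position by position, and separates the O/0 and I/1 confusions from all other errors.

```python
from collections import Counter

# Glyph pairs that a generic (non-OCR-B-optimized) recognizer tends to confuse.
CONFUSABLE = {frozenset("O0"), frozenset("I1")}


def classify_errors(results):
    """Count per-glyph outcomes for (printed, recognized) triglyph pairs."""
    counts = Counter()
    for printed, recognized in results:
        if len(printed) != len(recognized):
            # Segmentation failure: glyph positions cannot be aligned.
            counts["length_mismatch"] += 1
            continue
        for p, r in zip(printed, recognized):
            if p == r:
                counts["correct"] += 1
            elif frozenset((p, r)) in CONFUSABLE:
                counts["o0_i1_confusion"] += 1
            else:
                counts["other_error"] += 1
    return counts


# Made-up sample for illustration only (not test results).
sample = [("A0€", "AO€"), ("1I5", "II5"), ("B+8", "B+8"), ("€$£", "€S£")]
summary = classify_errors(sample)
print(summary["correct"], summary["o0_i1_confusion"], summary["other_error"])
```

Reporting the O/0 and I/1 confusions separately makes it possible to see how much of the total error rate stems from the software's generic design rather than from the glyph shapes under test.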
Limited testing of the euro sign was also performed with reduced-quality printing, and with a larger glyph repertoire. The results of these tests do not differ in principle from those of the main tests.
Since recognition of the euro sign was completely without problems, the Project concludes that the glyph can be added to the present OCR-B repertoire.
Also, since no differences in recognition properties exist between the "single-bar" and the "double-bar" glyphs, the "double-bar" shape (most closely corresponding to the "official euro shape") should be adopted.
Letter extension recognition tests
The repertoire for this testing is based on the complete present OCR-B repertoire with the euro sign and a number of "national letters" added.
It would naturally be desirable to cover all the letters tentatively added to the OCR-B repertoire during the ISO 1073-II revision process (see the final CD for the revision, ISO/IEC JTC 1/SC 2 N 2765). This includes a number of special-shape letters, like the Œ ligature. It also includes some pre-composed letters with diacritical marks that cannot be constructed according to the general rules for superimposing the marks, like the capital and small a with ogonek. Further, it includes regular combinations of diacritical marks with base letters. Finally, it includes those Greek capital letters that have no corresponding OCR-B Latin glyph.
Testing such a large repertoire of letters is not possible within the limited funding allocated to the project. The test repertoire has therefore been limited to those Latin letters in the parts of ISO/IEC 8859 that exist in the alphabets of the majority languages of the CEN member countries.
With this limited repertoire, triglyphs have been generated and printed, and scanned and interpreted by the same methods as described above for the euro sign testing (a total of 741 pages).
It turned out, however, that the height of some letters with diacritics caused difficulties for the recognition software in establishing printed lines. The printing and scanning process must therefore be repeated with a slightly larger line spacing.
Although the testing of the letter extensions has consequently been delayed, there is no indication that the letters could not be added informatively in the EN(V). The development of the standard will therefore proceed in parallel with the testing.
The present OCR-B standard ISO 1073-II was developed in the 1970s, and has a general structure which is not completely satisfactory by today's principles. It would however be unsuitable to adopt a new and different structure for the proposed EN(V), since this would confuse its relationship to the ISO standard.
It is therefore proposed that the EN(V) be based on the last 1073 revision CD, ISO/IEC JTC 1/SC 2 N 2765, with the following main changes:
1. Foreword and Introduction will be adapted to CEN conditions, summarize the CEN history of the EN(V), and also describe its relationship to the ISO standard.
2. The table of glyphs (Table 2) will contain only the original OCR-B characters and the euro sign. The glyphs added in the revision process will be specified instead in an informative annex, where it will be indicated which of them have been tested. The illustration showing all glyphs (Figure 4) will be modified accordingly.
3. The differing positioning in height for diacritical marks used with small letters as compared to those used with capital letters (or small letters with ascenders) will be clarified in the text of the standard (in subclauses 14.5 and 14.8).
4. The euro sign glyph will be specified in a normative annex according to the syntax of ISO/IEC 9541-3, and possibly also (informatively) according to PostScript syntax (which is more generally known than that of 9541).
5. An informative annex describing briefly the OCR-B testing will be added.
Cooperation with ISO/IEC JTC1
As described in the Liaison Report to JTC 1/SC 2 from the recent JTC 1/SC 31 Plenary meeting (TC304 N980), responsibility for the OCR standards ISO 1073 and ISO 1831 will be taken over from SC 2 by SC 31. Preceding the Plenary, various possibilities for cooperation between SC 31 and TC304 in connection with the development of the OCR-B EN(V) were discussed on a working level.
It was there concluded that SC 31 should not at present engage in the TC304 activities. After the EN(V) has been developed, SC 31 will review the new situation to decide if any work should be started in the matter; e.g. a "fast-tracking" of the EN(V) into an ISO/IEC revision of the OCR-B standard.