Unicode enabled OCR software

From: Kent_Spielmann@sil.org
Date: Tue Jan 31 2006 - 12:43:56 CST

    Does anyone know of an OCR software solution that permits mapping to the
    full Unicode character set as output from the character recognition
    process? This needs to include mapping to base character + combining
    character sequences.

    All of the software we have looked at (Fine Reader, OmniPage, and Text
    Bridge) can map only to those Unicode characters that are also defined in
    one of the ANSI code pages.

    We are trying to convert documents in minority languages, as well as
    linguistic documentation, and need access to a larger set of lesser-used
    characters.

    We find the situation curious since the reader that we are using (Abbyy
    Fine Reader) does output Unicode. It simply limits the selection of output
    codepoints to characters previously defined in ANSI. Allowing users to
    create custom mappings to "non-ANSI" Unicode codepoints would not seem to
    be difficult.
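
    As a rough illustration of the kind of custom mapping meant here, the
    sketch below does the remapping as a post-processing step. It assumes the
    OCR engine has been trained to emit an ANSI stand-in character for each
    glyph it cannot output directly. The table entries and the names
    CUSTOM_MAP and remap are hypothetical and are not part of any OCR product.

```python
import unicodedata

# Hypothetical custom mapping. Each key is a stand-in character from the
# Windows-1252 ("ANSI") repertoire that the OCR engine was trained to emit;
# each value is the Unicode sequence actually intended.
CUSTOM_MAP = {
    "\u00f1": "\u014b",        # n-tilde stands in for LATIN SMALL LETTER ENG
    "\u00e3": "a\u0330",       # a-tilde stands in for a + COMBINING TILDE BELOW
    "\u00f5": "\u0254\u0301",  # o-tilde stands in for open o + COMBINING ACUTE ACCENT
}

def remap(text, mapping=CUSTOM_MAP):
    """Replace stand-in characters with their intended Unicode sequences."""
    out = "".join(mapping.get(ch, ch) for ch in text)
    # NFC composes a sequence where a precomposed character exists and
    # otherwise leaves the base + combining sequence intact.
    return unicodedata.normalize("NFC", out)
```

    Characters without an entry in the table pass through unchanged, so the
    same step can be run over output that mixes ordinary and remapped text.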

    We speculate the reason may be one or more of the following:
       - The OCR developers may feel that, if they allow output to other code
         points, they also need to provide recognition templates for them.
       - The OCR software relies on spell checkers to improve output
         accuracy, and apparently most spell-check dictionaries do not allow
         non-ANSI characters (this is true of the Office 2003 spell checker).
       - There is not enough commercial motivation to provide this capability.

    Kent Spielmann

    International Linguistics Department
    7500 W. Camp Wisdom Road,
    Dallas, TX 75236 USA
    Tel: + 1 972 708 7570
