Re: Cost of no OCR for extended Latin

From: David Starner
Date: Thu Oct 25 2007 - 05:41:19 CDT

    On 10/25/07, Don Osborn wrote:
    > I suspect this is a general problem going back to a lack of OCR that
    > recognizes extended characters, or at least the scanning of this particular
    > book did not recognize the characters.

    It's hard to get good OCR without knowing what characters you're
    looking for. An O with ~ above, after real-life typesetting and
    scanning, could be an O with a macron, a circumflex, or a tilde
    above, or a bare O with a smear of ink.
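
    To make the ambiguity concrete, here is a minimal Python sketch using
    only the standard unicodedata module. The helper name decompose and
    the choice of candidates are illustrative assumptions, not part of
    any OCR engine. It shows that the competing readings share the same
    base letter and differ only in one small combining mark, which is
    exactly the part a degraded scan loses.

```python
import unicodedata

# Candidate readings of a scanned "O with some mark above":
# O with tilde, O with macron, O with circumflex, and a bare O.
CANDIDATES = ["\u00D5", "\u014C", "\u00D4", "O"]


def decompose(ch):
    """Return (base letter, names of combining marks) for one character."""
    nfd = unicodedata.normalize("NFD", ch)  # split into base + combining marks
    base = nfd[0]
    marks = [unicodedata.name(c) for c in nfd[1:]]
    return base, marks


if __name__ == "__main__":
    for ch in CANDIDATES:
        base, marks = decompose(ch)
        print(ch, "=", base, "+", marks)
```

    Every candidate decomposes to the same base O, so an engine has to
    decide between them on the mark alone unless it already knows which
    characters the text can contain.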

    > Is anyone aware of an OCR system that recognizes extended Latin characters
    > from say Extended A&B, IPA, and Extended Additional ranges? That is for any
    > language (orthography) including these characters?

    ABBYY offers most of Extended A and some of Extended B and Additional.
    The list of supported languages is
    <>, which should map to
    the list of supported characters. It would be hard to impossible to
    create and test an OCR without a substantial corpus of material using
    a character; I suspect many languages are on ABBYY's list only because
    the orthography is a subset of those supported for other reasons.
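
    The subset argument can be sketched as a simple set comparison. This
    is only an illustration: SUPPORTED is a made-up placeholder, not
    ABBYY's actual character list, and ORTHO_A and ORTHO_B are
    hypothetical orthographies, not real alphabets.

```python
import unicodedata

# Placeholder character set an OCR engine might support (an assumption,
# not any vendor's real list).
SUPPORTED = "abcdefghijklmnopqrstuvwxyzàáâãäçèéêëñòóôõöùúûü"

# Two hypothetical orthographies, invented for illustration.
ORTHO_A = "abdefgiklmnoprstuwyéèô"   # uses only supported characters
ORTHO_B = "abdeɛfgiklmnoɔprstuwyŋ"   # also needs ɛ, ɔ and ŋ


def charset(text):
    """Return the set of characters in text after NFC normalization."""
    return set(unicodedata.normalize("NFC", text))


def missing(orthography, supported):
    """Return the sorted characters the orthography needs but lacks support for."""
    return sorted(charset(orthography) - charset(supported))


def is_covered(orthography, supported):
    """True if the orthography is a subset of the supported characters."""
    return not missing(orthography, supported)


if __name__ == "__main__":
    for name, ortho in [("ORTHO_A", ORTHO_A), ("ORTHO_B", ORTHO_B)]:
        print(name, is_covered(ortho, SUPPORTED), missing(ortho, SUPPORTED))
```

    An orthography like ORTHO_A would come along for free on a supported
    list, while one like ORTHO_B would need recognition and test
    material for each of its extra characters.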

    > I've been discussing scanning of African language materials as part of books
    > online programs. The good news is a little of that has been started, but it
    > is definitely not good news if the scanning is being done (in some or all
    > cases) without the right OCR.

    Why? Once you have the scans, you can always re-OCR. There's no way
    that any automated scanning program is going to handle unusual text
    like African language materials as well as someone who's focused and
    familiar with them.
