RE: Cost of no OCR for extended Latin

From: Don Osborn
Date: Sat Oct 27 2007 - 10:42:00 CDT


    Thanks Lorna and David for this information. I was not familiar with ABBYY
    FineReader. I have been impressed with use of OmniPage OCR for some kinds of
    text (in English mainly; occasionally French) that I expected problems with,
    due to the quality of the originals and the fact I had to scan photocopies.
    Getting performance at least that good in FineReader with at least some
    extended Latin would be impressive.


    One hopes to see the supported extended Latin character repertoire expanded
    to cover languages with multiple diacritics (noting the absence from the
    list of, among others, Yoruba and Igbo in Africa, and Vietnamese in Asia).
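
As a rough illustration of the repertoire question, the sketch below reports which Unicode blocks a sample of text draws on after NFC normalization. It is only a minimal sketch: the function names `block_of` and `blocks_needed` are hypothetical, the block table covers only the Latin-related ranges discussed in this thread, and it says nothing about what any particular OCR product actually supports.

```python
import unicodedata

# Latin-related Unicode blocks mentioned in this thread, plus the
# combining marks block. Anything else is reported as "Other".
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
]


def block_of(ch):
    """Return the name of the block containing ch (hypothetical helper)."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "Other"


def blocks_needed(text):
    """Return the set of blocks used by text once it is NFC-normalized.

    Whitespace is ignored. A leftover "Combining Diacritical Marks" entry
    means some letter has no single precomposed code point, so a recognizer
    would have to handle base letter plus combining mark.
    """
    normalized = unicodedata.normalize("NFC", text)
    return {block_of(ch) for ch in normalized if not ch.isspace()}


if __name__ == "__main__":
    # Vietnamese sample: the vowel is a single precomposed letter.
    print(sorted(blocks_needed("Vi\u1ec7t")))
    # Yoruba-style e with dot below and grave: no precomposed form exists.
    print(sorted(blocks_needed("e\u0323\u0300")))
```

Running this over a word list for a given orthography gives a quick idea of whether it stays within precomposed extended Latin or also depends on combining marks, which is the case for letters carrying multiple diacritics in languages such as Yoruba.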


    Part of the reason for the question is a discussion about ways to promote or
    develop a project on digitization of African language materials. Ideally one
    should OCR right the first time, and if the technology permits that for
    extended Latin orthographies, that's one less problem to overcome.


    All the best.






    From: []
    Sent: Thursday, October 25, 2007 11:09 AM
    To: Don Osborn;
    Subject: Re: Cost of no OCR for extended Latin


    > David Starner wrote on 10/25/2007 05:41:19 AM:

    > > On 10/25/07, Don Osborn wrote:
    > > Is anyone aware of an OCR system that recognizes extended Latin
    > > from say Extended A&B, IPA, and Extended Additional ranges? That is for
    > > language (orthography) including these characters?
    > ABBYY offers most of Extended A and some of Extended B and Additional.
    > The list of supported languages is
    > <>, which should map to
    > the list of supported characters. It would be hard to impossible to
    > create and test an OCR without a substantial corpus of material using
    > a character; I suspect many languages are on ABBYY's list only because
    > the orthography is a subset of those supported for other reasons.

    Quoting two different colleagues of mine: "I recommend FineReader
    from Abbyy Software. While OmniPage is good, FineReader
    is better--the best OCR software at an affordable price...FineReader can
    handle special characters better than other OCR programs."


    "I heartily recommend FineReader. It can be "trained" to recognize
    speciality characters, and it is surprisingly accurate - about 99% - which
    means that 1% of the document will require manual corrections."

