Re: Cost of no OCR for extended Latin

From: David Starner (prosfilaes@gmail.com)
Date: Thu Oct 25 2007 - 05:41:19 CDT

Next message: Mark E. Shoulson: "Re: Level of Unicode support required for various languages"

Previous message: Vinod Kumar: "Re: Level of Unicode support required for various languages"
In reply to: Don Osborn: "Cost of no OCR for extended Latin"
Next in thread: Lorna_Priest@sil.org: "Re: Cost of no OCR for extended Latin"
Reply: Lorna_Priest@sil.org: "Re: Cost of no OCR for extended Latin"

On 10/25/07, Don Osborn <dzo@bisharat.net> wrote:
> I suspect this is a general problem going back to a lack of OCR that
> recognizes extended characters, or at least the scanning of this particular
> book did not recognize the characters.

It's hard to get good OCR without knowing what characters you're
looking for. After real-life typesetting and scanning, an O with ~
above could be read as an O with a macron, a circumflex, or a tilde,
or as a bare O with a smear of ink.
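
To make the ambiguity concrete, here is a minimal Python sketch using
only the standard unicodedata module. The list of candidates is my own
choice for illustration; it just shows that the competing readings
share the same base letter and differ only in one small combining mark.

```python
import unicodedata

# Candidate readings of a scanned "O with a mark above" (illustrative list).
candidates = ["\u00D5", "\u014C", "\u00D4", "O"]  # Õ, Ō, Ô, O

def describe(ch):
    """Return (base letter, name of the combining mark or None)."""
    # NFD splits a precomposed letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    mark = unicodedata.name(decomposed[1]) if len(decomposed) > 1 else None
    return base, mark

readings = {ch: describe(ch) for ch in candidates}
for ch, (base, mark) in readings.items():
    print(ch, base, mark)
```

Every candidate decomposes to the same base O, so the recognizer has
only the shape of the mark to go on.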

> Is anyone aware of an OCR system that recognizes extended Latin characters
> from say Extended A&B, IPA, and Extended Additional ranges? That is for any
> language (orthography) including these characters?

ABBYY offers most of Extended A and some of Extended B and Additional.
The list of supported languages is
<http://www.abbyy.com/finereader8/?param=44927>, which should map to
the list of supported characters. It would be difficult, if not
impossible, to create and test an OCR engine without a substantial
corpus of material using a character; I suspect many languages are on
ABBYY's list only because their orthography is a subset of the
characters already supported for other reasons.
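
The subset point can be checked mechanically. A rough Python sketch
follows; the character sets below are invented for illustration and do
not reflect ABBYY's actual coverage or any particular language's
alphabet, and missing_characters is a hypothetical helper name.

```python
def missing_characters(orthography, supported):
    """Return the characters the orthography needs that the engine lacks."""
    return sorted(set(orthography) - set(supported))

# Hypothetical engine coverage: basic Latin lowercase plus two hooked letters.
supported = set("abcdefghijklmnopqrstuvwxyz") | {"\u0253", "\u0257"}  # ɓ, ɗ

# Hypothetical orthographies, for illustration only.
orthography_a = set("abdeghik") | {"\u0253", "\u0257"}  # ɓ, ɗ
orthography_b = set("abdeghik") | {"\u0254", "\u025B"}  # ɔ, ɛ

gaps_a = missing_characters(orthography_a, supported)
gaps_b = missing_characters(orthography_b, supported)
print(gaps_a)
print(gaps_b)
```

An empty result means the orthography comes along for free; anything
left over is a character the engine would need to be trained and
tested on.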

> I've been discussing scanning of African language materials as part of books
> online programs. The good news is a little of that has been started, but it
> is definitely not good news if the scanning is being done (in some or all
> cases) without the right OCR.

Why? Once you have the scans, you can always re-OCR them. There's no
way that any automated scanning program is going to handle unusual
text like African language materials as well as someone who's focused
on and familiar with them.


This archive was generated by hypermail 2.1.5 : Thu Oct 25 2007 - 05:43:43 CDT