Cost of no OCR for extended Latin

From: Don Osborn (dzo@bisharat.net)
Date: Thu Oct 25 2007 - 01:26:32 CDT

    A quick search on Google Books of a book in Fula (Fulfulde Tales of North
    Cameroon, Paul Kazuhisa Eguchi) returned no hits for words containing
    extended Latin characters. The browser and Google handled the characters
    as expected, but the scanned, searchable text of the book apparently did
    not register the extended characters as such.

     

    I suspect this is a general problem stemming from a lack of OCR that
    recognizes extended characters, or at least that the scanning of this
    particular book did not recognize the characters.

     

    Is anyone aware of an OCR system that recognizes extended Latin characters
    from, say, the Latin Extended-A and -B, IPA Extensions, and Latin Extended
    Additional ranges? That is, for any language (orthography) that uses these
    characters?
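
    As a rough way to test whether a given scan's text layer kept such
    characters, here is a minimal Python sketch. It assumes the four Unicode
    blocks named above and only looks at precomposed characters (combining
    diacritics are not covered). The function name and the sample strings are
    hypothetical, chosen for illustration; the "OCR" string is not output from
    any real system.

```python
# Unicode block ranges for the blocks named in the question (inclusive).
BLOCKS = {
    "Latin Extended-A": (0x0100, 0x017F),
    "Latin Extended-B": (0x0180, 0x024F),
    "IPA Extensions": (0x0250, 0x02AF),
    "Latin Extended Additional": (0x1E00, 0x1EFF),
}


def extended_latin_chars(text):
    """Return {block name: sorted list of distinct characters from that block}."""
    found = {}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in BLOCKS.items():
            if lo <= cp <= hi:
                found.setdefault(name, set()).add(ch)
    return {name: sorted(chars) for name, chars in found.items()}


printed = "Fulɓe"    # as printed, with U+0253 (hooked b)
ocr_text = "Fulbe"   # hypothetical OCR result that lost the hooked b

print(extended_latin_chars(printed))   # {'IPA Extensions': ['ɓ']}
print(extended_latin_chars(ocr_text))  # {}
```

    An empty result for text that is known to use such characters in print
    would suggest the OCR step flattened them to basic Latin letters.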

     

    I've been discussing the scanning of African language materials as part of
    books-online programs. The good news is that a little of this has been
    started, but it is definitely not good news if the scanning is being done
    (in some or all cases) without the right OCR.

     

    TIA for any feedback.

     

    Don


