Re: Unicode enabled OCR software

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jan 31 2006 - 15:46:29 CST

    From: "Mark E. Shoulson" <mark@kli.org>
    > Kent_Spielmann@sil.org wrote:
    >
    >>Does anyone know of OCR software solution that permits mapping to the full
    >>Unicode character set as output from the character recognition process?
    >>This needs to include mapping to base character+combining character
    >>combinations.
    >>
    >>
    > Ow. OCR to *full* Unicode sounds like it would have a lot of potential
    > problems. Within any given alphabet, you can usually count on letters to
    > look somewhat different from each other, but so many Unicode characters
    > resemble other ones, how is the program to know which to output? An "A"
    > might be Latin, Greek, or Cyrillic, and they'd all look identical (not
    > even "similar").
    >
    > Spelling dictionaries will help, and some heuristics like "probably the
    > letters are all in the same alphabet" (but maybe alphabets might have to
    > be defined across blocks, like IPA).
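
    The "letters are probably all in the same alphabet" heuristic quoted above could be sketched roughly as follows in Python. This is only an illustration, not how any particular OCR product works: the per-glyph candidate lists are hypothetical recognizer output, and guessing the script from the first word of the Unicode character name is a crude stand-in for real script data (it would not handle cross-block alphabets such as IPA).

```python
import unicodedata
from collections import Counter

# Hypothetical recognizer output for one glyph shape that is identical
# in three scripts: Latin A, Greek Alpha, Cyrillic A.
A_SHAPE = ["\u0041", "\u0391", "\u0410"]


def script_of(ch):
    """Crude script guess: first word of the Unicode character name."""
    return unicodedata.name(ch).split()[0]


def resolve_word(candidates_per_glyph):
    """Pick one code point per glyph, preferring the word's majority script.

    Each element of candidates_per_glyph is a list of candidate characters;
    a one-element list means the glyph was recognized unambiguously.
    """
    # Only unambiguous glyphs get a vote on the script of the word.
    votes = Counter(
        script_of(cands[0]) for cands in candidates_per_glyph if len(cands) == 1
    )
    if not votes:
        # Nothing to go on: fall back to the first candidate of each glyph.
        return "".join(cands[0] for cands in candidates_per_glyph)
    majority = votes.most_common(1)[0][0]
    out = []
    for cands in candidates_per_glyph:
        matching = [ch for ch in cands if script_of(ch) == majority]
        out.append(matching[0] if matching else cands[0])
    return "".join(out)


# An unambiguous Cyrillic DE followed by the ambiguous A shape.
word = resolve_word([["\u0414"], A_SHAPE])
```

    Here the unambiguous Cyrillic letter pulls the ambiguous shape to U+0410 rather than U+0041. A spelling dictionary would still be needed for words made up entirely of ambiguous shapes.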

    What Kent is saying is that the software limits the code points that can be output for a recognized or learnt glyph pattern.
    This means that he can get the glyphs recognized perfectly well, but is then restricted in the code points he can select for those glyphs.
    So he's looking for software that can be trained on the customized subset of glyphs needed for some language/script pair, and that then lets him associate code points with them more freely. That would avoid having to generate hacked transliterations and later run a conversion process to produce the actual code points needed to match linguistic dictionaries, and it would also avoid having to transliterate those dictionaries into a hacked script (something really not easy to do with the Office 2003 spell checker).
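
    To make the idea concrete, here is a minimal Python sketch of such a free mapping from learnt glyph classes to code point sequences, including base character + combining character combinations. The glyph labels and the table are invented for illustration; a real OCR package would have its own training format.

```python
import unicodedata

# Hypothetical user-defined table: learnt glyph class -> Unicode string.
# A value may be a single code point or a base + combining sequence.
GLYPH_TO_TEXT = {
    "a_macron_acute": "a\u0304\u0301",     # a + COMBINING MACRON + COMBINING ACUTE
    "deva_ka": "\u0915",                   # DEVANAGARI LETTER KA
    "deva_ki": "\u0915\u093F",             # KA + VOWEL SIGN I (stored after KA)
    "deva_kssa": "\u0915\u094D\u0937",     # KA + VIRAMA + SSA (conjunct)
}


def glyphs_to_text(labels, table=GLYPH_TO_TEXT):
    """Turn a sequence of recognized glyph labels into NFC-normalized text."""
    missing = [label for label in labels if label not in table]
    if missing:
        raise KeyError("no code points assigned to: " + ", ".join(missing))
    text = "".join(table[label] for label in labels)
    # NFC composes base + combining marks where a precomposed form exists
    # and leaves the remaining combining sequences in place.
    return unicodedata.normalize("NFC", text)


text = glyphs_to_text(["deva_ki", "deva_kssa", "a_macron_acute"])
```

    The output is then real text in the target script, so it can be matched against ordinary dictionaries without an intermediate transliteration step.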

    As he speaks mostly about recognizing characters that are not in a Windows ANSI code page, this may be, for example, OCR adapted to South Asian language/script pairs (Hindi/Devanagari, say), which could certainly be handled by OCR, given that there are not so many glyphs and they are quite regular.

    Without it, he would need to map the recognized Devanagari letters or grapheme clusters into some awful Roman transliterations...


