Re: Bangla

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun May 23 2004 - 09:16:50 CDT


    From: "Towheed Chowdhury" <nsumba@hotmail.com>
    > How bangla ocr can be developed using current unicode?

    ISO/IEC 10646 and Unicode are just standards for character encoding, not for
    rendering or presentation.
    OCR is a difficult problem, but it has nothing to do with character encoding,
    as it is an analysis of glyphs.
    Generally, good OCR is difficult to automate without specific fonts that use
    simplified or slightly altered (but still readable) glyphs.

    This is not a problem of Unicode.

    What Unicode has done is only to add some characters that were used in the OCR
    context, such as the symbols on checks. These were created and printed
    specially for OCR systems and had no prior meaning in linguistic or plain-text
    use; in Unicode these special glyphs are coded as distinct symbols with their
    own code points.
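
    As an illustration, the symbols in question sit in the Unicode block
    "Optical Character Recognition" (U+2440..U+245F). The sketch below only
    lists whatever names the local Python build knows for that range; the
    helper name ocr_block_characters is made up for this example.

```python
import unicodedata


def ocr_block_characters():
    """Return (code point, name) pairs for the assigned characters
    in the Optical Character Recognition block, U+2440..U+245F."""
    result = []
    for cp in range(0x2440, 0x2460):
        # Unassigned code points have no name; skip them.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            result.append((cp, name))
    return result


if __name__ == "__main__":
    for cp, name in ocr_block_characters():
        print(f"U+{cp:04X}  {name}")
```

    The names (for example OCR AMOUNT OF CHECK) describe what the symbol is for
    on the printed form, not how a scanner should recognize it.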
    OCR already has difficulty recognizing accents on modern Latin, Greek or
    Cyrillic letters, and it does not work well with other scripts (it works with
    unpointed Hebrew, but fails with Arabic because of the complex joining behavior
    and the very small differences between glyphs in the most widely used
    typographic variants of the Arabic script).
    I don't know whether there have been attempts to recognize Devanagari in India.
    Hiragana and Katakana may work well in OCR, but Japanese texts generally contain
    many Han ideographs, which are very difficult to recognize with OCR because of
    their graphic complexity.

    Maybe there is OCR that works with basic Hangul jamos (written linearly
    instead of in syllabic squares).
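
    On the encoding side, the relation between a syllabic square and its jamos
    is already defined: canonical decomposition (NFD) splits a precomposed
    Hangul syllable into conjoining jamos, and NFC recomposes it. The sketch
    below assumes that conjoining jamos are an acceptable stand-in for "basic
    jamos"; the helper name to_jamos is made up for this example.

```python
import unicodedata


def to_jamos(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining jamos (NFD)."""
    return unicodedata.normalize("NFD", text)


def to_syllables(jamos: str) -> str:
    """Recompose conjoining jamos into precomposed syllables (NFC)."""
    return unicodedata.normalize("NFC", jamos)


if __name__ == "__main__":
    word = "\uD55C\uAE00"  # the two syllables of the word "Hangul"
    for ch in to_jamos(word):
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # Round trip: the linear jamo sequence recomposes to the same syllables.
    print(to_syllables(to_jamos(word)) == word)
```

    So an OCR system that recognized jamos one by one could still produce
    ordinary syllable text; the hard part remains the recognition itself.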

    In all these cases, the target encoding when parsing a scanned image of a text
    is not the issue: the difficulty is in recognizing abstract characters from
    many distinct glyph shapes, which will always exhibit slight variations when
    scanned from printed paper.
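
    To make the glyph versus abstract character distinction concrete: Unicode
    happens to encode the four positional shapes of the Arabic letter BEH as
    compatibility characters, and NFKC normalization maps all of them to the
    single letter U+0628. This is only an encoding-level analogy for what OCR
    must do with pixels, and the names BEH_FORMS and abstract_character are
    made up for this example.

```python
import unicodedata

# Isolated, final, initial and medial forms of ARABIC LETTER BEH,
# encoded as compatibility characters in Arabic Presentation Forms-B.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def abstract_character(glyph_form: str) -> str:
    """Map a presentation form to its abstract character via NFKC."""
    return unicodedata.normalize("NFKC", glyph_form)


if __name__ == "__main__":
    for form in BEH_FORMS:
        base = abstract_character(form)
        print(f"U+{ord(form):04X} {unicodedata.name(form)}"
              f" -> U+{ord(base):04X} {unicodedata.name(base)}")
```

    Four shapes, one character: whichever shape the scanner sees, the plain-text
    result is the same code point.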

    So you may want to look in India for existing work on reading printed
    Devanagari texts with OCR. (Devanagari is difficult to parse too, like Arabic,
    because glyphs are most often joined, which makes it difficult to separate
    letters or letter parts.)
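
    A small illustration of why joined glyphs make segmentation hard: the
    Devanagari conjunct usually drawn as a single shape for "kssa" is encoded
    as three code points, so one glyph on paper has to be turned back into a
    sequence of characters. The names describe and KSSA are made up for this
    example.

```python
import unicodedata


def describe(text: str):
    """Return (code point, character name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# KA + VIRAMA + SSA: typically rendered as one conjunct glyph.
KSSA = "\u0915\u094D\u0937"

if __name__ == "__main__":
    for code_point, name in describe(KSSA):
        print(code_point, name)
```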


