Re: Bangla

Date: Sun May 23 2004 - 23:32:41 CDT

    Somebody (probably Omi Azad) informed me that Microsoft is developing OCR for
    Bangla. I doubt it, as MS is busy with many other things and the Bangla
    market is not so critical to them. Besides, they have not developed
    any OCR for any language, so why would they do it for Bangla?
    However, two universities in Bangladesh, namely BRAC and Jahangirnagar, are
    jointly working on a project financed by a Canadian NGO named Pan Asian Network.
    The project is yet to start. They have OCR on their agenda.

    Quoting Philippe Verdy <>:

    > From: "Towheed Chowdhury" <>
    > > How can Bangla OCR be developed using current Unicode?
    > ISO/IEC 10646 and Unicode are just standards for character encoding, not for
    > their rendering and presentation.
    > OCR is a difficult problem, but it has nothing in common with character
    > encoding, as it is an analysis of glyphs.
    > Generally, good OCR recognition is difficult to automate without specific
    > fonts with simplified or slightly altered (but still readable) glyphs.
    > This is not a problem of Unicode.
    > What Unicode has done is only to add some characters that were used in the
    > OCR context (such as symbols on checks, which were created and printed
    > specially for OCR systems but had no prior meaning in the linguistic and
    > plain-text area): in Unicode these special glyphs are coded as distinct
    > symbols with their own code points.
    > OCR already has difficulty recognizing accents on modern Latin, Greek or
    > Cyrillic letters, and it does not work well with other scripts (it works
    > with unpointed Hebrew, but fails with Arabic due to the complex joining
    > behavior and the very small differences between glyphs in the most widely
    > used typographic variants of the Arabic script).
    > I don't know if there has been any attempt to recognize Devanagari in India.
    > Hiragana and Katakana may work well in OCR, but Japanese texts generally
    > contain lots of Han ideographs that are very difficult to recognize with OCR
    > due to their graphic complexity.
    > Maybe there is OCR working with basic Hangul jamos (written linearly,
    > instead of in syllabic squares).
    > In all these cases, the target encoding when parsing a scanned image of a
    > text is not the issue, as the difficulty is in recognizing abstract
    > characters from many distinct glyph shapes that will always exhibit slight
    > variations when scanned from printed paper.
    > So you may want to check whether any work exists in India on reading printed
    > Devanagari texts with OCR (Devanagari is difficult to parse too, like Arabic,
    > because glyphs are most often joined, which makes it difficult to separate
    > letters or letter parts).
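
As a small illustration of the quoted point that Unicode only assigns code
points to the check-printing symbols and says nothing about how to recognize
them, the following sketch lists the characters of the "Optical Character
Recognition" block using Python's standard unicodedata module. The helper name
ocr_symbols and the two range constants are chosen here for illustration and
are not from the original message.

```python
import unicodedata

# The "Optical Character Recognition" block starts at U+2440; the characters
# currently assigned in it run from U+2440 through U+244A.
OCR_FIRST, OCR_LAST = 0x2440, 0x244A


def ocr_symbols():
    """Return (code point label, character name) pairs for the OCR symbols."""
    return [
        (f"U+{cp:04X}", unicodedata.name(chr(cp)))
        for cp in range(OCR_FIRST, OCR_LAST + 1)
    ]


if __name__ == "__main__":
    for label, name in ocr_symbols():
        print(label, name)
```

The list contains only names and code points: nothing in the encoding describes
glyph shapes, which is why the recognition step has to be solved separately
from the choice of target encoding.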

    This archive was generated by hypermail 2.1.5 : Sun May 23 2004 - 23:43:55 CDT