best character for apostrophe -- limitations of OCR

From: Tom Fruchterman (maverick@raf.com)
Date: Thu Jul 04 1996 - 18:33:00 EDT


Mark Leisher wrote:
>On the other hand, nobody expects OCR software to be smart enough to determine
>the appropriate code for the visually identical glyphs, but these kinds of
>programs can simply default to one consistent codepoint.

   This point may not be of interest to most Unicoders, but I darn well
hope OCR software can determine the code for visually identical glyphs
-- the same way you or I would, in context. An O and a 0 are for
practical purposes the same glyph. If you see one in the middle of a
page

                O

you have no reasonable basis for deciding which character it is. OCR
programs use Markov probabilities and dictionaries with great success in
resolving what is an Oh and what is a zero. Similarly, I can imagine
an OCR program that would look for matching `' pairs to say the ' is a
right quote, or to realize that in xxxx's the ' is probably an
apostrophe.
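
To make the idea concrete, here is a rough Python sketch of both kinds of
contextual guess. It is only an illustration under stated assumptions, not
how any particular OCR product works: counting the unambiguous neighbors in
a token is a crude stand-in for real Markov probabilities and dictionaries,
and the function names resolve_o_zero and classify_quote are hypothetical.

```python
def resolve_o_zero(token: str) -> str:
    """Guess whether O/o/0 shapes in a token are letters or digits.

    Assumption: the unambiguous characters around the shape vote.
    A real system would use n-gram statistics and a dictionary instead.
    """
    digits = sum(c.isdigit() and c != "0" for c in token)
    letters = sum(c.isalpha() and c not in "Oo" for c in token)
    if digits > letters:
        return token.replace("O", "0").replace("o", "0")
    if letters > digits:
        # Match the case of the surrounding letters.
        filler = "o" if any(c.islower() for c in token) else "O"
        return token.replace("0", filler)
    # No usable context (e.g. a lone "O" on the page): leave it alone.
    return token


def classify_quote(text: str, i: int) -> str:
    """Classify the ' at text[i] as 'apostrophe', 'quote', or 'unknown'."""
    if text[i] != "'":
        raise ValueError("position i must hold a ' character")

    def is_apostrophe(k: int) -> bool:
        # Letters on both sides, as in xxxx's.
        return (0 < k < len(text) - 1
                and text[k - 1].isalpha() and text[k + 1].isalpha())

    if is_apostrophe(i):
        return "apostrophe"

    # Count ` marks before position i that have not yet been paired
    # with a non-apostrophe '.
    open_quotes = 0
    for k in range(i):
        if text[k] == "`":
            open_quotes += 1
        elif text[k] == "'" and not is_apostrophe(k) and open_quotes > 0:
            open_quotes -= 1
    return "quote" if open_quotes > 0 else "unknown"


if __name__ == "__main__":
    for tok in ["b00k", "1O0", "O"]:
        print(tok, "->", resolve_o_zero(tok))
    s = "He said `it's fine' today"
    print(classify_quote(s, s.index("'")), classify_quote(s, s.rindex("'")))
```

Note that the lone O comes back unchanged, since there is nothing to go on,
and that a word-final apostrophe (as in dogs') would need more context than
this sketch looks at.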


