Re: Unicode enabled OCR software

From: Kent_Spielmann@sil.org
Date: Tue Jan 31 2006 - 15:49:24 CST


    "Mark E. Shoulson" <mark@kli.org> wrote on 01/31/2006 02:42:32 PM:
    > Ow. OCR to *full* Unicode sounds like it would have a lot of potential
    > problems. Within any given alphabet, you can usually count on letters to
    > look somewhat different from each other, but so many Unicode characters
    > resemble other ones, how is the program to know which to output? An "A"
    > might be Latin, Greek, or Cyrillic, and they'd all look identical (not
    > even "similar").
    I'm not sure I defined the issue well enough. We want to define a custom
    alphabet containing a limited subset of Unicode characters, some of which
    are not in any of the ANSI codepages. Our software does let us define a
    custom alphabet, but it accepts only characters that exist in an ANSI
    codepage.
    Case in point:
    We are scanning Mixtec data with the following letters, which are part of
    the official Mixtec alphabet:
                                                                              
     Letter | Codepoint(s) | Unicode name
     -------+--------------+------------------------------------------
     ɨ      | 0268         | Latin Small Letter I With Stroke
     ɨ̀      | 0268+0300    | Latin Small Letter I With Stroke +
            |              | Combining Grave Accent
     ɨ́      | 0268+0301    | Latin Small Letter I With Stroke +
            |              | Combining Acute Accent
     č      | 010D         | Latin Small Letter C With Caron
     Č      | 010C         | Latin Capital Letter C With Caron
     ž      | 017E         | Latin Small Letter Z With Caron
     Ž      | 017D         | Latin Capital Letter Z With Caron
     ʔ      | 0294         | Latin Letter Glottal Stop
     ⁿ      | 207F         | Superscript Latin Small Letter N
     ʷ      | 02B7         | Modifier Letter Small W
                                                                              

    None of the above codepoints can be output by our reader engine (nor by
    any other engine we know of).
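
    To make the codepage limitation concrete, here is a minimal Python sketch
    that reports which Windows "ANSI" codepages, if any, can encode each
    letter of the custom alphabet. The names MIXTEC_LETTERS, ANSI_CODEPAGES
    and ansi_codepages_for are made up for illustration, and the list covers
    only the cp1250 to cp1258 family as Python's codecs define it; it says
    nothing about what a particular OCR engine accepts.

```python
# Letters from the table above, written as codepoint sequences.
MIXTEC_LETTERS = [
    "\u0268",        # i with stroke
    "\u0268\u0300",  # i with stroke + combining grave
    "\u0268\u0301",  # i with stroke + combining acute
    "\u010D",        # c with caron
    "\u010C",        # C with caron
    "\u017E",        # z with caron
    "\u017D",        # Z with caron
    "\u0294",        # glottal stop
    "\u207F",        # superscript n
    "\u02B7",        # modifier letter small w
]

# Assumption: "ANSI codepage" means the single-byte Windows codepages below.
ANSI_CODEPAGES = [
    "cp1250", "cp1251", "cp1252", "cp1253", "cp1254",
    "cp1255", "cp1256", "cp1257", "cp1258",
]


def ansi_codepages_for(text):
    """Return the codepages that can encode every codepoint in text."""
    pages = []
    for page in ANSI_CODEPAGES:
        try:
            text.encode(page)
        except UnicodeEncodeError:
            continue
        pages.append(page)
    return pages


if __name__ == "__main__":
    for letter in MIXTEC_LETTERS:
        codepoints = "+".join(f"{ord(ch):04X}" for ch in letter)
        pages = ansi_codepages_for(letter)
        print(codepoints, ", ".join(pages) if pages else "no ANSI codepage")
```

    Under these assumptions the caron letters do turn up in the Central
    European codepage (cp1250), while the barred i (with or without an
    accent) and the glottal stop are in none of the codepages listed, so no
    single ANSI codepage covers the whole alphabet.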

    Kent


